Incident report: Metrics Outage May-24 2016

On 24/May/2016 between 13:33 and 14:00 UTC the Wormly Metrics system experienced a serious outage. During the 27 minutes of the outage, no Metrics were accepted by our system and, depending on your integration method, you would likely have seen HTTP connect timeouts whilst attempting to submit Metrics via our API.

As a consequence, no Metrics-sourced Alerts were sent during that period, and no graph data is available for it. Additionally, at the moment service was restored, users who have enabled “dead man’s switch” alerts would likely have received false positives indicating that their hosts were failing to report Metrics to the Wormly service. However, the fault was clearly on our end, not yours.

In the 11 months since we launched Metrics this is the first outage of more than a few seconds that we’ve experienced, and we’re certainly not happy about it. The cause was a routine upgrade of the Percona XtraDB cluster software that powers the Metrics data persistence layer. After the upgrade, the cluster entered a failure state that we had not previously seen in any of our dev, staging or production environments.

We are working with the vendor to better understand the root cause and avoid any future recurrence. We have also updated our operations run-book to document the scenario and what we learned from the (admittedly slow) resolution process.

Please note that our Uptime Monitoring service was unaffected, and all uptime monitoring, reports & alerts continued to operate normally.

We are very sorry for the inconvenience this has caused you. We know and value the high level of trust that you place in us to always be “up” so we can help you do the same.

Filed under: Incidents - Jules @ 01:39 - May 25, 2016