Uptime Monitoring Outage July 17 21:20 – 21:35 UTC

This evening we experienced a partial outage of our Uptime Monitoring network during a 15 minute window which coincided with the outage at Cloudflare (a provider of CDN and reverse proxy services).

This occurred between 21:20 and 21:35 UTC, July 2017 2020.

During this time, uptime tests were not completed for a majority of our users, and accordingly you will notice the missing data period in your uptime reports and performance graphs.

We do sincerely apologise for any inconvenience that today’s outage has caused you.

Our preliminary investigation suggests that an overload condition occurred when the significant proportion of our users who use Cloudflare on their network edge simultaneously experienced failure conditions. This led to a backlog of test confirmations and a high alert load generated by Wormly in response.

Rest assured that when these sorts of failures occur, we learn a great deal and apply that knowledge to improve the quality and reliability of our infrastructure.

We are currently formulating improvements to our cluster consensus protocol to reduce the likelihood of such a “failure spiral” occurring again in future.

Even a few seconds of downtime in a year is too much for us, and we’ll always keep striving to keep it as close to zero as possible!

Filed under: Downtime,Improving Uptime — Jules @ 15:00 - July 18, 2020 :: Comments Off on Uptime Monitoring Outage July 17 21:20 – 21:35 UTC