Partial Uptime Monitoring Outage November 30 06:35 – 14:43 UTC

During this period we experienced a failure of our Uptime Monitoring service, which meant that failures went undetected for some customers during parts of this window; i.e. “false negatives”.

The root cause was relatively simple to uncover: a bug introduced in a code deploy that morning, which triggered failures in neither our CI pipeline nor our monitoring system.

Needless to say, our attention has been focussed on investigating why neither our CI nor our platform monitoring systems revealed this problem sooner; an 8-hour failure window is most certainly not acceptable to either us or our customers.

Rest assured that when these sorts of failures occur – humbling as they are – we learn a great deal and apply that knowledge to improve the quality and reliability of our infrastructure.

Given our track record for exceptional reliability over the past 15+ years, we are very disappointed by this incident, and unreservedly apologise to all customers, both those impacted and those who were not.

Please don’t hesitate to contact support if you have any further queries or concerns.

Filed under: Downtime — Jules @ 09:11 - December 1, 2020

Uptime Monitoring Outage July 17 21:20 – 21:35 UTC

This evening we experienced a partial outage of our Uptime Monitoring network during a 15-minute window which coincided with the outage at Cloudflare (a provider of CDN and reverse proxy services).

This occurred between 21:20 and 21:35 UTC on July 17, 2020.

During this time, uptime tests were not completed for a majority of our users, and accordingly you will notice the missing data period in your uptime reports and performance graphs.

We do sincerely apologise for any inconvenience that today’s outage has caused you.

Our preliminary investigation suggests that an overload condition occurred when the significant proportion of our users who use Cloudflare on their network edge simultaneously experienced failure conditions. This led to a backlog of test confirmations and a high alert load generated by Wormly in response.
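To illustrate why a short burst of simultaneous failures can cause trouble that outlasts the burst itself, here is a toy model of a confirmation backlog. This is only a minimal sketch and not a description of Wormly's actual implementation: the function name `simulate_backlog`, the arrival rates and the per-tick capacity are all hypothetical figures chosen for illustration.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick):
    """Return the number of unconfirmed failures queued after each tick.

    arrivals_per_tick: failures needing confirmation that arrive in each tick
    capacity_per_tick: confirmations the (hypothetical) cluster can complete per tick
    """
    backlog = 0
    history = []
    for arrivals in arrivals_per_tick:
        # Work left over is whatever arrived plus the old backlog,
        # minus what could be processed; it can never go below zero.
        backlog = max(0, backlog + arrivals - capacity_per_tick)
        history.append(backlog)
    return history


# Hypothetical load: 20 failures per tick normally, with a 3-tick spike of 200
# (standing in for many sites failing at once), against a capacity of 50.
arrivals = [20, 20, 200, 200, 200, 20, 20, 20]
history = simulate_backlog(arrivals, capacity_per_tick=50)
print(history)  # [0, 0, 150, 300, 450, 420, 390, 360]
```

In this model the backlog keeps growing for as long as arrivals exceed capacity, and it drains only slowly once the spike has passed, so tests remain delayed well after the original trigger has ended.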

Rest assured that when these sorts of failures occur, we learn a great deal and apply that knowledge to improve the quality and reliability of our infrastructure.

We are currently formulating improvements to our cluster consensus protocol to reduce the likelihood of such a “failure spiral” occurring again in future.

Even a few seconds of downtime in a year is too much for us, and we’ll always strive to keep it as close to zero as possible!

Filed under: Downtime,Improving Uptime — Jules @ 15:00 - July 18, 2020

The Big Guys Fall Hardest: Skype Outage

Skype, an essential communication tool for millions of individuals and businesses worldwide, has been unable to authenticate users during the past 14 hours, rendering the service unusable.

14 hours – and counting. One can scarcely imagine the magnitude of the technical failure that causes such a lengthy outage.

Although Skype offers paid-for, business-critical services including inbound geographic number routing and outbound PSTN dialling, they have long, and wisely, avoided any commitment to deliver emergency call services. And you can understand their reluctance to start now.

This event also highlights the challenge of keeping customers informed; a typical Skype user almost never types www.skype.com into their browser, so how do you get the word out about the outage and status updates?

Luckily (or not), many major media outlets are covering the issue more than adequately.

Fingers crossed for Skype’s engineers that they can effect a resolution soon.

Filed under: Downtime,Servers,Uptime — Jules @ 08:26 - August 17, 2007