False alarms due to DNS resolution failure

Regrettably, we woke up some of our beloved sysadmin users last night, at least those who reside in the Asia/Pacific time zones. Wormly issued a number of false alarms, owing to a recent update to our asynchronous DNS resolution library that was intended to improve the performance of our Uptime Monitoring system.

As it turns out, this library still has some issues when operating under load, which resulted in 3 instances of spurious “Timeout while contacting DNS servers” errors between Tue, 23 Jul 2013 21:15:21 +0000 and Wed, 24 Jul 2013 21:05:00 +0000, at which point the deploy was reverted.

A total of 528 erroneous SMS and phone call alerts were sent as a result. These have been removed from the affected accounts so that users are not charged for them.

We’re acutely aware that false alarms, especially those that wake you up unnecessarily, are an enormous failure on our part, and for this we are deeply sorry. We have already increased the sensitivity of our internal monitoring of these processes so that we can catch anomalies like these faster. We are also improving our integration testing environment, adding tests that might reveal these kinds of problems before deploy.

Please do contact support if you have any other concerns surrounding this incident – we want to make things right for you.

Once again, we’re really sorry to have failed you – and we’re working hard to minimize the risk of breakages like these and others in future.

Filed under: Announcements - Jules @ 19:56 - July 25, 2013