Incident report: Metrics Outage May-24 2016

On 24/May/2016, between 13:33 and 14:00 UTC, the Wormly Metrics system experienced a serious outage. During the 27 minutes of the outage no Metrics were accepted by our system and, depending on your integration method, you would likely have seen HTTP connect timeouts whilst attempting to submit Metrics via our API.

As a consequence, no Metrics-sourced Alerts were sent – nor is graph data available – for that time period. Additionally, at the moment service was restored, users who have enabled “dead man’s switch” alerts would likely have received false positives indicating that their hosts were failing to report Metrics to the Wormly service. However, the fault was clearly on our end, not yours.

In the 11 months since we launched Metrics this is the first outage of more than a few seconds that we’ve experienced, and we’re certainly not happy about it. The cause was a routine upgrade of the Percona XtraDB cluster software that powers the Metrics data persistence layer. After the upgrade, the cluster entered a failure state that we had not previously seen in any of our dev, staging or production environments.

We are working with the vendor to better understand the root cause and avoid any future recurrence. We have also updated our operations run-book detailing the scenario and what we learned from the (admittedly slow) resolution process.

Please note that our Uptime Monitoring service was unaffected, and all uptime monitoring, reports & alerts continued to operate normally.

We are very sorry for the inconvenience this has caused you. We know and value the high level of trust that you place in us to always be “up” so we can help you do the same.

Filed under: Incidents — Jules @ 1:39 am - May 25, 2016 :: Comments Off

Going SPDY with mod_spdy and Amazon Linux / CentOS

At Wormly we’re ardent believers in the maxim that performance is a feature.  Google’s SPDY protocol is an experimental extension to HTTP with the goal of reducing the latency inherent in the HTTP protocol.  Or, rather more simply: “Get those pages delivered to the user faster!”

From our perspective the biggest benefits come from stream multiplexing and header compression.

Multiplexed streams allow the browser to simultaneously request a bunch of resources over a single TCP connection – particularly handy because it avoids the latency penalty of setting up multiple TLS sessions for HTTPS.

Header compression (both request and response headers) helps further reduce the amount of data that must traverse the pipes of the internet.  These headers are probably bigger than you think – take an example request to wormly.com:

That’s 348 bytes just in the request header. And most users’ upstream bandwidth is rather more limited than your web servers’.
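To give a rough feel for why header compression pays off, here is a minimal Python sketch. It is a simplification and not how mod_spdy is implemented: the headers are hypothetical (they are not the wormly.com request above), and real SPDY compresses a binary name/value block with a preset dictionary rather than raw HTTP text. What it does share with SPDY is the idea of one zlib stream kept alive for the whole connection, so headers that repeat from request to request shrink to almost nothing.

```python
import zlib

# Hypothetical request headers, for illustration only.
headers = (
    "GET / HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36\r\n"
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
    "Accept-Encoding: gzip,deflate,sdch\r\n"
    "Accept-Language: en-US,en;q=0.8\r\n"
    "Cookie: session=0123456789abcdef0123456789abcdef\r\n"
    "\r\n"
).encode("ascii")


def compress_on_shared_stream(blocks):
    """Compress each block on a single zlib stream, flushing after each one.

    Returns the compressed size of every block. Because the stream is shared,
    later blocks can refer back to bytes seen in earlier ones.
    """
    comp = zlib.compressobj()
    decomp = zlib.decompressobj()
    sizes = []
    for block in blocks:
        out = comp.compress(block) + comp.flush(zlib.Z_SYNC_FLUSH)
        # Round-trip check: the receiver can decode each block as it arrives.
        assert decomp.decompress(out) == block
        sizes.append(len(out))
    return sizes


# Send the same headers twice, as a browser would on consecutive requests.
first, second = compress_on_shared_stream([headers, headers])
print("raw bytes:", len(headers))
print("first request, compressed:", first)
print("second request, compressed:", second)
```

The second request is the interesting one: since nearly every header is identical to the previous request, it compresses to a small fraction of its raw size.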

SPDY is currently supported by Firefox and Chrome – on Windows, Linux, Mac and Android platforms.  As it happens more than 70% of our users are on Chrome, so the benefits of adopting SPDY will be seen widely.

Enter mod_spdy

Google has released a very simple to use Apache extension to add SPDY support to your web server. We use Apache to serve www.wormly.com and we were also using mod_php to serve the parts of our web application which are written in PHP.

This actually presents a problem with mod_spdy, because its implementation is threaded, and PHP is not guaranteed to be thread safe.  The recommended approach is to replace mod_php with a FastCGI interface to serve any PHP requests from a dedicated pool of PHP processes – leaving the Apache processes free of any trace of the PHP interpreter.

We actually intended to switch to this sort of setup some time ago – and to switch from Apache to Nginx in the process.  However a still-unresolved limitation in Nginx whereby all FastCGI requests are buffered – meaning PHP cannot ensure that data is immediately flushed to the client – has so far prevented us from doing so.

Happily the availability of mod_spdy pushed us to find a solution; and ultimately we found it in mod_fastcgi, which does support unbuffered connections to the upstream PHP server.

Switching to FastCGI also allowed us to switch Apache to the Worker MPM rather than the Prefork MPM – and the benefits of these two changes significantly reduced the volatility of response times through our web application.  So we had a win even before deploying SPDY!

After this, installing mod_spdy is as simple as:

Restart Apache, and you’re done.

Note that Google’s included package management handles automatic updates of the mod_spdy package.

It should be noted that mod_fastcgi is not commonly included in CentOS / Red Hat Enterprise Linux distributions and their derivatives (including Amazon Linux).  You can either compile it from source or use a 3rd party repository which does include it: for example RPMForge, DAG or City-Fan.

Filed under: Meta,Servers,SSL — Jules @ 3:44 am - August 24, 2013 :: Comments Off

False alarms due to DNS resolution failure

Regrettably we woke up some of our beloved sysadmin users last night – those who reside in the Asia/Pacific time zones at least. A number of false alarms were issued by Wormly – owing to a recent update to our asynchronous DNS resolution intended to improve the performance of our Uptime Monitoring system.

As it turns out there are still some issues with this library when operating under load, and this resulted in 3 instances of spurious “Timeout while contacting DNS servers” errors between Tue, 23 Jul 2013 21:15:21 +0000 and Wed, 24 Jul 2013 21:05:00 +0000. The deploy was reverted at that point.

A total of 528 erroneous SMS & Phone Call alerts were sent as a result.  These have been removed from the affected accounts to ensure that users are not charged for these messages.

We’re acutely aware that false alarms – especially those that unnecessarily wake you up – are an enormous failure on our part, and for this we are deeply sorry. We have already increased the sensitivity of our internal monitoring of these processes so that we can catch anomalies like these faster.  We are also working to improve our integration testing environment to add tests which might reveal these kinds of problems pre-deploy.

Please do contact support if you have any other concerns surrounding this incident – we want to make things right for you.

Once again, we’re really sorry to have failed you – and we’re working hard to minimize the risk of breakages like these and others in future.

Filed under: Announcements — Jules @ 7:56 pm - July 25, 2013 :: Comments Off

Hit “/” for Global Search and Go Faster!

TL;DR:

Hit the “/” key and start typing part of a host name, URL, email or phone number, then hit “ENTER”. Or “TAB, g” to see graphs.

You might not know it, but Wormly has many keyboard shortcuts available: (Hit ? to see them on any page within the app)

Keyboard legend

Those of you who prefer to avoid the mouse or trackpad have probably already discovered these.  Today, however, we’ve shipped a major update to the keyboard UI, notably around the search facility.

Previously you could hit “j, j” to bring up a simple jump-to-host list.  Whilst that shortcut still works, we’ve switched to “/” as the trigger because the forward slash has become the standard way to invoke site-searches around the web.

And our search facility now also searches your Sensors and Alert Contacts:

We’ve also added some extra hotkeys which allow you to jump directly to a host’s Graphs and Uptime Reports. When you’ve selected the host of interest – either with an exact search term or the arrow keys – you can hit TAB, g to jump straight to that host’s Graphs, or TAB, u for its Uptime Reports:

Note that all of these hotkeys and behaviours are also found in the search box on the My Hosts page. No need to hit slash there, since the search box has focus on load. Happy searching, power users!

A couple more shortcuts we added

  • When you’re editing a monitoring sensor, hit ENTER to run an Instant Test, and CONTROL-ENTER to save your changes.
  • When you’re on any page belonging to a host (e.g. graphs, uptime reports, editing sensors, configuring alert groups, etc), simply hit “h” to jump to the Host Overview page.  You will find this hotkey – and others – documented when you hit “?” to view the Keyboard Shortcut legend.

 

Filed under: Announcements,Features — Jules @ 2:34 am - April 26, 2013 :: Comments Off

See who changed what / when with Activity Log

We’ve just shipped a feature much requested by many of our larger customers: every change made to your monitoring configuration is now logged.

In addition to a brief explanation of the event, the timestamp and responsible user are logged.

Users can also add notes to each activity explaining why the change was made.

You’ll find Activity Log linked from My Account.

 

Filed under: Announcements,Features — Jules @ 7:54 pm - April 3, 2013 :: Comments Off

Monitor SSL Certificate Expiration

[Also see our FAQ entry on SSL Certificate Monitoring here.]

An outage caused by an expired SSL certificate is the last thing your DevOps team want their inboxes filled with.

Although, happily, the odds are pretty good that they’ll already be awake when it happens, given that certificates generally expire at the same time of day at which they were signed. We presume they weren’t purchasing SSL Certs in their sleep.

Still – much better to ensure that the right people are alerted before expiration.  And now Wormly can help you with this small, but critical, task.

You will find a new parameter available in our HTTP Sensor – simply specify the minimum number of days of validity that a certificate must have remaining.  Alerts will be generated if the remaining validity falls below that threshold.


Note that all certificates presented by your server (i.e. the complete certificate chain) will have their expiration dates checked. So a soon-to-expire intermediary certificate won’t go unnoticed, either.
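For the curious, the threshold logic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Wormly’s actual implementation: `days_remaining`, `expiring_certs` and the sample chain are hypothetical names and data, and the expiry strings use the `notAfter` format that Python’s `ssl` module reports for a certificate.

```python
import ssl
from datetime import datetime, timezone


def days_remaining(not_after, now):
    """Days between `now` and a certificate's notAfter timestamp."""
    expiry = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    return (expiry - now.timestamp()) / 86400


def expiring_certs(chain, min_days, now):
    """Return labels of certificates with fewer than `min_days` of validity left.

    `chain` is a list of (label, notAfter) pairs covering every certificate
    the server presents, not just the leaf.
    """
    return [
        label
        for label, not_after in chain
        if days_remaining(not_after, now) < min_days
    ]


# Hypothetical chain: the leaf is fine, but the intermediate is about to expire.
chain = [
    ("leaf", "Dec 31 12:00:00 2013 GMT"),
    ("intermediate", "Feb 17 12:00:00 2013 GMT"),
]
now = datetime(2013, 2, 7, 12, 0, 0, tzinfo=timezone.utc)

print(expiring_certs(chain, 14, now))
```

With a 14 day threshold only the intermediate certificate is flagged, which is exactly the case that a leaf-only check would miss.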

Filed under: Announcements,Features — Jules @ 11:08 pm - February 7, 2013 :: Comments Off

High Definition Monitoring – Test every 5 seconds

Today we announce Clarity, our new test platform. Clarity offers drastically lower test intervals, right down to 5 seconds – a level that none of our competitors can match.

With Clarity, you can be confident that even the most isolated of failures will be detected and reported. Mission critical systems can now get the level of attention they deserve.

Clarity represents a significant evolution of our core test product, and further improves the already impressive fault tolerance of our distributed monitoring system.

The improved performance of Clarity has allowed us to double the test frequency for all customers on our current plans, to one test every 30 seconds, at no extra cost.

Gold, Platinum and Enterprise plans now include High Definition monitoring sensors, at no extra cost.

As always, you can view full pricing details from within your account, or choose from our other plans to find one that suits your needs.

Filed under: Announcements,Features — Jules @ 6:11 pm - October 2, 2012 :: Comments Off

Manage your on-call schedule with new Alert Groups!

Have you worked out which members of your devops team will be on-call this Christmas? Don’t worry if you haven’t, because we’ve just made it much easier to manage who gets alerted when things go wrong.

You can now create multiple Alert Groups – allowing you to assign groups of people to receive alerts for different hosts, and to swap between them easily.

Each of your existing Host-Specific alert settings has been migrated to a new Alert Group, and your default alert recipients are now your Default Alert Group.

We hope you enjoy this much requested feature, and would love to hear your feedback on our implementation.

This deployment did involve significant changes to our alert system, so do let us know if you spot anything that doesn’t look right.

Merry Christmas!

Filed under: Announcements,Features — Jules @ 9:01 am - December 14, 2011 :: Comments Off

Shortcut keys, Global Alert Mute & Auto-resume

We’ve pushed out a major new release today, featuring:

Keyboard shortcuts so you can monitor your apps like a boss! Hit “?” on any page to view the legend:

Keyboard shortcuts

Automatic resumption of paused hosts & global alert mute.

auto-resume

Filed under: Announcements,Features — Jules @ 9:26 am - December 2, 2011 :: Comments Off

Auto-Resume & Global Alert Mute

Need to temporarily stop monitoring a host, but don’t want to forget to re-enable it?

Now you can!  When you hit the pause button next to a host on the My Hosts page, you will be prompted to specify a time interval after which monitoring will automatically resume.

Pause monitoring host & auto resume

Plus, you can now mute all alerts – great for when your data center has been struck by an earthquake and your phone is ringing off the hook as Wormly yells at the poor sysadmins.

You will find Global Alert Mute on the Alerts page, and it’s offered with the same Auto-Resume functionality described above, so you won’t arrive on Monday morning only to discover that Joe left all of the alerts muted over the weekend when the hurricane arrived.

Global alert mute

Filed under: Announcements,Features — Jules @ 9:23 am - :: Comments Off

Never Offline

A blog hosted by James Peterson, director of insights @ Wormly

On a semi-regular basis James will be trying to demonstrate that website infrastructure really is an exciting topic, and that your users really do care about the uptime & speed of your website.