Today, Amazon’s US-EAST region suffered a network failure, resulting in at least the us-east-1a availability zone becoming unreachable from the internet.
This, somewhat predictably, took many of the web’s biggest names off the air with it.
Amazon has repeatedly reminded users of their cloud infrastructure that they need to do the hard work in order to take advantage of Amazon’s famed redundancy. Your application needs to sit across multiple availability zones rather than just being deployed at run-time to an arbitrary, solitary zone.
This is, of course, rather more difficult than it sounds. For Wormly it means keeping a hot standby running in an alternative zone. Today when our primary web cluster in the us-east-1a zone failed, our system detected this and automatically failed over to the hot replica which was running in the us-east-1b zone.
Our total website downtime was kept to under 2 minutes, and our globally distributed monitoring system (which does not run on EC2, and is designed to tolerate failures of the web cluster) experienced no downtime at all.
Having to keep a hot replica running in (what is hopefully) a separate data center certainly isn’t the most cost-effective approach – and it’s presumably not what DevOps teams were hoping to gain from the cloud revolution.
But for us the extra cost is worth it, not least for the fact that when major failures like this occur, our customers appreciate being able to log into Wormly and switch off the alarms which are ringing their phones off the hook.
Availability zones not sufficiently isolated?
Immediately after a failover, our system attempts to bring up a new hot replica in another availability zone, to ensure that we continue to maintain n+1 cluster redundancy.
During the 30-odd minutes that US-EAST-1A remained offline, we noted that our system was unable to successfully bring any new EC2 instances up, in any US-EAST zone. This was quite an alarming revelation, and we await Amazon’s response on why this occurred.
Availability zones are supposed to be sufficiently isolated to ensure that failures do not cross zone boundaries. Our inability to bring new instances online meant that we were heavily exposed to further failures; there were no further hot replicas available for us to fail over to. Luckily US-EAST-1B didn’t experience any failures during this window.
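To make the replacement step concrete, here is a minimal sketch in Python of trying each remaining zone in turn. It is an illustration only, not a description of any real provisioning API: the `launch` callable and the `LaunchError` exception are hypothetical stand-ins for whatever calls you use to start an instance.

```python
from typing import Callable, Iterable, Optional


class LaunchError(Exception):
    """Hypothetical error raised when an instance cannot be started in a zone."""


def launch_replacement(
    zones: Iterable[str],
    failed_zone: str,
    active_zone: str,
    launch: Callable[[str], str],
) -> Optional[str]:
    """Try each remaining zone in turn and return the new instance ID.

    Returns None if no zone could start an instance, which leaves the
    cluster without a hot replica.
    """
    for zone in zones:
        if zone in (failed_zone, active_zone):
            continue  # skip the zone that failed and the one now serving traffic
        try:
            return launch(zone)
        except LaunchError:
            continue  # this zone refused the launch; try the next one
    return None
```

A `None` result is worth treating as an alarm condition in its own right, since it means the next failure has nowhere to fail over to.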
Who else do you rely on?
Another unfortunate side-effect of today’s outage was that it brought down Twilio’s API. We use Twilio to send SMS messages to our US customers, because they offer excellent speed and deliverability to US cell phones. We do have alternate routes in place, so during this outage we were able to automatically route around the problem.
This highlights the importance of having redundancy built into your choice of web services when building applications in the cloud.
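As a rough sketch of what routing around a failed provider can look like, the Python below tries a list of delivery routes in order and stops at the first one that succeeds. The `Route` type and the `send` callables are hypothetical; they stand in for whatever client code talks to each SMS provider, and are assumed to raise an exception on failure.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each route is a (name, send) pair; send(number, text) raises on failure.
Route = Tuple[str, Callable[[str, str], None]]


def send_with_fallback(
    routes: Sequence[Route], number: str, text: str
) -> Optional[str]:
    """Attempt delivery over each route in order.

    Returns the name of the route that accepted the message, or None if
    every route failed.
    """
    for name, send in routes:
        try:
            send(number, text)
            return name
        except Exception:
            continue  # provider unreachable or rejected the message; try the next
    return None
```

Ordering the list by preference keeps the fastest route as the default while still leaving a path open when it goes down.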
We note that RightScale’s website was unavailable during the outage. Given that many major sites use RightScale to manage their EC2 deployments, this presumably caused quite some consternation among their users.
The cloud is fragile, be prepared!
We’re delighted to announce two great new features today:
The Wormly SSL Testing Tool
Our newly deployed SSL Tester runs 60+ tests against your secure web server – in real time – reporting crucial compatibility, security and performance factors:
You can quickly share this information with your team using the permalink generated by each test. You will find this tool listed in our free tools section, or access it directly from the footer of any page site-wide.
Once-Off Downtime Scheduling
We’ve implemented a small, but often requested feature allowing one-off periods of host downtime to be scheduled:
To configure these periods, select Scheduled Downtime from any Host Overview page.
You can also now schedule a once-off Do Not Disturb period for your alert contacts via a similar interface.
Welcome to the new-look Wormly! A few things in the main navigation bar have changed, so let’s start with those:
- Settings has been renamed to Alerts, and it now contains only things that relate to your alerts.
- Billing has been replaced with My account, which now provides easy access to everything related to managing your account, including profile updates, permissions, billing, invoices and payments.
- The My Wormly page has been removed – everything we had there can be more intuitively accessed from elsewhere.
- The Log out button turned out to be a rarely used one, and as such it has been moved to the My account page. Oh and we’ve re-branded it Sign out, in keeping with the style of most web apps today.
- Support has been renamed Help, because it includes both our customer support contact form as well as our searchable FAQ section.
- Blog has been removed from the top nav bar; you can now find it in our new “fat-footer”, which contains direct links to many parts of the site that power users will appreciate.
New features and tweaks we hope you’ll enjoy
- Always-on-SSL. Every part of our site is now available only via https to ensure the integrity of your account. We’ve spent lots of time tweaking cache directives, spriting icons and even tuning SSL ciphers to maximize our performance over https. We trust that you’ll find the small performance penalty worthwhile.
- iPad-friendly. We’ve deployed some minor CSS tweaks to allow better use of Wormly from iPad, particularly when in portrait orientation.
- Bigger, better buttons. Accounting for the inexorable move of the web toward touchscreen devices, we’ve made most clickable items throughout the site both a) bigger and b) more obviously clickable.
- When configuring an HTTP sensor, hit Test settings. If a content match fails, you will be given the option to view the full HTTP response that we received from your server.
- From the host overview page, you can now rename a host by hovering your mouse over the host’s name.
- We’ve implemented a workaround for this horrible WebKit bug to improve page load speed.
- In the HTTP sensor, POST data is now configured via a multi-line text area.
Please do let us know if we’ve broken anything for you, or if you’re confused or otherwise unhappy with any of these changes.
Alternatively, if you just love what we’ve done, please let us know anyway!
You might be familiar with this situation:
The firewall sitting in front of your server farm fails, taking your public web site, extranet, API server and mail server down with it.
You’re a proactive sysadmin, so you’ve set up Wormly monitoring for all of these servers with alerts configured to hit your cell phone at the first sign of trouble.
As a result, you’re getting streams of alerts, all of which are telling you pretty much the same thing: Fix that firewall! All of those beeping text messages might help wake you up when this (inevitably) happens at 4am, but for that purpose our phone call alerts are a much better bet.
We’re pleased to announce a new feature designed to prevent such a flood of alerts. You can now configure Host Dependencies to define what other hosts a given host depends on.
From a host’s overview page, click the Host Dependencies link to get started.
Otherwise, just remember one important tip: Ensure that the shortest downtime triggering alerts on a host is equal to or greater than the testing interval of all hosts it depends on.
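The reasoning behind that tip, presumably, is that a dependent host should not start alerting before the host it depends on has even been tested. The Python sketch below checks a configuration against the rule. The three dictionaries are a hypothetical way of representing your setup (all values in seconds); they are not a Wormly data format.

```python
from typing import Dict, List


def hosts_violating_tip(
    alert_threshold: Dict[str, int],
    test_interval: Dict[str, int],
    depends_on: Dict[str, List[str]],
) -> List[str]:
    """Return hosts whose shortest alert-triggering downtime is shorter
    than the testing interval of any host they depend on."""
    violations = []
    for host, parents in depends_on.items():
        # The slowest-tested dependency sets the minimum safe threshold.
        slowest_parent = max((test_interval[p] for p in parents), default=0)
        if alert_threshold[host] < slowest_parent:
            violations.append(host)
    return violations
```

For example, if a firewall is tested every 300 seconds, a web server behind it that alerts after 60 seconds of downtime would be flagged, while a mail server that alerts after 300 seconds would not.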
We’re delighted to announce the immediate availability of some exciting new – and much requested – features. The highlights include:
HTTP POST requests.
This enables monitoring of all kinds of interactions including login pages, registration forms, web services & APIs among many others.
Custom request headers.
You can now override any HTTP request header sent by our monitoring system. For example, you might override the Host: header in order to monitor a content distribution network / proxy. Or you might like to set the Referer: header to monitor a search optimized landing page.
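If you want to reproduce such a check by hand before configuring a sensor, a minimal Python sketch using only the standard library might look like the following. It builds a POST request with overridden headers but does not send it. The function name, URL, form fields and header values are made up for illustration.

```python
import urllib.parse
import urllib.request


def build_check_request(url, form_fields=None, headers=None):
    """Build a GET request, or a POST request when form_fields is given."""
    data = None
    if form_fields is not None:
        # Form-encode the fields; supplying a body makes urllib use POST.
        data = urllib.parse.urlencode(form_fields).encode("ascii")
    return urllib.request.Request(url, data=data, headers=headers or {})


# Example: POST a login form to a proxy address while overriding Host and Referer.
request = build_check_request(
    "http://192.0.2.10/login",
    form_fields={"user": "demo"},
    headers={"Host": "www.example.com", "Referer": "https://search.example/"},
)
```

Passing the result to `urllib.request.urlopen` would perform the actual request.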
Extended content match.
Allows you to match wanted and unwanted content strings in HTTP response headers. For example, you might monitor the correct functioning of a Location: […] redirect by verifying that the correct response headers were sent by your application.
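The idea of matching wanted and unwanted strings against response headers can be sketched in a few lines of Python. This is only an illustration of the concept, with a hypothetical function name and a plain substring match; it makes no claim about how our sensor implements matching.

```python
from typing import Iterable, Mapping


def headers_match(
    headers: Mapping[str, str],
    wanted: Iterable[str] = (),
    unwanted: Iterable[str] = (),
) -> bool:
    """True if every wanted string appears in the headers and no unwanted one does."""
    # Flatten to "Name: value" lines so a match can span the name and the value.
    blob = "\n".join(f"{name}: {value}" for name, value in headers.items())
    return all(w in blob for w in wanted) and not any(u in blob for u in unwanted)
```

Checking a redirect then amounts to passing the expected `Location: ...` line as a wanted string.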
In addition to our HTTP sensor enhancements, we’ve added support for encrypted SSL and TLS connections for monitoring SMTP and IMAP servers, and have also enabled encryption for generic TCP Request sensors.
This will enable you to monitor a whole range of secured TCP services that we were not previously able to establish a connection with.
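For a sense of what an encrypted TCP check involves, here is a minimal Python sketch that opens a TCP connection and completes a TLS handshake. It assumes the service speaks TLS from the first byte (as on the conventional IMAPS or SMTPS ports) rather than upgrading via STARTTLS, and the function name is hypothetical.

```python
import socket
import ssl


def tls_service_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection and TLS handshake with host:port both succeed."""
    context = ssl.create_default_context()  # verifies the certificate and hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except OSError:
        # Covers refused connections, timeouts and TLS errors
        # (ssl.SSLError is a subclass of OSError).
        return False
```

Because the default context verifies certificates, an expired or mismatched certificate is reported as a failure, just like an unreachable port.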
All of these new features are visible when you add new sensors or modify existing sensors. Naturally no extra costs apply to these features.
We’ve also rolled out a number of usability and performance improvements, with particular focus on the monitoring sensor creation and management process. We hope you’ll find using these new features a breeze!
Naturally we’d love to hear your feedback on these features – and indeed all aspects of our service. Drop us a line via the support desk and let us know how we’re doing.
Keep your eye out for loads more new functionality in the new year!
Exciting times abound as we prepare to move our data processing and web serving infrastructure into Amazon’s EC2 Cloud. This move has become increasingly necessary with the substantial growth Wormly has seen in the past 18 months. Moving to the cloud allows us to offer a substantially more robust, higher availability system.
We will be performing the cut-over on Sunday October 17 at 11:30am (GMT+1100). We expect wormly.com will be unavailable for three hours during this period, which means you will not be able to modify monitoring configuration or view reports during this time.
Rest assured, though, that both uptime and server health monitoring will continue unaffected during this time. All data samples will be accrued and all alerts will be sent as normal.
We appreciate your patience during this time, and trust that the extra scalability and reliability that this move brings will be worth it.
If you have any queries at all, please don’t hesitate to contact our support desk.
The IP address used for health monitoring (inbounddata.wormly.com) will change from 184.108.40.206 to 220.127.116.11. This address will also become the source IP requesting the agent for Linux health monitoring installations.
The uptime monitoring node IPs will remain the same – an up to date list of monitoring node IPs can always be found here.
These can and will change over time, however, so we do not recommend creating IP-specific firewall rules for use with our service.
It’s been in private beta for quite a while now, so we’re very pleased to announce the immediate availability of the Wormly Developer API. Or ‘WAPI’ for everyone who loves an acronym.
Head over to:
And check it out. Or just click the “Developer API” link that you will now find at the bottom of every page.
We’re still in the early stages of WAPI’s development, and consequently the API coverage of Wormly’s functionality is by no means complete. With your feedback and suggestions, though, we will be increasing API coverage as quickly as possible.
So drop us a line if we’re missing something crucial to you!
Primarily to assist those who on-charge the cost of notifications (SMS, phone calls, etc) to their customers, the alert log browser now includes the name and numeric ID of the host that triggered the alert.
These records can now be exported in .CSV format in addition to being viewed on screen.
The alert log browser can be found under Settings, Alert Recipients, View Alert History (toward the bottom of the page).
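If you on-charge alerts per host, a few lines of Python can tally an exported file. This is a sketch only: the `host_id` column name is an assumption, so check the header row of your own export and pass the real column name.

```python
import csv
from collections import Counter
from typing import IO


def alerts_per_host(csv_file: IO[str], id_column: str = "host_id") -> Counter:
    """Count rows in an alert log export, grouped by the host ID column."""
    return Counter(row[id_column] for row in csv.DictReader(csv_file))
```

Open the export with `open(path, newline="")` and pass the file object in; the resulting counts can then be multiplied by your per-alert rate.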
Due to (somewhat surprising) popular demand, we have implemented a system allowing customers to make pre-payments of any amount to their Wormly account.
The pre-payment is added to your account balance, and used to pay subsequent invoices automatically as they fall due.
Additionally, the process includes the creation of a print friendly invoice. This can be handy if your accounts department needs to see some paper before they hand over the cash, figuratively speaking.
The pre-payment invoice can then be settled via Visa, MasterCard or PayPal.