We’re pleased to announce the immediate availability of Wormly Alert delivery to Discord chat rooms and Microsoft Teams channels:
During this period we experienced a failure of our Uptime Monitoring service which resulted in outages going undetected for some customers – i.e. “false negatives”.
The root cause was relatively simple to uncover: a bug introduced in a code deploy that morning which triggered failures in neither our CI pipeline nor our monitoring system.
Needless to say, our attention has been focussed on investigating why both our CI and platform monitoring systems did not reveal this problem sooner; an 8-hour failure window is most certainly not acceptable to us or to our customers.
Rest assured that when these sorts of failures occur – humbling as they are – we learn a great deal and apply that knowledge to improve the quality and reliability of our infrastructure.
Given our track record for exceptional reliability over the past 15+ years, we are very disappointed by this incident, and unreservedly apologise to all customers, both those impacted and those who were not.
Please don’t hesitate to contact support if you have any further queries or concerns.
This evening we experienced a partial outage of our Uptime Monitoring network during a 15-minute window which coincided with the outage at Cloudflare (a provider of CDN and reverse proxy services).
This occurred between 21:20 and 21:35 UTC on 17 July 2020.
During this time, uptime tests were not completed for a majority of our users, and accordingly you will notice the missing data period in your uptime reports and performance graphs.
We do sincerely apologise for any inconvenience that today’s outage has caused you.
Our preliminary investigation suggests that an overload condition occurred when the significant proportion of our users who use Cloudflare on their network edge simultaneously experienced failure conditions. This led to a backlog of test confirmations and a high alert load generated by Wormly in response.
Rest assured that when these sorts of failures occur, we learn a great deal and apply that knowledge to improve the quality and reliability of our infrastructure.
We are currently formulating improvements to our cluster consensus protocol to reduce the likelihood of such a “failure spiral” occurring again in future.
Even a few seconds of downtime in a year is too much for us, and we’ll always keep striving to keep it as close to zero as possible!
We’re pleased to announce the immediate availability of HTTP/S monitoring of web servers over the IPv6 network.
By default all tests are conducted over IPv4. To enable IPv6 you need to set the IP Mode to IPv6 only in your sensor configuration as shown below:
This will cause the IPv6 address to be resolved from the host’s AAAA record. If required, you can override this IPv6 address by entering the desired address into the Force IP Address parameter field.
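If you’d like to sanity-check this behaviour outside the dashboard, standard command-line tools can reproduce both steps. A sketch, assuming `www.example.com` stands in for your monitored host:

```shell
# Look up the AAAA record – the same lookup the sensor performs
dig +short AAAA www.example.com

# Force an IPv6-only HTTP/S request, mirroring the "IPv6 only" IP Mode,
# and print the address curl actually connected to
curl -6 -sS -o /dev/null -w '%{remote_ip}\n' https://www.example.com/
```

If the `curl -6` call fails while a plain `curl` succeeds, the host is not reachable over IPv6 and the sensor would report the same failure.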
Realtime HTTP Test Tool
Under the hood, this uses the same HTTP client implementation as the rest of the Wormly monitoring network, so it’s a useful way to perform manual tests when choosing your automated monitoring strategy within the Wormly dashboard.
You can also share the results of each test conducted simply by sharing the generated URL – a great way to work with your team to resolve issues together. Naturally you can use it to perform IPv6 tests, too!
If you have multiple web servers sitting behind a load balancer or within a Content Delivery Network (CDN), you can now monitor them individually so long as they are reachable via a public IP address.
The Wormly HTTP sensor parameter Force IP Address allows you to target a specific web server whilst still presenting the correct hostname via SNI (Server Name Indication). Simply enter the IP address of the web server you wish to target in that field:
This new feature is handy for monitoring a specific server in a CDN – or a single server within your load balancing cluster. Naturally it also works with regular HTTP requests.
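Outside of Wormly, the same technique can be exercised with curl’s `--resolve` option, which pins the connection to a chosen origin IP while still sending the real hostname in both the Host header and the TLS SNI field. The hostname and the TEST-NET address below are placeholders:

```shell
# Connect to origin 203.0.113.10, but present www.example.com in the
# SNI field and Host header – the same effect as Force IP Address
curl --resolve www.example.com:443:203.0.113.10 https://www.example.com/
```

Because the certificate check still runs against `www.example.com`, this also verifies that each individual origin server is serving a valid certificate for your site.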
We’ve just deployed an incremental improvement to the web & mobile UI: Test response inspection now includes code highlighting for HTTP, JSON, JS, HTML/XML:
At ~01:00 UTC on Sunday May 20th the database cluster which underpins the Wormly Metrics service suffered a partial outage.
This caused receipt of Metrics to fail for all customers, as well as a secondary effect which in some cases saw incorrect alerts being sent. If you received spurious alert messages during this time, please let us know and we will ensure the message credits are refunded to your account.
The duration of the outage was, unfortunately, lengthy. Service was restored at 04:20, with some short periods of instability over the following 3 hours.
The Uptime Monitoring service was not affected by this outage.
Our post-mortem is currently underway, so we don’t yet have a firm idea of the cause, nor of the possible mitigations for the future. We will keep you updated.
We’re in the uptime business, and really regret any downtime whatsoever. So on behalf of Wormly I apologise for this incident and promise that we will continue to do our best to keep improving. The availability of the Metrics service has exceeded 99.9% in the past 24 months, but I’m sure we can do better.
On 24/May/2016 between 13:33 and 14:00 UTC the Wormly Metrics system experienced a serious outage. During the 27 minutes of the outage, no Metrics were accepted by our system; and depending on your integration methods you would likely have seen HTTP connect timeouts whilst attempting to submit Metrics via our API.
As a consequence no Metrics-sourced Alerts were sent – nor is graph data available – during that time period. Additionally, at the moment service was restored, users who have enabled “dead man’s switch” alerts would likely have received false positives indicating that their hosts were failing to report Metrics to the Wormly service. However the fault was clearly on our end, not yours.
In the 11 months since we launched Metrics this is the first outage of more than a few seconds that we’ve experienced, and we’re certainly not happy about it. The cause was a routine upgrade of the Percona XtraDB cluster software that powers the Metrics data persistence layer. After the upgrade, it entered a failure state that we had not previously seen in any of our dev, staging or production environments.
We are working with the vendor to better understand the root cause and avoid any future recurrence. We have also updated our operations run-book detailing the scenario and what we learned from the (admittedly slow) resolution process.
Please note that our Uptime Monitoring service was unaffected, and all uptime monitoring, reports & alerts continued to operate normally.
We are very sorry for the inconvenience this has caused you. We know and value the high level of trust that you place in us to always be “up” so we can help you do the same.
At Wormly we’re ardent believers in the maxim that performance is a feature. Google’s SPDY protocol is an experimental extension to HTTP with the goal of reducing the latency inherent in the HTTP protocol. Or, rather more simply: “Get those pages delivered to the user faster!”
From our perspective the biggest benefits come from stream multiplexing and header compression.
Multiplexed streams allow the browser to simultaneously request a bunch of resources over a single TCP connection – particularly handy because it avoids the latency penalty of setting up multiple TLS sessions for HTTPS.
Header compression (both request and response headers) helps further reduce the amount of data that must traverse the pipes of the internet. These headers are probably bigger than you think – take an example request to wormly.com:
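The original post displayed a captured request at this point. Here is a hypothetical browser request of the same shape, with its size counted – the headers are typical but illustrative, so the byte count will differ somewhat from the captured example:

```shell
# Build an illustrative GET request to wormly.com and count its bytes.
# The header set below is hypothetical; a real browser's request
# will vary with its User-Agent, cookies and accepted encodings.
printf 'GET / HTTP/1.1\r\nHost: www.wormly.com\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: en-US,en;q=0.5\r\nConnection: keep-alive\r\n\r\n' | wc -c
```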
That’s 348 bytes just in the request header. And most users’ upstream bandwidth is rather more limited than your web servers’.
SPDY is currently supported by Firefox and Chrome – on Windows, Linux, Mac and Android platforms. As it happens more than 70% of our users are on Chrome, so the benefits of adopting SPDY will be seen widely.
Google has released a very simple-to-use Apache module to add SPDY support to your web server. We use Apache to serve www.wormly.com, and we were also using mod_php to serve the parts of our web application which are written in PHP.
This actually presents a problem with mod_spdy, because its implementation is threaded, and PHP is not guaranteed to be thread-safe. The recommended approach is to replace mod_php with a FastCGI interface which serves PHP requests from a dedicated pool of PHP processes – leaving the Apache processes free of any trace of the PHP interpreter.
We had actually intended to switch to this sort of setup some time ago – and to switch from Apache to Nginx in the process. However a still-unresolved limitation in Nginx – whereby all FastCGI responses are buffered, meaning PHP cannot ensure that data is immediately flushed to the client – has so far prevented us from doing so.
Happily the availability of mod_spdy pushed us to find a solution; and ultimately we found it in mod_fastcgi, which does support unbuffered connections to the upstream PHP server.
Switching to FastCGI also allowed us to switch Apache to the Worker MPM rather than the Prefork MPM – and the benefits of these two changes significantly reduced the volatility of response times through our web application. So we had a win even before deploying SPDY!
After this, installing mod_spdy is as simple as:
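On a 64-bit Debian/Ubuntu system, that means downloading Google’s package and installing it. The package URL below is reproduced from memory as an illustration – check Google’s mod_spdy download page for the current one:

```shell
# Fetch Google's mod_spdy package (64-bit Debian/Ubuntu build) –
# URL shown is an assumption; verify against the official download page
wget https://dl-ssl.google.com/dl/linux/direct/mod-spdy-beta_current_amd64.deb

# Install it, then pull in any missing dependencies
sudo dpkg -i mod-spdy-beta_current_amd64.deb
sudo apt-get -f install
```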
Restart Apache, and you’re done.
Note that Google’s included package management handles automatic updates of the mod_spdy package.
It should be noted that mod_fastcgi is not commonly included in CentOS / Red Hat Enterprise Linux distributions and their derivatives (including Amazon Linux). You can either compile it from source or use a third-party repository which does include it: for example RPMForge, DAG or City-Fan.
Regrettably we woke up some of our beloved sysadmin users last night – those who reside in the Asia/Pacific time zones, at least. A number of false alarms were issued by Wormly, owing to a recent update to the asynchronous DNS resolution library intended to improve the performance of our Uptime Monitoring system.
As it turns out there are still some issues with this library when operating under load, and this resulted in 3 instances of spurious “Timeout while contacting DNS servers” errors between Tue, 23 Jul 2013 21:15:21 +0000 and Wed, 24 Jul 2013 21:05:00 +0000. The deploy was reverted at this time.
A total of 528 erroneous SMS & Phone Call alerts were sent as a result. These have been removed from the affected accounts to ensure that users are not charged for these messages.
We’re acutely aware that false alarms – especially those that unnecessarily wake you up – are an enormous failure on our part, and for this we are deeply sorry. We have already increased the sensitivity of our internal monitoring of these processes so that we can catch anomalies like these faster. We are also working to improve our integration testing environment with tests which might reveal these kinds of problems pre-deploy.
Please do contact support if you have any other concerns surrounding this incident – we want to make things right for you.
Once again, we’re really sorry to have failed you – and we’re working hard to minimize the risk of breakages like these and others in future.