SRV1 Downtime Post-Mortem
As many of our customers will already be aware, we had an extended outage affecting one of our servers this weekend caused by a hardware problem. Now that everything is mostly back to normal, we wanted to explain in detail what happened.
In the second half of last week, we had a couple of outages on SRV1 where the server slowed down and stopped responding. The same thing happened again first thing Saturday morning, with the exception that this time the server’s logs were filled with disk IO errors. We immediately alerted the on-site technicians at the datacenter, and the server was taken down while the hard drives were scanned. This scan found errors on one of the server’s hard drives, and the decision was made to immediately replace that drive and rebuild the RAID array. Ordinarily this can be done with no downtime while the server is running, since the drives are hot-swappable, so we were anticipating a nice straightforward hardware replacement. As a precaution, a full backup was started to make sure that our backups were as up-to-date as possible before any work was carried out on the drives.
After the backup process had been running for several hours, but before it could be completed, the server stopped responding again. This time, when the server was brought back online it was clear that the filesystem was now in a terrible state and that the server was barely functional. It now looked like the problem was not limited to one hard drive: two of the server’s four drives needed to be replaced. This meant that simply swapping in new hardware and allowing the RAID array to rebuild itself was no longer an option, and we needed to re-install the server’s operating system and restore from our backups.
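For readers wondering why a second failed drive changes the picture so much, here is a minimal sketch of how a four-drive RAID 10 array tolerates failures. It assumes the common layout of two mirrored pairs with data striped across them. The drive names and the pairing are hypothetical, and this post does not say which two drives failed or how they were paired.

```python
from itertools import combinations

# Assumed layout (not stated in the post): four drives arranged as two
# mirrored pairs, with data striped across the pairs. Names are hypothetical.
MIRROR_PAIRS = [("disk0", "disk1"), ("disk2", "disk3")]


def array_survives(failed):
    """Return True if every mirror pair still has at least one working drive."""
    failed = set(failed)
    return all(any(d not in failed for d in pair) for pair in MIRROR_PAIRS)


def survivable_two_drive_failures():
    """List every two-drive failure the assumed layout can tolerate."""
    drives = [d for pair in MIRROR_PAIRS for d in pair]
    return [combo for combo in combinations(drives, 2) if array_survives(combo)]
```

Under this assumed layout, any single failure is survivable, but a two-drive failure is only survivable when the two drives sit in different mirror pairs. Even then, a drive that is returning IO errors rather than failing cleanly can still damage the filesystem, which RAID does not protect against.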
The hard drives were no longer working well enough for us to finish the backup which had been started that morning, but we were able to merge the partial backup with an older backup to restore as fresh and complete a copy of the data as possible. Unfortunately, the older backup was several days old; the most recent scheduled backup had been interrupted by one of the earlier outages, which meant the newest full backup we had was from the first half of the week. Merging with the partial backup helped, but we were not able to avoid the loss of some of the newest files on the system which had been uploaded in the couple of days beforehand.
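The idea behind merging a partial backup with an older full one can be sketched roughly as follows. This is an illustration only, not the tooling used during the incident: the manifest format (a dict mapping file path to modification time) and both function names are hypothetical.

```python
def merge_backups(older_full, newer_partial):
    """Combine two backup manifests, preferring the newer copy of each file.

    Each manifest is a dict mapping a file path to its modification time.
    """
    merged = dict(older_full)
    for path, mtime in newer_partial.items():
        # Take the partial backup's copy unless the full backup's copy is newer.
        if path not in merged or mtime >= merged[path]:
            merged[path] = mtime
    return merged


def unrecoverable(live_paths, merged):
    """Return the paths that were on the server but are in neither backup."""
    return sorted(set(live_paths) - set(merged))
```

A file is lost only when it was created after the older full backup and the partial backup was interrupted before reaching it, which matches the kind of loss described above.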
Re-installing the operating system took a few hours, and we then started the process of restoring the backups. This process took over 36 hours to complete, during which time most web sites and email were unavailable. Individual sites became operational as they were restored, so some customers experienced less downtime than others. In total, the time from the start of the outage on Saturday to when the backup restore finished on Monday was about two and a half days.
This incident has highlighted a few issues which we will be addressing:
- While RAID 10 is resilient when it comes to hard drive failures, it is not indestructible. We will be immediately setting up more detailed proactive monitoring to see if we can get earlier warnings of hard drive problems in the future.
- Our scheduled backups were not as up-to-date as we would have liked. The system is supposed to provide daily backups, but the backups take so long that this target can sometimes be missed (and if a backup is interrupted it can take too long to catch up). The backup system needs to be improved so that backups finish more quickly, and so that we can provide better than daily backups.
- The process of restoring the backups took far too long. This is the first time we’ve had to perform a full restore using our current backup system, and we were unhappy to see how long the whole process took. In an emergency, we need to be able to restore everything in a maximum of one or two hours.
To address the last two points, we will be investing in a more sophisticated backup system, which should be operational by the end of this week. It will not only allow us to perform multiple backups per day, but will also reduce the total time to recover from this kind of incident to a fraction of what it took this time.
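One simple safeguard against the stale-backup problem described above is a check that raises an alert whenever the newest completed backup is older than the target interval. The sketch below is a hypothetical illustration, not a description of the new backup system: the run format (a tuple of finish time and a completed flag) and the 24-hour default are assumptions.

```python
from datetime import datetime, timedelta


def newest_completed(runs):
    """Return the finish time of the most recent completed run, or None.

    Each run is a (finished_at, completed) tuple; an interrupted run has
    completed set to False and is ignored.
    """
    finished = [finished_at for finished_at, completed in runs if completed]
    return max(finished) if finished else None


def backup_is_stale(runs, now, max_age=timedelta(hours=24)):
    """Return True if there is no completed backup newer than max_age."""
    last = newest_completed(runs)
    return last is None or now - last > max_age
```

Counting only completed runs matters here: an interrupted backup still looks like recent activity, but it does not give a full restore point.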
We sincerely apologise for both the duration and the severity of this outage, and we want to thank our customers for being patient while we worked to restore service.