Dear Users,
This message is related to your clustered hosting service and the outage experienced on the 11th of January 2010.

Outage length: 30-45 minutes
Outage time: 5.03pm GMT
Affected services: Linux and Windows hosted websites on clustered hosting
Severity: Major

Our clustered hosting architecture currently employs 2 mass storage devices. This storage system holds all of your website files, which are then accessed over a local backend network by the frontend web servers (Linux and Windows) as well as the FTP service, shell service and control panel. MySQL, MSSQL and e-mail services use separate storage infrastructure, which was not affected by this outage. At no point was any data at risk.

Websites are split into groups based on the first letter of the domain, and both storage systems store a complete copy of the data set. For each 'letter' group, one storage node is the 'master' holding the live data (read/write) and the other is the 'slave' (read only). Under normal conditions, sites are served from whichever storage array is the 'master' for that letter group. Each letter group is synchronised at regular intervals from the 'master' to the 'slave' to provide an online backup of all live data, as well as archived snapshots which allow you to access files you have previously changed or deleted. We then periodically take a tertiary backup to a datacentre 30 miles away over a fibre optic link for disaster recovery.

On Monday the 11th of January, at approximately 5.03pm GMT, we became aware of an issue affecting one of the aforementioned storage systems. The affected system was, at that time, the 'master' for approximately 90% of websites. The storage array appeared to have crashed and was not responding to ping (a low level network check) on the local network, nor would the system restart after multiple attempted reboots. Further analysis revealed that one of the RAID cards in the system had failed in a catastrophic manner, which would not allow the system to start.
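As a rough illustration of the letter-group layout described above, the following Python sketch maps a domain to its group and picks the array to serve it from. The array names, the inclusion of digits as groups and every function name are assumptions made for the example; this is not Tsohost's actual implementation. The fallback to the read-only copy mirrors the recovery step described in this notice.

```python
import string
from dataclasses import dataclass
from typing import Optional

# Hypothetical array names; the notice only says there are two storage systems.
ARRAY_A = "storage-a"
ARRAY_B = "storage-b"

# Assumption: one group per leading character, digits included.
GROUP_KEYS = string.ascii_lowercase + string.digits


@dataclass(frozen=True)
class LetterGroup:
    key: str
    master: str  # array holding the live, read/write copy
    slave: str   # array holding the synchronised, read-only copy


def build_groups(masters_on_a: str) -> dict[str, LetterGroup]:
    """Groups whose key is in masters_on_a are mastered on ARRAY_A, the rest on ARRAY_B."""
    groups = {}
    for key in GROUP_KEYS:
        if key in masters_on_a:
            groups[key] = LetterGroup(key, ARRAY_A, ARRAY_B)
        else:
            groups[key] = LetterGroup(key, ARRAY_B, ARRAY_A)
    return groups


def locate(domain: str, groups: dict[str, LetterGroup],
           failed: Optional[str] = None) -> tuple[str, str]:
    """Return (array, access mode) for a domain, falling back to the slave copy."""
    group = groups[domain[0].lower()]
    if group.master != failed:
        return group.master, "read/write"
    # The master is down: serve the last synchronised copy, read only.
    return group.slave, "read-only"
```

For example, with `groups = build_groups("abc")`, the call `locate("example.com", groups)` selects `storage-b` in read/write mode, while passing `failed="storage-b"` selects `storage-a` in read-only mode.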
Due to the unavailability of this storage system, web server worker processes became 'stuck' and were unable to continue serving websites. These web servers needed to be rebooted in order to restore service, which is something we are working to address.

On discovering that the failed storage array could not be recovered quickly, we took the decision to start serving all sites from the secondary array in order to bring them back online. Sites in letter groups for which this was not the live/master storage array were read only (files could not be updated by FTP), and website files would have appeared to temporarily 'roll back' to a time up to a few hours prior to the outage. To reiterate, the 'roll back' only affected website files and *not* databases or e-mail. Linux websites came up after approximately 30 minutes of downtime, with Windows websites taking a little longer and experiencing around 45 minutes of downtime. All databases, e-mail and webmail on the clustered hosting system were unaffected.

Once we were able to replace the failed RAID card in the primary storage array, the previously 'live' data was sent to the secondary storage array and read/write access was restored over the following hours. Full read/write access for all websites was restored by 10.30pm, allowing users to update their websites via FTP and the like.

During the time that sites were read only, 99.9% of websites such as blogs, forums and e-commerce systems continued as normal, being database driven. Users would only have encountered problems if they were trying to update their site via FTP during this period, or if their site was writing to the web root (we always recommend storing cache files etc. in /tmp rather than your web root, however).

Unfortunately, given a long enough time period, failure of at least one component is inevitable in any system.
What we are attempting to do is to accept that certain types of failure are inevitable and to improve systems and policies to deal with those failures as and when they occur, so that the system as a whole is not affected, thus reducing (or even eliminating) any impact such a failure has on our users.

The type of failure that occurred in the clustered hosting storage array is typically very rare, but in standard 'single server' hosting systems it can often lead to many hours of complete downtime whilst RAID arrays are rebuilt and/or spares are located. Whilst our customers did experience downtime, it was comparatively very brief. We do understand that any downtime is always an inconvenience, and as such we are implementing certain procedures that should severely limit the impact of any such failure in the future.

- The first of these measures will be to increase the frequency at which data is replicated from the 'master' to the 'slave' storage system to approximately once every 15 minutes. This will reduce the impact of any temporary 'roll back' due to a failure in either of the storage arrays.
- The second of these measures, which was already underway at the time of the outage, is to increase the distribution of 'live' letter groups between the 2 storage arrays so that a storage failure in one array can at worst impact 50% of hosted sites.
- The third of these measures is to attempt to prevent stuck processes from taking down websites which would otherwise be live when a failure occurs in a storage array which is not the 'master' array for that site. We have already taken steps to ensure this is the case.

We will also be maintaining an offsite hosted page at status.tsohost.co.uk. This page will contain live information on any current outages so that we can keep customers better informed of any incidents and recovery progress.
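The first two measures listed above can be illustrated with a small Python sketch. It estimates the share of sites whose 'master' copy sits on a failed array, spreads letter groups across the two arrays with a simple greedy pass, and states the worst-case 'roll back' for a given replication interval. The site counts, array names and the greedy strategy are all assumptions chosen for illustration; the notice does not say how groups are actually distributed.

```python
import string

# Hypothetical array names, not taken from the notice.
ARRAYS = ("storage-a", "storage-b")


def impacted_share(sites_per_group: dict[str, int],
                   masters: dict[str, str], failed: str) -> float:
    """Fraction of sites whose master copy lives on the failed array."""
    total = sum(sites_per_group.values())
    hit = sum(n for key, n in sites_per_group.items() if masters[key] == failed)
    return hit / total


def rebalance(sites_per_group: dict[str, int]) -> dict[str, str]:
    """Greedy split: place the largest groups first on the lighter-loaded array."""
    load = {array: 0 for array in ARRAYS}
    masters = {}
    for key, n in sorted(sites_per_group.items(), key=lambda kv: -kv[1]):
        target = min(ARRAYS, key=lambda array: load[array])
        masters[key] = target
        load[target] += n
    return masters


def worst_case_rollback_minutes(sync_interval_minutes: float) -> float:
    """Files changed just after a sync are stale by up to one full interval.

    Simplification: ignores the time the synchronisation itself takes.
    """
    return sync_interval_minutes


# Invented figures: 26 equally sized letter groups.
sites = {key: 100 for key in string.ascii_lowercase}
balanced = rebalance(sites)
share = impacted_share(sites, balanced, "storage-a")
```

With equally sized groups the greedy pass alternates between the arrays, so `share` comes out at one half. Real groups differ in size, so an even split is something to aim for rather than a guarantee.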
The clustered hosting system is unique to Tsohost, and we're working hard every day not only to add new and interesting features to the system, but also to improve its resilience. Such is the popularity of the system that it now hosts more than 8,000 websites across the platform, so that users need not worry if they're featured on sites such as BBC News or Digg. With your continued support we hope this will grow long into the future.

Once again, we sincerely apologise for any inconvenience caused by this outage. If you have any further questions or queries, please direct them to support@tsohost.co.uk.

Regards,
Tsohost Support.