Tsohost.co.uk Forums - UK Web Hosting  
  #21 (permalink)  
Old 11th January 2010, 09:29 PM
Member
 
Join Date: Oct 2008
Posts: 30
Default

My uploading is working too.

Ta Adam and Darren
  #22 (permalink)  
Old 11th January 2010, 09:29 PM
Tsohost Staff Member
 
Join Date: Nov 2006
Posts: 565
Default

All letters are now read/write.

If anyone has any residual issues please contact support.
  #23 (permalink)  
Old 13th January 2010, 10:40 AM
Member
 
Join Date: Oct 2008
Posts: 30
Default

Quote:
Originally Posted by Darren
We will be issuing a more full RFO in due course, as well as reviewing our failover policies to reduce windows of downtime due to any hardware failure.

Have I missed this?
  #24 (permalink)  
Old 13th January 2010, 10:44 AM
Tsohost Staff Member
 
Join Date: Nov 2006
Posts: 565
Default

We haven't yet prepared the full RFO. Sorry for the delay.
  #25 (permalink)  
Old 13th January 2010, 02:31 PM
Tsohost Staff Member
 
Join Date: Nov 2006
Posts: 565
Default

Dear Users,

This message relates to your clustered hosting service and the outage experienced on the 11th of January 2010.

Outage length: 30-45 minutes
Outage time: 5.03pm GMT
Affected services: Linux and Windows hosted websites on clustered hosting.
Severity: Major

Our clustered hosting architecture currently employs 2 mass storage devices. This storage system holds all of your website files which are then accessed over a local backend network by the frontend web servers (Linux and Windows) as well as the FTP service, shell service and control panel. MySQL, MSSQL and e-mail services use separate storage infrastructure which was not affected by this outage. At no point was any data at risk.

Websites are split into groups based on the first letter of the domain and both storage systems store a complete copy of the data set. For each 'letter' group, one storage node is the 'master' holding the live data (read/write) and one the 'slave' (read only). Under normal conditions, sites are served from whichever storage array is the 'master' for that letter group. Each letter group is synchronised at regular intervals from the 'master' to the 'slave' to provide an online backup of all live data, as well as archived snapshots which allow you to access files you have previously changed or deleted. We then periodically take a tertiary backup to a datacentre 30 miles away over a fibre optic link for disaster recovery.
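As an illustration only, the letter-group scheme described above can be sketched as follows. The node names `storage-a` and `storage-b`, the function names, and the particular assignment of letters to nodes are assumptions made for the example, not details of the actual platform.

```python
import string

# Hypothetical names for the two storage arrays.
NODES = ("storage-a", "storage-b")


def letter_group(domain):
    """Return the group key: the first character of the domain, lowercased."""
    cleaned = domain.strip().lower()
    if not cleaned:
        raise ValueError("domain must not be empty")
    return cleaned[0]


def placement(group, master_for):
    """Return (master, slave) for a group; the slave is simply the other node."""
    master = master_for[group]
    slave = NODES[1] if master == NODES[0] else NODES[0]
    return master, slave


# Illustrative assignment: most groups mastered on storage-a, a few on storage-b.
MASTER_FOR = {c: "storage-a" for c in string.ascii_lowercase + string.digits}
MASTER_FOR.update({c: "storage-b" for c in "xyz"})

if __name__ == "__main__":
    for site in ("Example.co.uk", "zebra.com"):
        group = letter_group(site)
        print(site, group, placement(group, MASTER_FOR))
```

Writes for a site would go to the master for its group, while the slave holds the periodically synchronised read-only copy.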

On Monday the 11th of January, at approximately 5.03pm GMT, we became aware of an issue affecting one of the aforementioned storage systems. The affected system was, at that time, the 'master' for approximately 90% of websites. The storage array appeared to have crashed: it was not responding to ping (a low level network check) on the local network, nor would it restart after multiple reboot attempts. Further analysis revealed that one of the RAID cards in the system had failed in a catastrophic manner, which prevented the system from starting. Due to the unavailability of this storage system, web server worker processes became 'stuck' and were unable to continue serving websites. These web servers needed to be rebooted in order to restore service, which is something we are working to address.

On discovering that the failed storage array could not be recovered quickly, we took the decision to start serving all sites from the secondary array in order to bring them back online. Sites in letter groups for which this was not the live/master storage array were read only (files could not be updated by FTP) and website files would have appeared to temporarily 'roll back' to a time up to a few hours prior to the outage. To reiterate, the 'roll back' only affected website files and *not* databases or e-mail. Linux websites came up after approximately 30 minutes of downtime, with Windows websites taking a little longer and experiencing around 45 minutes of downtime. All databases, e-mail and webmail on the clustered hosting system were unaffected.
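The failover behaviour described above can be sketched roughly as below. This is a simplified model under stated assumptions, not the real failover logic: the node names and the `serve_from` function are hypothetical.

```python
def serve_from(group, master_for, failed_node=None,
               nodes=("storage-a", "storage-b")):
    """Return (node, mode) for a letter group, given an optional failed node.

    Groups whose master is healthy are served read-write as normal.
    Groups whose master has failed are served from the surviving node's
    replica, which is read-only and may lag behind the last live data.
    """
    master = master_for[group]
    if master != failed_node:
        return master, "read-write"
    surviving = [n for n in nodes if n != failed_node]
    if not surviving:
        raise RuntimeError("no storage node available")
    return surviving[0], "read-only"


if __name__ == "__main__":
    # Illustrative assignment only.
    master_for = {"a": "storage-a", "z": "storage-b"}
    for group in ("a", "z"):
        print(group, serve_from(group, master_for, failed_node="storage-a"))
```

In this model, a site in group `a` would be served read-only from the replica while `storage-a` is down, which corresponds to the temporary 'roll back' and the inability to upload by FTP.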

Once we were able to replace the failed RAID card in the primary storage array, the previously 'live' data was sent to the secondary storage array and read/write access was restored over the following hours. Full read/write access for all websites was restored by 10.30pm, allowing users to use FTP to update their websites and the like. During the time that sites were read only, 99.9% of websites such as blogs, forums and e-commerce systems continued as normal, being database driven. Users would have only encountered problems if they were trying to update their site via FTP during this period, or if their site was writing to the web root (we always recommend storing cache files etc. in /tmp rather than your web root).
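As a small example of the recommendation to keep cache files out of the web root, the sketch below writes cache entries under the system temporary directory. The application name `mysite` and the helper names are placeholders; on Linux, `tempfile.gettempdir()` normally resolves to /tmp unless the `TMPDIR` environment variable says otherwise.

```python
import os
import tempfile


def cache_path(name, app="mysite"):
    """Return a path for a cache entry under the system temp directory."""
    cache_dir = os.path.join(tempfile.gettempdir(), app + "-cache")
    os.makedirs(cache_dir, exist_ok=True)
    # basename() keeps the entry inside cache_dir even if name contains slashes.
    return os.path.join(cache_dir, os.path.basename(name))


def write_cache(name, data):
    """Write bytes to the cache via a temporary file, then rename into place."""
    path = cache_path(name)
    partial = path + ".part"
    with open(partial, "wb") as handle:
        handle.write(data)
    os.replace(partial, path)  # readers never see a half-written file
    return path


if __name__ == "__main__":
    print(write_cache("homepage.html", b"<html>cached</html>"))
```

A site that caches this way does not depend on the web root being writable, so it would keep working while the website files are served read only.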

Unfortunately, given a long enough time period, failure of at least one component is inevitable in any system. What we are attempting to do is to accept that certain types of failure are inevitable and to improve systems and policies to deal with those failures as and when they occur; to ensure that the system as a whole is not affected, thus reducing (or even eliminating) any impact such a failure has on our users. The type of failure that occurred in the clustered hosting storage array is typically a very rare failure, but in standard 'single server' hosting systems can often lead to many hours of complete downtime whilst RAID arrays are rebuilt and/or spares are located. Whilst our customers did experience downtime, it was comparatively very brief. We do understand in any case that any downtime is always an inconvenience, and as such we are implementing certain procedures that should severely limit the impact of any such failure in the future.

- The first of these measures will be to increase the frequency at which data is replicated from the 'master' to the 'slave' storage system to approximately once every 15 minutes. This will reduce the extent of any temporary 'roll back' caused by a failure in either of the storage arrays.

- The second of these measures, which was already underway at the time of the outage, is to distribute the 'live' letter groups more evenly between the 2 storage arrays so that a storage failure in one array can at worst impact 50% of hosted sites.

- The third of these measures is to attempt to prevent stuck processes from taking down websites which would otherwise be live when a failure occurs in a storage array which is not the 'master' array for that site. We have already taken steps to ensure this is the case.
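To illustrate the second measure, here is a minimal sketch of one way 'live' letter groups could be spread across two arrays so that neither is master for much more than half of the sites. The greedy approach, the node names and the per-letter site counts are all assumptions for the example; they are not figures or methods taken from the platform.

```python
def balance_masters(site_counts, nodes=("storage-a", "storage-b")):
    """Greedily assign each letter group to the currently lighter node.

    Groups are placed largest first. Returns (master_for, load), where
    load is the number of sites mastered on each node.
    """
    load = {node: 0 for node in nodes}
    master_for = {}
    ordered = sorted(site_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for group, count in ordered:
        target = min(nodes, key=lambda node: load[node])
        master_for[group] = target
        load[target] += count
    return master_for, load


def worst_case_share(load):
    """Fraction of sites hit if the most heavily loaded node fails."""
    total = sum(load.values())
    return max(load.values()) / total if total else 0.0


if __name__ == "__main__":
    # Made-up site counts per letter group, for illustration only.
    counts = {"a": 40, "b": 30, "c": 20, "d": 10}
    master_for, load = balance_masters(counts)
    print(master_for, load, worst_case_share(load))
```

A greedy split like this does not guarantee an exact 50/50 division for every set of counts, but it keeps the worst-case share close to half when no single letter group dominates.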

We will also be maintaining an offsite hosted page at status.tsohost.co.uk. This page will contain live information on any current outages so that we can keep customers better informed of any incidents and recovery progress.

The clustered hosting system is unique to Tsohost and we're working hard every day not only to add new and interesting features to the system, but also to improve its resilience. Such is the popularity of the system that it now hosts more than 8,000 websites across the platform, so that users do not have to worry if they're featured on sites such as BBC News or Digg. With your continued support we hope this will grow long into the future. Once again, we sincerely apologise for any inconvenience caused by this outage. If you have any further questions or queries, please direct them to support@tsohost.co.uk.

Regards,

Tsohost Support.
All times are GMT.