Does your DR plan include total destruction? Check ours out!

On Tuesday night (the early hours of March 10th 2021), a fire destroyed most of our primary French data centre in Strasbourg, where OVH hosts thousands of servers. Luckily nobody was injured, but it took over 100 firefighters six hours to bring it under control. Our service is running fine: we had 40 minutes of downtime.

How we handled it (from our Slack channel):

Using Slack has been brilliant for our security: we aggregate lots of security checks there, so the whole team can see all alerts and warnings in one place.
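The post doesn't say how the security checks get into Slack. One common approach is a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below assumes that approach; the webhook URL, the `build_alert`/`post_alert` helpers and the check names are all hypothetical placeholders, not our actual setup.

```python
import json
import urllib.request

# Hypothetical placeholder: Slack issues a real incoming-webhook URL per channel.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def build_alert(check: str, status: str, detail: str) -> dict:
    """Format one security-check result as a Slack message payload (hypothetical helper)."""
    icon = ":white_check_mark:" if status == "ok" else ":rotating_light:"
    return {"text": f"{icon} [{check}] {status.upper()}: {detail}"}


def post_alert(payload: dict, url: str = WEBHOOK_URL) -> int:
    """POST the payload as JSON to the webhook and return the HTTP status code."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical check name and detail, for illustration only.
    alert = build_alert("offsite-log-sync", "ok", "last sync completed at 01:00")
    print(json.dumps(alert))
    # post_alert(alert)  # uncomment once WEBHOOK_URL points at a real webhook
```

Keeping formatting separate from sending means every check script can share one message style, and the formatting can be tested without touching the network.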

In the morning we ran more checks:

  • 9:00am – 2nd-line support ran checks on our off-site log storage system to see if anyone had lost data in the window between our last data sync at 1:00am and the outage at 1:30am.
  • 11:00am – confirmed that no activity had happened in that window, so no customers were affected. Had there been any, the logs would have identified the users and we would have let them know.
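The morning check boils down to one question: which users did anything after the last sync but before the outage? Those writes reached the primary site but not the off-site copy. A minimal sketch, assuming each log entry is a hypothetical `(timestamp, user_id, action)` tuple (the real log schema isn't described here) and that the 1:00am sync captured everything up to and including 1:00am:

```python
from __future__ import annotations

from datetime import datetime


def users_active_in_gap(
    logs: list[tuple[datetime, str, str]],
    last_sync: datetime,
    outage: datetime,
) -> set[str]:
    """Return the users with logged activity after the last sync, up to the outage.

    Each entry is assumed to be (timestamp, user_id, action). Anything written
    in this window reached the primary site but not the off-site copy, so these
    are the users who may have lost data.
    """
    # Exclusive lower bound: assume the sync captured everything up to last_sync.
    return {user for ts, user, _action in logs if last_sync < ts <= outage}


if __name__ == "__main__":
    last_sync = datetime(2021, 3, 10, 1, 0)
    outage = datetime(2021, 3, 10, 1, 30)
    # Made-up sample entries, for illustration only.
    logs = [
        (datetime(2021, 3, 10, 0, 45), "user-a", "upload"),  # before the sync: already copied
        (datetime(2021, 3, 10, 2, 10), "user-b", "login"),   # after the outage: served by failover
    ]
    affected = users_active_in_gap(logs, last_sync, outage)
    print(sorted(affected) if affected else "no customers affected")
```

With the sample entries above nothing falls inside the 1:00am to 1:30am window, so it prints `no customers affected`; any user who did show up in the set would be the one to contact.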

Our senior engineer did a great job setting up the failover that let this run so smoothly. I would name-drop him, but he likes his anonymity…

I think OVH have lost over 10,000 servers in this, so my heart goes out to the team at @OVHcloud_FR; they face a tough time rebuilding. There will be some hard questions asked (a data centre like this should not catch fire), but sh*t happens…

As always happens, a lot of OVH's clients did not have backup or disaster recovery plans in place, and their data, and possibly their businesses, are permanently lost. Even the cloud has holes. A total data centre loss like this is very rare: I've seen it twice in my career, the first time about 20 years ago in the US, when a fire truck crashed into a data centre.

The most important thing to have is, as always, off-site backups, and ideally what the techs call 'cross-geographic failover', i.e. backup servers in another location, preferably with a different provider.
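To make 'cross-geographic failover' a bit more concrete, here is a minimal active-passive sketch. It is not our actual setup: the `Site` type, the second provider and region names, and the health-check callable are all hypothetical. It checks that the sites really are spread across regions and providers, then picks the first healthy site in priority order.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Site:
    name: str      # hypothetical label for the deployment
    provider: str  # hosting company
    region: str    # physical location


def is_geo_diverse(sites: list[Site]) -> bool:
    """True if the sites are spread over more than one region."""
    return len({s.region for s in sites}) > 1


def uses_multiple_providers(sites: list[Site]) -> bool:
    """True if at least two different hosting providers are involved."""
    return len({s.provider for s in sites}) > 1


def choose_active(sites: list[Site], is_healthy: Callable[[Site], bool]) -> Optional[Site]:
    """Active-passive failover: return the first healthy site in priority order."""
    for site in sites:
        if is_healthy(site):
            return site
    return None  # every site is down


if __name__ == "__main__":
    sites = [
        Site("primary", "OVH", "Strasbourg"),
        Site("standby", "Provider B", "Region B"),  # hypothetical second provider/region
    ]
    lost = {"primary"}  # simulate losing the whole primary data centre
    active = choose_active(sites, lambda s: s.name not in lost)
    print(is_geo_diverse(sites), uses_multiple_providers(sites), active.name)
```

Run as-is, this prints `True True standby`. In practice `is_healthy` would be a real probe (an HTTP health endpoint, say), and the diversity checks are the part a plan like OVH's customers needed: two sites in the same building, or with the same provider, would not have survived this fire.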
