briansmith84 well, that's what having a well-thought-out and well-implemented disaster recovery plan is for. It's extremely important. Some key mistakes my dude made:
Never run major updates on a Friday evening. It's the absolute worst time, as everyone is heading off for the weekend, and even if weekend support is available it's usually the less-skilled part-timers. Major updates should be performed during business hours, when you know you can get support (from Enhance or other sysadmins) if you need it.
Never run a brand new feature against your entire software stack without testing it first. Run it on one site; if that works, run it on 10 sites, then 100 sites, and so on.
Build a disaster recovery plan BEFORE a disaster. My dude was searching through online docs when he should have already had a disaster recovery plan in place and known exactly which steps to take (up to the nuclear option of recommissioning a server from backups).
In my career I've gone through dozens of disaster situations, fully blown-out servers. Of course I still get the tension, the rapid heartbeat, the sweaty palms - BUT I know exactly what steps I'm going to take to resolve it. Depending on the initial diagnosis, some things we'd do include updating our company status page about the outage, posting a tweet, notifying our datacenter staff (if it's hardware related, i.e. the server doesn't respond to pings), and allowing 30 minutes of attempted troubleshooting if it's a software-based issue before going nuclear: decommissioning the server and recommissioning onto the standby server (in this case DNS updates automatically for us, it's just a matter of waiting for backups to transfer to the standby server for redeployment).
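To make that concrete, here's a minimal triage sketch in Python. Everything in it is hypothetical - the hostname, the notify() and diagnosed() helpers, the 30-minute cap wired in as a constant - it just shows the shape of the decision: no ping means hand off to the datacenter, otherwise troubleshoot with a hard time limit before failing over to standby.

```python
#!/usr/bin/env python3
"""Rough incident-triage sketch: decide hardware vs software path and enforce
a troubleshooting time cap before failing over. Hostname, notify(), and
diagnosed() are placeholders, not a real API."""

import subprocess
import time

AFFECTED_HOST = "web01.example.com"   # hypothetical server name
TROUBLESHOOT_LIMIT = 30 * 60          # 30-minute cap for software issues


def host_responds_to_ping(host: str) -> bool:
    """Single ICMP probe; a dead box usually fails this check first."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def notify(message: str) -> None:
    """Placeholder for the status page update / tweet / datacenter ticket."""
    print(f"[{time.strftime('%H:%M:%S')}] {message}")


def diagnosed() -> bool:
    """Placeholder: return True once you actually know what's wrong."""
    return False


def triage() -> None:
    notify(f"Outage declared for {AFFECTED_HOST}; status page updated.")

    if not host_responds_to_ping(AFFECTED_HOST):
        # Looks like hardware: hand off to datacenter staff and start their clock.
        notify("No ping response; opening a datacenter ticket.")
        return

    # Looks like software: troubleshoot, but only up to the time limit.
    notify("Host is up; troubleshooting software, 30-minute limit in effect.")
    deadline = time.monotonic() + TROUBLESHOOT_LIMIT
    while time.monotonic() < deadline:
        if diagnosed():
            notify("Problem identified within the limit; judgment call: fix or redeploy.")
            return
        time.sleep(60)  # actual diagnosis work happens between checks
    notify("Time limit reached with no diagnosis; failing over to the standby server.")


if __name__ == "__main__":
    triage()
```

The point isn't the script itself, it's that the decisions are made ahead of time, so during the incident you're executing, not debating.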
We're running Enhance backups tri-hourly, so they're very fresh. The idea is that we can bring clients back online from a disaster scenario with minimal data loss (3 hours at most). In most situations a few hours of data loss is far preferable to ongoing downtime. Our clients wouldn't stand for hours of downtime, hence quickest recovery is key. By keeping backups super fresh and having a full dedicated server ready to go, sitting idle in "standby", we cut out the time it would take to even provision a new server. The standby server costs hundreds of dollars a month to sit there doing nothing - but in a disaster it could shave 20 minutes or more off recovery, well worth it.
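Fresh backups only help if they actually stay fresh, so it's worth monitoring that independently. Here's a tiny sketch of that idea - the backup directory path is a placeholder, and Enhance's own scheduler does the real backups; this just alerts if the newest backup slips past the 3-hour window.

```python
#!/usr/bin/env python3
"""Freshness check for tri-hourly backups: warn if the newest backup is older
than the 3-hour target. The directory path is a placeholder."""

from pathlib import Path
import sys
import time

BACKUP_DIR = Path("/backups/latest")   # hypothetical location of backup files
MAX_AGE = 3 * 60 * 60                  # tri-hourly target: 3 hours in seconds

files = [p for p in BACKUP_DIR.iterdir() if p.is_file()]
if not files:
    print("WARNING: no backups found at all")
    sys.exit(1)

age = time.time() - max(p.stat().st_mtime for p in files)
if age > MAX_AGE:
    print(f"WARNING: newest backup is {age / 3600:.1f}h old; worst-case data loss now exceeds 3h")
    sys.exit(1)

print(f"OK: newest backup is {age / 3600:.1f}h old (within the 3h window)")
```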
Now, a software issue is unique, because at some point you have to consider the cost/benefit of staying offline to troubleshoot. A lot of the time it's not an easy fix, and even diagnosing what's wrong can take a while. Do you work on it for an hour, two hours? After 3 hours, why not keep trying another few hours? At some point you have to decide that troubleshooting is wasting more time than it would take to run an alternative fix, like decommissioning the server and recommissioning it from backups. Of course, if your backups aren't super fresh then that can hurt customers too - if they get redeployed onto a 24-hour-old backup, they may be pissed to lose a day's worth of work they did on their site, etc.
I set a 30-minute limit for software issues because my partners and I are very well-rounded sysadmins, and if one of us can't at LEAST diagnose what the problem is within 30 minutes, then it's not worth pursuing. If after 30 minutes we know what's wrong, then at that point we also know roughly how long it would take to fix, and it becomes a judgment call whether to fix it or redeploy.
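That judgment call can be roughed out as simple arithmetic. This is purely a back-of-the-envelope sketch with made-up numbers (the 0.5 weighting on backup staleness is arbitrary, not a real policy), but it shows the comparison: estimated fix time versus redeploy time plus the pain of restoring from a stale backup.

```python
def fix_or_redeploy(est_fix_minutes: float,
                    redeploy_minutes: float = 60,
                    backup_age_minutes: float = 180) -> str:
    """Back-of-the-envelope call once the problem is diagnosed: redeploy if the
    estimated fix takes longer than redeploying, after padding the redeploy
    cost to account for how stale the backup is (illustrative weighting only)."""
    redeploy_cost = redeploy_minutes + 0.5 * backup_age_minutes
    return "fix in place" if est_fix_minutes <= redeploy_cost else "redeploy from backup"

# Example: a 4-hour fix vs a ~1-hour redeploy with backups up to 3 hours old.
print(fix_or_redeploy(est_fix_minutes=240))   # -> "redeploy from backup"
```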
Now, for a hardware failure we may give more leeway. If the DC staff confirm the ticket and begin investigating within 15-30 minutes, then we can give them an hour to resolve the issue. If by 60 minutes they haven't communicated an update with some positive news, then redeployment is preferable - again, it's a judgment call at that point. In the past some DCs we've worked with have been crap at hardware support; taking 6-8 hours to replace a blown CPU is unacceptable downtime, so alternative measures would be implemented long before that.
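You can write that escalation timeline down the same way, so nobody has to remember it at 2am. A small sketch, with the milestones and example numbers purely illustrative:

```python
# Hypothetical hardware-ticket milestones (minutes since the ticket was opened),
# mirroring the timeline above: the DC must acknowledge by 30 min and show real
# progress by 60 min, otherwise we switch to redeploying on the standby server.
MILESTONES = [
    (30, "datacenter has acknowledged the ticket and started investigating"),
    (60, "datacenter has posted a substantive, positive progress update"),
]

def next_action(minutes_elapsed: int, milestones_met: int) -> str:
    """Decide what to do given how long the ticket has been open and how many
    milestones the datacenter has actually hit so far."""
    for i, (deadline, description) in enumerate(MILESTONES):
        if minutes_elapsed >= deadline and milestones_met <= i:
            return f"missed milestone ({description}); start redeploying to standby"
    return "keep waiting on the datacenter"

print(next_action(minutes_elapsed=45, milestones_met=1))  # -> keep waiting
print(next_action(minutes_elapsed=70, milestones_met=1))  # -> missed the 60-min update, redeploy
```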
Our main/old servers have a crazy balance of backups and clones to maintain disaster readiness; it's hugely expensive and not great for performance. But we CAN get a dead server brought back within 3 hours, which is remarkable.
The plan with Enhance servers is to reduce recovery time to under an hour, with minimal data loss thanks to the tri-hourly backups.
It's a significant investment to have a strong backup plan and a strong disaster recovery plan. But when the SHTF you'll thank yourself for spending the money... And make no mistake, downtime WILL hit you, as it hits everyone at some point. So get prepared, be vigilant, and keep yourself educated on the subject.