Creating an effective disaster recovery plan for IT infrastructure

When dealing with IT infrastructure, the question is not if something will go wrong, but when. Any company will inevitably hit unexpected software or hardware issues that hurt its performance or even halt its services entirely. Planning what to do when something goes wrong saves valuable time when resolving problems and reduces mistakes or forgotten steps in the recovery process. A robust disaster recovery plan involves several steps, all of which are needed to handle problems efficiently once they arise.

Monitoring & Alerting

The first step of recovering from a problem is knowing you have a problem in the first place. Monitoring should include a wide range of metrics, such as:

Hardware health: CPU, memory, disk space, disk bandwidth, network bandwidth, etc.
Performance: Response times, latency, throughput, etc.
Security: Failed logins, missing authorization, positive malware scans, etc.
Error rates: Application errors, failed transactions, timeouts, etc.

Collecting these metrics (among others) allows you to define alerts. Striking the right balance for alert thresholds is crucial: if you alert too easily, your response team will grow accustomed to false positives and stop taking alerts seriously or treating them with urgency. If the threshold is too high, you may notify the response team too late or miss the problem entirely.
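
One common way to cut down on false positives is to alert only when a metric stays above its threshold for several consecutive samples, rather than on a single spike. The following is a minimal sketch of that idea; the class name, the threshold of 90 and the window of 3 samples are illustrative assumptions, not recommendations.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fires only when a metric stays above a threshold for several samples in a row."""

    def __init__(self, threshold: float, required_samples: int):
        self.threshold = threshold
        self.required_samples = required_samples
        # Keep only the most recent samples needed for a decision.
        self.recent = deque(maxlen=required_samples)

    def observe(self, value: float) -> bool:
        """Record a sample and return True if the alert should fire."""
        self.recent.append(value)
        window_full = len(self.recent) == self.required_samples
        return window_full and all(v > self.threshold for v in self.recent)


# Hypothetical CPU usage samples in percent, one per minute.
alert = SustainedThresholdAlert(threshold=90.0, required_samples=3)
samples = [95.0, 50.0, 92.0, 93.0, 97.0]
fired = [alert.observe(s) for s in samples]
print(fired)  # [False, False, False, False, True]
```

The single spike at the start does not fire the alert; only the three high samples at the end do. A longer window means fewer false positives but a later notification, which is the same tradeoff as the threshold itself.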

The alert message itself should include the most important details about the problem: what happened, when did it happen, is there any helpful context? Larger infrastructures may also need a muting feature, where lower-priority alerts are muted when larger ones are active, so that the response team is not bombarded with alerts when larger portions of the network fail.
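
The muting idea can be sketched in a few lines. The example below assumes a simple numeric priority where a higher number means a more severe alert, and a hypothetical cut-off `mute_at` from which an alert counts as "large"; real alerting tools usually offer their own inhibition or grouping rules, which will differ in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    priority: int  # Assumption: a higher number means a more severe alert.


def alerts_to_send(active: list[Alert], mute_at: int = 3) -> list[Alert]:
    """Mute lower-priority alerts while at least one alert at or above mute_at is active."""
    if not active:
        return []
    highest = max(alert.priority for alert in active)
    if highest < mute_at:
        # No major incident in progress: deliver everything.
        return list(active)
    return [alert for alert in active if alert.priority >= mute_at]


# Hypothetical alerts during a network outage.
outage = [
    Alert("disk 80% full on web-1", 1),
    Alert("web-1 responding slowly", 2),
    Alert("core switch unreachable", 3),
]
print([a.name for a in alerts_to_send(outage)])      # ['core switch unreachable']
print([a.name for a in alerts_to_send(outage[:2])])  # both alerts, nothing is muted
```

Muted alerts should still be recorded somewhere, since they may matter once the larger incident is resolved.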

When alerts are in place, you need to figure out who gets alerted, and how. An alert should only notify people who are directly involved in resolving the problem (the technical response team) or who need to adjust to the current situation (PR spokespeople, tech leads, management). Alerts should reach the response team quickly, so prefer push notifications or SMS over traditional channels like email, which may not be treated with priority (or the alert may be buried in other business mail).

Back up everything

Backups commonly include production data like database contents and user files / S3 objects. Equally important, but often overlooked, are things like:

Config files. Not just of your main application, but also dependencies like NTP, reverse proxies, load balancers, scripts, local systemd services, ...
TLS certificates & SSH keys, API tokens & passwords. Creating entirely new ones has other implications, like needing to update client machines, network auth, password manager contents, ...
Log files. You may need them to understand what happened after restoring the backups.
DNS entries, especially DKIM keys, SPF policies & reverse DNS entries. Even if you don't host your own DNS server, what happens if someone accidentally deletes your DNS records? The provider will typically only help with service errors, not human mistakes on your part.
Business data. This includes received and sent email messages, spam filter settings, internal documents, ...
Application data. This is more specific to your infrastructure, but for example container-based deployments often store application configuration in environment variables per container, which need to be included in backups (or restoring the backup would still lose application config).

Backup cycles

The frequency at which you make backups determines the maximum data loss you are willing to risk. On the other hand, backing up data may use resources needed to maintain service quality, negatively impacting user experience or even causing temporary service outages. Nightly backups may lose you up to 24h of production data in a worst-case scenario, but they also limit the service degradation to a time frame where few or no users are using the service. For companies with business hours, scheduling backups outside of employee shifts can be risky as well, as nobody will be present to supervise the process and act if something goes wrong. Discussing and openly communicating these decisions is key to a healthy approach to infrastructure reliability within a company.
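
To make the worst case concrete, the small sketch below computes how much data is lost if a failure forces you to restore the newest backup taken before it. The function name and the schedule (02:00 every night) are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def data_loss_window(backup_times: list[datetime], failure_time: datetime) -> Optional[timedelta]:
    """Return the span of data lost when restoring the newest backup taken before the failure.

    Returns None if no backup exists before the failure.
    """
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        return None
    return failure_time - max(usable)


# Hypothetical schedule: one backup every night at 02:00.
nightly = [datetime(2024, 3, day, 2, 0) for day in (1, 2, 3)]

# Failure one minute before the next backup would have run.
lost = data_loss_window(nightly, datetime(2024, 3, 4, 1, 59))
print(lost)  # 23:59:00
```

Halving the interval between backups halves this worst case, at the cost of running the backup load twice as often.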

Backup security

Since backups are the last resort when things go terribly wrong, their protection plays a special role: if they fail, there is no more fallback plan. Protecting backups consists of three core considerations:

Protecting backup contents from others. Since backups contain a lot of sensitive information like network configuration, passwords and encryption keys, they need to be encrypted. Stolen backups are a prime way for an attacker to gain access to a system without leaving many traces, which makes unencrypted backups a dangerous vulnerability.
Protecting backups from storage failure. Saving a backup to a server is fine, but what if that server also fails, or is unavailable when you need backups? Store backups on multiple servers, preferably with a lot of physical distance between them to rule out unforeseen events like natural disasters or power outages.
Protecting backups from corruption. Both network transfers and data storage can occasionally damage data passing through them. Technologies like RAID configurations or integrity checking features can help mitigate this problem.
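
As one example of an integrity check, you can record a checksum when the backup is created and compare it again after every transfer or before a restore. The sketch below uses SHA-256 from Python's standard library; the file name and contents are made up for the demonstration, and the helper names are illustrative.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected_checksum: str) -> bool:
    """Return True if the file still matches the checksum recorded at backup time."""
    return sha256_of_file(path) == expected_checksum


with tempfile.TemporaryDirectory() as tmp:
    backup = Path(tmp) / "backup.tar"  # hypothetical backup file
    backup.write_bytes(b"example backup contents")
    recorded = sha256_of_file(backup)  # store this alongside the backup
    ok_before = is_intact(backup, recorded)

    backup.write_bytes(b"example backup c0ntents")  # simulate corruption
    ok_after = is_intact(backup, recorded)

print(ok_before, ok_after)  # True False
```

A checksum only detects corruption; recovering from it still requires a second, intact copy of the backup.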

Validate your backups

Having scheduled backups set up and running is great, but you need to manually verify that you can also restore a backup, and that it really contains every important piece of the system. Finding out your backups were missing some files or not working properly when you need them is going to be very difficult to explain to your boss or clients. You don't have backups unless you restore one.
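
Part of a restore test can be automated by comparing the restored files against the original ones. The following sketch builds a manifest (relative path to SHA-256 checksum) for two directories and reports files that are missing or different after the restore. The directory layout and file contents are invented for the demonstration, and this check does not replace actually starting the restored service.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    manifest = {}
    for path in root.rglob("*"):
        if path.is_file():
            # read_bytes() loads the whole file; fine for a sketch, not for huge files.
            manifest[path.relative_to(root).as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def compare_manifests(expected: dict[str, str], restored: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (missing, changed) file lists, relative to the expected manifest."""
    missing = sorted(set(expected) - set(restored))
    changed = sorted(name for name in expected.keys() & restored.keys() if expected[name] != restored[name])
    return missing, changed


def write_file(root: Path, relative: str, content: bytes) -> None:
    """Helper for the demo: create a file and any parent directories."""
    target = root / relative
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)


with tempfile.TemporaryDirectory() as tmp:
    source, restored = Path(tmp) / "source", Path(tmp) / "restored"
    write_file(source, "app.conf", b"port=8080")
    write_file(source, "data/users.db", b"alice,bob")
    write_file(source, "certs/site.pem", b"certificate")
    # Simulate a faulty restore: one file differs, one is missing.
    write_file(restored, "app.conf", b"port=8080")
    write_file(restored, "data/users.db", b"alice")
    missing, changed = compare_manifests(build_manifest(source), build_manifest(restored))

print(missing, changed)  # ['certs/site.pem'] ['data/users.db']
```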

Backup retention

Another tradeoff is backup retention: how long should you keep old backups around? The more backups you keep, and the longer you keep them, the more storage space they need, but having backups from older points in time allows you to recover from problems that you caught a lot later than expected. Striking a balance will largely depend on your company's needs and should be discussed with employees from the tech side as well as management, to ensure the decision is well understood on both sides and does not rest on one person's shoulders alone.
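
A common compromise is a tiered retention policy: keep every recent backup, but only one backup per week further back. The sketch below assumes one backup per day and illustrative limits of 7 daily and 4 weekly backups; the function name and the numbers are placeholders for whatever your company agrees on.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates: list[date], daily: int, weekly: int) -> set[date]:
    """Keep the newest `daily` backups plus the newest backup of each of the last `weekly` ISO weeks."""
    ordered = sorted(backup_dates, reverse=True)  # newest first
    keep = set(ordered[:daily])

    # Because we iterate newest first, setdefault stores the newest backup of each week.
    newest_per_week: dict[tuple[int, int], date] = {}
    for backup_date in ordered:
        iso_year, iso_week, _ = backup_date.isocalendar()
        newest_per_week.setdefault((iso_year, iso_week), backup_date)

    for week in sorted(newest_per_week, reverse=True)[:weekly]:
        keep.add(newest_per_week[week])
    return keep


# Hypothetical history: 28 daily backups, from 2024-03-04 to 2024-03-31.
history = [date(2024, 3, 31) - timedelta(days=n) for n in range(28)]
keep = backups_to_keep(history, daily=7, weekly=4)
to_delete = sorted(set(history) - keep)

print(len(keep), len(to_delete))  # 10 18
```

Here the last seven days are kept in full, and the three weeks before that are represented by their Sunday backup, so 28 backups shrink to 10 while still reaching four weeks back.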

Fault tolerance isn't fault safety

In modern architectures, growing systems like databases will typically be deployed as clusters of multiple instances. This setup has numerous advantages: the load can be spread across multiple machines, the service can be scaled up and down as needed, and the cluster can recover from a number of nodes failing for any reason, without data loss or downtime.

While this sounds great, it hinges on the assumption that the cluster works as expected. Many companies have lost their production setups by not thinking about what happens when the cluster fails in an unexpected way, like a split brain circumventing transactional data constraints or a software bug corrupting data across all cluster members. Software has bugs, so be prepared. A highly available system can fail like any other. If you don't have backups when it does, you're out of options.

Make a playbook

A playbook in technology terms is simply a document describing a process. It can be a wiki page, a checklist, or even a physical piece of paper. The important part for a disaster recovery playbook is that it contains all the steps needed to recover from the problem, in order. This will typically include:

testing and verifying the scope of the problem, what is affected and to what extent
if applicable, notes on when to escalate to a different team or notify higher-ups of the situation, how to reach them and what info to include in the message
what steps can or should be taken. when is restoring a partial backup advisable, and when should a complete system backup be used? how do you find the correct backup, and what are the steps to restoring it?
tests to verify that restoring the backup worked correctly and the service is back to normal operation; this may include scripts or checking monitoring metrics
if and how to report the problem and its resolution, and to whom. should something be written down? should a follow-up meeting be called to discuss how the problem happened, and what can be done to prevent it next time or catch it earlier? did all alerts and the playbook work as intended?

This playbook should be written in an easy to understand way, only including information necessary to resolve the problem. It should be accessible to multiple people, to ensure that it can be used when needed, even if the primary employees responsible for the task are on vacation.

Employee training

People make mistakes, doubly so under pressure. Trying to restore a backup with your company's finances and potentially your job at stake is extremely stressful and may negatively impact the quality of the process. Just as employees regularly practice what to do when the fire alarm goes off, they should have a "software fire alarm" drill, where they run through the steps necessary to recover from an unexpected outage or infrastructure problem. Running this drill once a year is enough, but it should be inclusive, so that newer members also know where to find the playbook and how to execute recovery steps as needed.

Training should include post-mortem feedback from both sides: the executives or tech lead in charge should express how satisfied they were with the speed and process of recovery, and the tech team should note how well the playbook worked for them, whether any questions remained open, and whether the process needs to be adjusted or reworded to be more understandable.

Regular backup strategy reviews and adjustments

Technology and infrastructure change, and so do backup requirements. A backup strategy and recovery playbook are only valuable as long as they are up to date. Making time for the team responsible for backups and disaster recovery to verify that the process and information are still current, and to adjust them where needed, is as important as having a strategy to begin with. This type of technical debt is often overlooked, with the backup strategy treated like a bullet point on a checklist, but if it does not constantly evolve alongside changes to the infrastructure, you may as well not have one.

Backup strategy reviews can be scheduled as part of monthly maintenance tasks, and will take little time if done regularly.