The right roles and strategy for your cloud disaster recovery team
Defining your disaster recovery strategy requirements and building your team
As we touched on in our previous blog post about cloud disaster recovery, you can’t afford any disruption to your cloud. Your business continuity and reputation are on the line, so any incident needs to be resolved quickly and fully.
Your disaster recovery plan is a much-needed insurance policy. It’s there to cover you in case the unpredictable happens. And this is the crux of the matter: you don’t know what issue will occur.
As state-sponsored attacks on technology heavyweights like AWS and Microsoft (e.g. ScarletEel, Midnight Blizzard) become a core part of rogue-state cyber warfare strategy, it’s worth considering whether you need to adjust your own posture to reflect this risk.
The chances of a disaster - man-made or natural - affecting multiple geographical regions are very small, but they’re still a worrying possibility.
The unthinkable: your entire cloud service provider (CSP) goes down
So, what would happen if a major incident or attack brought down an entire cloud service provider (CSP)?
Would a multi-region backup and restore be enough in this scenario, or is a multi-cloud contingency needed?
By backing up in a different CSP region, you’re already adding an extra layer of protection – so is it worth using an entirely different CSP as a backup for the backup?
This tactic could offer some peace of mind, but it would come at a significant cost and add a degree of complexity to your backup plan.
Also, you should consider what it means if Microsoft or AWS goes down completely: this scenario seems almost unthinkably apocalyptic, and a ‘multi-cloud backup’ may not protect business continuity if the rest of the world comes to a standstill.
Our take: in almost all cases, the cost of a multi-cloud DR plan is not justified, and a multi-region approach will be more than enough.
Optimizing disaster recovery with automation and testing
Instead, it’s a better use of your resources to apply a rigorous testing regime that verifies your multi-region contingency plan works the way you expect it to.
A lot of your multi-region disaster recovery can be automated with Infrastructure-as-Code tools such as Terraform. Terraform is especially good at deploying in multi-region environments, and it can work with multi-cloud environments too.
This can take the strain off your team by rapidly redeploying cloud resources with any CSP, provided you’ve reconfigured your deployment to target another region.
You can use our open source Terraform module to help automate your versioned backups to AWS S3.
Versioned backups generate a lot of data over time, so you should apply a lifecycle rule that migrates older versions to low-cost archive storage like Amazon S3 Glacier.
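As a rough sketch of what such a lifecycle rule could look like, the Python below builds a configuration in the shape accepted by the S3 PutBucketLifecycleConfiguration API. The function name, the rule ID, the 30-day transition and the optional expiry are illustrative assumptions, not values taken from our Terraform module; adjust them to your own retention requirements.

```python
import json


def build_backup_lifecycle(archive_after_days=30, expire_after_days=None):
    """Return an S3 lifecycle configuration for a versioned backup bucket.

    The day counts are placeholder assumptions, not recommendations.
    """
    if archive_after_days < 0:
        raise ValueError("archive_after_days must not be negative")

    rule = {
        "ID": "archive-noncurrent-backups",  # hypothetical rule name
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # empty prefix: the rule covers the whole bucket
        # Move older (noncurrent) object versions to Glacier after N days.
        "NoncurrentVersionTransitions": [
            {"NoncurrentDays": archive_after_days, "StorageClass": "GLACIER"}
        ],
    }

    # Optional assumption: delete archived versions after a retention period.
    if expire_after_days is not None:
        if expire_after_days <= archive_after_days:
            raise ValueError("expiry must come after the archive transition")
        rule["NoncurrentVersionExpiration"] = {"NoncurrentDays": expire_after_days}

    return {"Rules": [rule]}


if __name__ == "__main__":
    print(json.dumps(build_backup_lifecycle(expire_after_days=365), indent=2))
```

If you manage the bucket with boto3 rather than Terraform, a dictionary like this could be passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration. In Terraform, the same settings map onto a lifecycle rule with a noncurrent version transition.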
Building your ‘A’ team
Deciding on the right strategy for your disaster recovery plan is the first step.
Next, you need to organize your team, so everyone knows their responsibilities. This is especially important if you’re using a more complex disaster recovery strategy like ‘Warm Standby’ or ‘Multisite (active/active)’.
Your disaster recovery team must be well managed to get everything back up and running in the shortest time possible – especially for multi-region recovery. A well-managed team is key to reducing your downtime to the absolute minimum, and the whole team must be aligned with the ultimate objective: business continuity.
Each role can be held by one person, or a small team, depending on your needs.
5 essential roles for your disaster recovery (DR) team:
- Overall lead
- Crisis management
- Business continuity management
- Impact recovery and assessment
- Quality Control
Now let’s look at each of these roles in detail and see what they entail.
Overall lead
This person is responsible for the overall strategy and for coordinating the other four roles in your disaster recovery plan. It’s essential that this role can build a holistic plan that reflects the business needs. There’s considerable responsibility involved: strategy management, compliance, service contracts, budgeting, risk assessment planning and coordination, and skills management. The DR lead must also liaise regularly with management and/or business teams, and create a complete plan that includes a full runbook, regular testing, auditing, and maintenance.
Crisis management
The role of crisis management is to step in and trigger your DR plan as soon as it’s needed. Crisis management is all about communication: contacting customers, partners, and team members so they know what’s going on and what they need to do, with regular updates. This communication should help the other roles coordinate and keep everyone apprised of the current status.
Business continuity management
As business continuity is the ultimate goal of your DR plan, this role ensures that your contingency planning and policies work towards this objective, working closely with each business unit if needed. They should seek to maximize business continuity by assessing all alternative methods of (partial) continuity in case your disaster recovery timeline exceeds your planned objective. This way the most critical processes can be prioritized.
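To illustrate that reasoning, here is a minimal Python sketch that flags the processes whose estimated recovery time exceeds their planned recovery time objective and lists them with the most critical first. The process names, the figures and the criticality scale are all hypothetical; in practice they would come from your own business impact assessment.

```python
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    criticality: int        # 1 = most critical (hypothetical scale)
    rto_minutes: int        # planned recovery time objective
    estimated_minutes: int  # current estimate for full recovery


def needs_fallback(processes):
    """Return processes whose estimated recovery exceeds their objective.

    Results are ordered most critical first, then by the tightest objective.
    """
    at_risk = [p for p in processes if p.estimated_minutes > p.rto_minutes]
    return sorted(at_risk, key=lambda p: (p.criticality, p.rto_minutes))


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = [
        Process("billing", criticality=1, rto_minutes=60, estimated_minutes=180),
        Process("reporting", criticality=3, rto_minutes=480, estimated_minutes=240),
        Process("email", criticality=2, rto_minutes=120, estimated_minutes=300),
    ]
    for p in needs_fallback(inventory):
        overrun = p.estimated_minutes - p.rto_minutes
        print(f"{p.name}: plan partial continuity, overrun {overrun} min")
```

Processes that appear in this list are the ones where an alternative or partial continuity method is worth preparing in advance.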
Impact recovery and assessment
This role handles the technical side of your disaster recovery. Considering the breadth of possibilities, this person (or team) must have the expertise for every scenario, including a supply chain attack. Whatever happens, they need to ensure a rapid and complete recovery, covering all aspects of data and infrastructure such as networks and configurations. This role must coordinate effectively with Quality Control (below) to assess the state of readiness and adjust processes where needed.
Quality Control
You must have Quality Control (QC) to ensure that your recovered state matches your requirements for capacity, compliance, and security. This involves devising and executing a validation process for application configurations, networks, security, and compliance. This role covers both pre-emptive testing and post-recovery auditing.
An overarching strategy
Coordinating each of the roles above and using a comprehensive runbook will ensure the smoothest possible recovery process. But how can you ensure your strategy aligns with your greater business objectives?
What contingencies do you need for your specific situation, to avoid problems even when the worst happens?
It’s absolutely essential that your plan measures up, and this may involve bringing in some special expertise for key roles.
If you want to talk about how to ensure business continuity with the most suitable cloud disaster recovery strategy, we’re ready to help. Get in touch.