Today’s apps consist of multiple component tiers with dependencies both inside the app and outside (on other applications). While separating these apps into logical operational layers yields many benefits, it also introduces a challenge: complexity.
Complexity is one of many enemies when it comes to recovery. Especially as service level objectives (SLOs) and tolerances for downtime become more and more aggressive. Traditional approaches to disaster recovery (DR) — manual processes — just cannot cater to these ever-increasing expectations, ultimately exposing the business to greater risk and longer periods of costly downtime, not to mention potential infringement on compliance requirements.
Orchestration and automation are proven solutions to recovering entire apps more efficiently and quickly, mitigating the impact of outage which, as Agent Smith puts it, is
I wanted to share with you some tips, tricks, best practices and some of the tools that will help you be more successful and efficient when building and executing a DR plan that will recover entire apps quickly and reliably.
Discover and understand applications
After the intro to this blog, it almost needn’t be said here, but it’s critical to take an application-centric mindset to DR and plan based on business logic, not just VMs. Involve application owners and management to understand the app and build your plans accordingly. What VMs make it work? Are they tiered? Do those VMs need to be powered on in correct sequence? Or can they be powered simultaneously? How can I confirm that the plan works, and the app is running? These are just a few questions you need to ask to get started. But understanding the answers, documenting them, and building your strategy based on business logic is essential to successful DR planning.
Utilize vSphere tags
Once you understand those apps, utilize VMware vSphere tags to categorize them for a more policy-driven DR practice. This is critical as changes to apps that have not been identified and implemented into a plan are among the more common causes for DR plan failure. For example, when using tags and Veeam Availability Orchestrator, as an app changes (such as a VM being added), those VMs are automatically imported into orchestration plans and dynamically grouped based on the metadata in the tag assigned. The recoverability of this new app configuration will then be determined when the scheduled automated test executes, with the outcome subsequently documented. If it succeeds, you can be confident that that app is still recoverable despite the changes. If it fails, you have the actionable insights you need to proactively remediate the failure before a real-world event.
Optimize recovery sequences
Recovering the multiple VMs that make up an app is traditionally a manual, one-by-one process, and that’s a problem. Manual processes are inefficient, lengthy, error-prone and do not scale. Especially when we consider large apps, or those that are tiered with dependencies where the VMs need to be powered-on in perfect sequence. Utilizing orchestration eliminates the challenges of manual processes while optimizing the recovery process, even when a single plan contains multiple VMs that require differing recovery logic.
For example, the below illustrates a traditional three-tiered app where in a recovery scenario we must power on our database VMs, then the application VMs, and finally the web farm VMs. Some of these VMs can be recovered simultaneously, while others must be recovered sequentially due to the aforementioned dependencies, not to mention differing recovery steps for different types of VMs. Through orchestration, we can drastically optimize recovery processes and efficiencies while also greatly improving the time it takes to recover the application. Within the orchestration plan, we can ensure that each VM has the relevant plan steps applied (e.g. scripting), that certain groups of VMs are recovered simultaneously where applicable, and that there is next to no latency between the plan steps, like sequential recovery of the VMs. This gives the plan speed, and that’s exactly what we want in disaster.
To account for disasters that are more widespread, we could build multiple applications into a single orchestration plan, although this isn’t necessarily advisable as in the event of an outage that is confined to a single app, we only want to failover that single app, not all of them. Instead, build out orchestration plans on a per-app basis, and then execute multiple orchestration plans simultaneously in the event of a more widespread outage. If the infrastructure at the DR site is sized and architected appropriately, you can test and optimize the recovery process for multiple apps and gauge how many orchestration plans your environment is capable of executing at once.
Leverage scripting
Building on the above, the ability to orchestrate and automate scripts will also help further reduce manual processes. Instead of manually accessing the VM console to check if the VM or application has been successfully recovered, orchestration tools can execute that check for me, e.g. ping the NIC to check for a response from a VM, or check VMware tools for a heartbeat. While this is useful from a VM perspective, it doesn’t confirm that the application is working. A tool like Veeam Availability Orchestrator features expansive scripting capabilities that go beyond VM-based checks. It not only ships with a number of out-of-the-box scripts for verifying apps like Exchange, SharePoint, SQL and IIS, but also allows you to import your own custom PowerShell scripts for ultimate flexibility when verifying other apps in your environment. No two applications are the same, and the more flexibility you have in scripting, the more precise and optimal you can be in recovering your apps.
Test often
Frequent and thorough testing is essential to success when facing a real-world DR event for multiple reasons:
- When tests are successful, it’ll deliver surety and confidence that the business is resilient enough to withstand a DR event, verifying the recoverability and reliability of plans.
- When tests fail, it enables the organization to proactively identify and remediate errors, and subsequently retest. It’s better to fail when things are OK than when they’re not.
- Practice! Nobody is ever a pro the first time they do something. Simulating an outage and practicing recovery often develops skills and muscle memory to better overcome a real-world disaster.
- Testing environments can also be utilized for other scenarios outside of DR, like testing patches and upgrades, DevOps, troubleshooting, analytics and more.
Guidelines on DR testing frequency can vary from once-a-year to whenever there is an application change, but you can never test too often. Veeam Availability Orchestrator enables you to test whenever you like as tests are fully-automated, available on a scheduled basis or on-demand, and have no impact to production systems or data.
Veeam can help
I’ve mentioned Veeam Availability Orchestrator a few times throughout this blog, delivering an extensible, enterprise-grade orchestration and automation tool that’s purpose-built to help you plan, prepare, test and execute your DR strategy quickly and reliably.
What’s more is that with v2, we delivered full support for orchestrated restores from Veeam backups — an industry-first. True DR is no longer just for the largest of organizations or the most mission-critical of apps. Veeam has democratized DR, making it available for all organizations, apps and data.
Whether you’re already a Veeam customer or not, I strongly recommend downloading the FREE 30-day trial. There are no limitations, it’s fully-featured and contains everything you need to get started. Have a play around, build your first plan for an application and test it. I almost guarantee you’ll learn something about your environment or plan that you didn’t know before.