Modern companies rely on their data and applications in almost every aspect of their business, whether it’s an e-commerce website, inventory databases and industrial controls for manufacturers, sales and customer relationship management systems, medical records, or payment systems. Without these capabilities, hospitals, utility companies, and emergency services can’t function and provide critical services. Companies could be unable to serve customers, take new orders, or manufacture products, causing them to lose revenue, miss obligations, or even face regulatory penalties.
Understanding Disaster Recovery in the Cloud
Disaster Recovery planning starts by mapping business functions and needs to applications and services, then determining the impact an outage or data loss event would have on the company. From there, the business can weigh the risk and impact against different solutions and their costs to arrive at acceptable recovery point objectives (RPOs) and recovery time objectives (RTOs). A Disaster Recovery plan requires understanding not only the applications and services but also their dependencies on other parts of the ecosystem, and then mapping an RPO/RTO to every system. These objectives can range from near-constant availability to days or weeks of tolerable downtime before an outage has a meaningful impact. Once everything is categorized and inventoried with a map of dependencies, a business can create a plan that ensures it is able to survive various disasters. To learn more, see this article for a more in-depth discussion on RPOs/RTOs. You can also read more on Business Continuity and Disaster Recovery planning.
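The dependency mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the system names, RTO values, and the `effective_rto` helper are hypothetical and do not come from any Veeam or Google Cloud tool. The idea is that a system cannot come back before the services it depends on, so each dependency inherits the tightest RTO of anything that relies on it.

```python
def effective_rto(rto_hours, depends_on):
    """Return the tightest RTO each system must meet once dependents are considered.

    rto_hours:  {system: business-assigned RTO in hours} (every system needs an entry)
    depends_on: {system: [systems that must be online first]}
    """
    effective = dict(rto_hours)
    changed = True
    # Repeat until no value changes; values only ever decrease, so this terminates.
    while changed:
        changed = False
        for system, deps in depends_on.items():
            for dep in deps:
                if effective[dep] > effective[system]:
                    effective[dep] = effective[system]
                    changed = True
    return effective


# Hypothetical inventory: the web store has a 1-hour RTO and needs the
# inventory database and directory service, which were rated far more loosely.
rto = {"web-store": 1, "inventory-db": 24, "directory": 48, "reporting": 72}
deps = {
    "web-store": ["inventory-db", "directory"],
    "inventory-db": ["directory"],
    "reporting": ["inventory-db"],
}

for name, hours in sorted(effective_rto(rto, deps).items()):
    print(f"{name}: {hours}h")
```

In this example the database and directory service end up with a 1-hour objective even though they were rated at 24 and 48 hours on their own, which is the kind of gap a dependency map is meant to expose.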
Traditional Disaster Recovery (DR) plans have historically involved dedicating resources in another location. These additional sites could be reserved capacity in other company-owned data centers, one or more dedicated DR data centers, or co-located data centers rented from a hosting provider. Environments like this require utilities (power, cooling, security systems) and staff to maintain and secure them, as well as the cost of hardware (servers, network, storage) and software (operating systems, security and monitoring tools, application licensing, etc.). These sites often sit idle for long periods, costing the company money and providing little value until an incident occurs. They are static in size, so when production expands, additional planning is required to accommodate the growth. Hardware will also need to be refreshed periodically as systems age out. See this article for more details.
Cloud computing offers a different type of consumption compared to on-premises data centers. Applications with high availability (HA) requirements can be paired with cloud-based machines or moved to highly durable and available solutions from the cloud which can make achieving SLAs easier compared to on-premises-only options. The flexible, on-demand nature of the cloud fits well with the needs for Disaster Recovery which is typically only used during testing and actual disasters, saving you money upfront and allowing you to pay when you use it. The cloud is built to be managed and paired with Infrastructure-as-Code solutions to automate configuration of environments which helps minimize time to duplicate production environments and remove human error during emergencies. Veeam recoveries can also be planned and automated to take the risk out of disaster scenarios which are often the most stressful moments for an organization.
Benefits of Leveraging Google Cloud for Disaster Recovery
Google Cloud offers a large variety of solutions that can help a company successfully implement a Disaster Recovery plan, with regional data centers all around the world connected via Google’s own private global network. Google Cloud offerings for DR include Google Cloud VMware Engine (GCVE) and Google Compute Engine (GCE). Google Cloud VMware Engine can help you easily migrate existing on-premises workloads to a VMware cluster in the cloud without changing VM formats, allowing you to seamlessly fail back when ready. Google Compute Engine is a cost-effective and highly scalable way to recover to the cloud by leveraging native VM instances, which gives you options to recover physical and virtual machines on demand. Both offerings allow you to bring your license with you and immediately start protecting your recovered environment: Veeam Backup & Replication natively protects GCVE and anything requiring an agent, while Veeam Backup for Google Cloud protects native GCE instances and Platform-as-a-Service databases by sending data directly to Google’s Cloud Object Storage.
Google Cloud VMware Engine delivers a fully managed VMware Cloud Foundation stack (VMware vSphere, vCenter, vSAN, NSX-T, and HCX) in a dedicated environment on Google Cloud. By leveraging this service, virtualized workloads can be deployed on Google Cloud in minutes, directly through the Google Cloud Console. Google Cloud VMware Engine enables organizations to run and manage cloud-based virtualized workloads consistently with their on-premises environments. Google Cloud offers a solution user account that third-party vendors like Veeam can use to obtain elevated permissions in the VMware environment. This enables features like our Continuous Data Protection (CDP), Replication, SureBackup, Instant VM Recovery, Re-IP of restored systems, and more, giving you the same experience as a customer-managed environment without the overhead of managing the infrastructure. Google Cloud VMware Engine environments can be deployed or expanded in minutes to hours, helping you meet your recovery objectives.
Key Components of Google Cloud DR
Disaster Recovery comes in many forms. For many customers, there will be critical infrastructure services that need to be online first before other applications can be recovered. With Google Cloud and Veeam Software, you can leverage a minimally sized GCVE cluster to perform VM CDP or snapshot-based replication from a customer-managed VMware environment to a Google-managed cluster (or from one GCVE environment to another), enabling core services and applications with a low RTO to be staged and ready to power on with no restore time. Customers can quickly scale clusters by setting the quota for their regions ahead of time to match what is required during a Disaster Recovery scenario. Customers can also pre-create the networks required to mirror production for this cluster, saving time before a restore. Having the minimal cluster offers the fastest recovery solution for customers looking to bring business-critical applications online without maintaining their own managed data center. With a few clicks, the environment can be scaled to meet the needs of VMs with longer recovery windows by leveraging VM restores. Google Cloud VMware Engine offers several ways to scale compute separately from storage. By default, GCVE leverages VMware’s SDDC suite and comes with vSAN datastores using internal NVMe disks. Storage-only nodes can be added in some regions, and customers have several NFS-based datastore options, including Google Cloud Filestore and NetApp Cloud Volumes Service for Google Cloud.
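As a rough illustration of the split described above, the sketch below sorts workloads into those replicated to the standing minimal cluster and those restored after the cluster is scaled out. The 4-hour threshold, the workload names, and the `plan_tiers` helper are assumptions made for this example, not values or functions from Veeam or Google Cloud.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    rto_hours: float
    vcpus: int
    memory_gb: int


def plan_tiers(workloads, replicate_below_hours=4.0):
    """Split workloads by RTO and total the capacity the standing cluster must hold.

    The threshold is a hypothetical planning value chosen for this example.
    """
    replicate = [w for w in workloads if w.rto_hours < replicate_below_hours]
    restore = [w for w in workloads if w.rto_hours >= replicate_below_hours]
    standing_capacity = {
        "vcpus": sum(w.vcpus for w in replicate),
        "memory_gb": sum(w.memory_gb for w in replicate),
    }
    return replicate, restore, standing_capacity


# Hypothetical inventory for illustration only.
inventory = [
    Workload("domain-controller", 0.5, 2, 8),
    Workload("dns", 0.5, 2, 4),
    Workload("order-db", 2, 8, 64),
    Workload("file-server", 24, 4, 16),
    Workload("test-lab", 72, 8, 32),
]

replicate, restore, capacity = plan_tiers(inventory)
print("Replicate to standing cluster:", [w.name for w in replicate])
print("Restore after scale-out:", [w.name for w in restore])
print("Standing cluster must hold:", capacity)
```

The capacity total covers only the replicated tier, which is what keeps the always-on cluster small. The restore tier informs the quota to request ahead of time.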
One of the benefits of having an existing GCVE DR cluster is the ability to leverage the Veeam Data Platform Premium’s Recovery Orchestrator. Some other cloud providers limit the capabilities of third-party software to communicate with hosts and perform sensitive operations. Recovery Orchestrator allows you to build recovery plans and validation steps that dramatically simplify the process of handling a disaster. A business can leverage recovery plans to handle replication failover and machine restores, plus custom scripting to extend the software beyond just powering on systems. Automated testing, reporting, and documentation are built into the platform, helping customers deal with industry compliance requirements.
The recovery requirements and budget for some customers might not require this pilot-light approach. Customers who have a longer restore window but still want to bring workloads back on-premises with minimal effort once issues are mitigated could deploy GCVE on demand. Customers can couple these on-demand clusters with infrastructure automation tools to set up the environment with minimal effort. Once the environment is ready, the GCVE environment can be added to Backup & Replication and businesses can start recovering workloads. Customers can leverage Google Cloud Storage (GCS) as part of a Scale-out Backup Repository or back up directly to GCS to keep a copy of their data in the cloud, close to where they plan to recover. In a DR scenario, having a backup server in the cloud, or leveraging a gateway in GCE or GCVE, can allow customers to access their backup data directly from their Virtual Private Cloud (VPC) subnet using Private Google Access. A dedicated Backup & Replication server can be deployed in the cloud and kept offline to minimize extra steps during DR situations. During a DR event, the server can be powered on and the GCVE environment and GCS buckets added. Scripted or manual restores of workloads can be started as soon as the GCVE environment is ready.
Customers might consider migrating some workloads to native GCE instances instead. Veeam Backup & Replication allows physical and virtual machines to be restored directly to GCE. Just like with GCVE, backups from anywhere can be recovered to GCE, although the process can be a bit faster when the backups are already in Google Cloud. Restoring Backup & Replication backups to GCE involves copying the disk image data to a temporary GCS bucket (hidden via the Import API), after which Google Cloud converts it to a VM.
Architecting Disaster Recovery Solutions on Google Cloud
Some of the main considerations for leveraging the cloud as part of your DR strategy:
- The structure of how cloud resources are managed compared to on-premises
- The billing model and performance are based on the resources assigned to the machine
- Network and firewall policies
- Identity and access management
When these servers are converted to native instances, they must exist in a customer’s account. With on-premises, a business owns and manages the hardware supporting the environment. Different business units might all share hardware from resources managed by the Corporate IT team. Servers might have restricted access to just the appropriate users, however in the cloud there are many different options for managing resources logically. With Google Cloud, the top layer is an organization tied to the business, below that are folders and projects. Projects are the container of resources that VMs, storage, networking, and other services are tied to. With Google Cloud VMware Engine, this might be a corporate project that hosts the VMware Data Center that works like on-prem. For workloads recovered directly to Google Compute Engine, some additional planning might be required. A top-level folder could be assigned to different business units and Projects created for each application or a project might be created for a business offering that includes multiple applications that work together. How the environment is set up will vary depending on the structure and requirements of the business.
Servers recovered directly to GCE must be assigned a machine type. Machine types come in dozens of preconfigured sizes, but custom sizes can be created as well. Each machine type is priced slightly differently because the families offer different capabilities: general-purpose, CPU-optimized, memory-dense, or accelerator-optimized for machine learning and high-performance computing. Virtual disk storage also comes with different characteristics, with options around performance (HDD, SSD, etc.) and durability (local ephemeral, zonal, regional). The more resources a server requires or has allocated, the higher the price typically is, so right-sizing based on historical data can help optimize cost and ensure migrated systems run as expected.
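Right-sizing from historical data can be sketched as picking the smallest shape that covers observed peak usage plus some headroom. The catalog below is a made-up set of shapes for illustration, not the actual Compute Engine machine type list, and the 20% headroom and the `right_size` helper are likewise assumptions.

```python
# Hypothetical catalog: (name, vCPUs, memory in GB). Real machine types,
# sizes, and prices should be taken from current Google Cloud documentation.
CATALOG = [
    ("small", 2, 8),
    ("medium", 4, 16),
    ("large", 8, 32),
    ("large-highmem", 8, 64),
]


def right_size(peak_vcpus_used, peak_memory_gb_used, headroom=0.2, catalog=CATALOG):
    """Return the name of the smallest shape covering peak usage plus headroom.

    Returns None when nothing in the catalog is large enough.
    """
    need_cpu = peak_vcpus_used * (1 + headroom)
    need_mem = peak_memory_gb_used * (1 + headroom)
    candidates = [
        shape for shape in catalog
        if shape[1] >= need_cpu and shape[2] >= need_mem
    ]
    if not candidates:
        return None
    # Smallest by vCPU count first, then by memory.
    return min(candidates, key=lambda shape: (shape[1], shape[2]))[0]


# Peak figures would come from monitoring history; these are invented.
print(right_size(1.5, 6))    # light web server
print(right_size(3.0, 20))   # memory pushes this past "medium"
print(right_size(16, 8))     # nothing in this small catalog fits
```

A server provisioned on-premises with 8 vCPUs that never peaks above 1.5 would land on the smallest shape here, which is where the cost savings from right-sizing come from.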
Networks in the cloud will need to be set up ahead of time. During a restore, a VM will be automatically assigned a new IP address, which is likely to differ from the one it had on its production network. When planning the recovery process, applications that use hardcoded IP addresses to communicate between servers or components will need a plan in place to ensure the application can be brought online. Google Cloud allows very granular network communication options that can prevent servers, even on the same subnet, from talking to one another. Network tags can be used with firewall policies to let an application's components talk to each other on specific ports without opening communication between every server on a network. This enhances security, but proper planning is required to ensure applications can be recovered without much intervention.
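One way to plan for address changes is to record a predictable mapping from each production subnet to its DR counterpart, so that hardcoded addresses can be audited before an incident. The sketch below uses Python's standard `ipaddress` module to keep the host portion of an address while swapping the network. The subnets and the `re_ip` helper are hypothetical, and this is not how Veeam's Re-IP feature is implemented.

```python
import ipaddress


def re_ip(address, source_cidr, target_cidr):
    """Map an address to the same host offset in a different subnet."""
    source = ipaddress.ip_network(source_cidr)
    target = ipaddress.ip_network(target_cidr)
    ip = ipaddress.ip_address(address)

    if ip not in source:
        raise ValueError(f"{address} is not in {source_cidr}")
    if source.prefixlen != target.prefixlen:
        raise ValueError("source and target subnets must be the same size")

    # Keep the host offset, replace the network portion.
    offset = int(ip) - int(source.network_address)
    return str(target.network_address + offset)


# Hypothetical production and DR subnets.
print(re_ip("10.0.1.25", "10.0.1.0/24", "172.16.1.0/24"))
```

Running the same function over an application's configuration inventory gives a list of the addresses each component would need to use in the recovery network.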
Role-based access control in a customer-managed environment often relies on Active Directory or another LDAP-based authentication technology to grant someone access to a server, and then on group policy or local security policies to restrict permissions on that system. In the cloud, every operation that a user or service account can perform must be granted explicitly. Google Cloud’s IAM offers predefined roles and custom roles that can grant users exactly the permissions needed to perform an operation. Examining and defining the right permissions should be done in advance when migrating to the cloud, both for anyone who needs to log in to the Google Cloud console to manage resources and for any permissions a VM might need to operate in the new environment.
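Checking permissions in advance can be as simple as comparing what a recovery task needs against what a principal's roles grant. The sketch below is a local, offline comparison only. The role names and their contents are invented for this example, the permission strings are used purely as illustrative labels, and a real check should rely on the actual role definitions in your Google Cloud organization.

```python
def missing_permissions(required, granted_roles, role_definitions):
    """Return the sorted permissions a task needs that no granted role provides."""
    granted = set()
    for role in granted_roles:
        # Unknown roles contribute nothing rather than raising.
        granted |= role_definitions.get(role, set())
    return sorted(set(required) - granted)


# Hypothetical custom roles for illustration only.
ROLES = {
    "drViewer": {"compute.instances.get", "compute.instances.list"},
    "drOperator": {
        "compute.instances.get",
        "compute.instances.list",
        "compute.instances.start",
    },
    "backupReader": {"storage.objects.get", "storage.objects.list"},
}

# What an assumed "power on recovered VM and read backups" task needs.
task = [
    "compute.instances.get",
    "compute.instances.start",
    "storage.objects.get",
]

print(missing_permissions(task, ["drViewer"], ROLES))
print(missing_permissions(task, ["drOperator", "backupReader"], ROLES))
```

An empty result means the combination of roles covers the task. Anything else is a gap that would otherwise be discovered in the middle of a recovery.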
Leveraging Google Cloud DR with Veeam
Traditional Disaster Recovery can be inflexible and costly to maintain. The cloud offers an on-demand consumption model that allows users to consume what they need when they need it without worrying about maintaining the physical aspects of an environment. Successful DR planning requires companies to examine their business and technical environment and weigh the impact of various scenarios, including the total loss of a physical site. Upfront planning and leveraging the right solutions, such as Google Cloud and Veeam, can make all the difference during difficult situations. Automation can take human error out of the equation during a crisis and help businesses meet their goals to survive a disaster.