What is the Best Way to Protect your Kubernetes Cluster Against Disaster?

I have been working in the Kubernetes Data Protection Working Group (WG) for four years. As part of SIG-Storage, this WG covers the backup and recovery of Kubernetes and cloud-native workloads. A frequent question brought to the WG is, “how do I back up my etcd database?”

Often, the requestor ends up wanting something other than the direct answer to this simple question. I am going to say it up-front: Protecting your Kubernetes cluster by backing up and restoring the etcd database is not a great solution. There are better ways to protect your cluster, so let us examine what is really needed and how to best solve it.

Kubernetes consists of a set of resources and controllers that respond to changes in your resources by acting and updating the relevant resources with the results. These resources reside in the Kubernetes API server, which commonly uses the etcd distributed database for storage.
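To make that pattern concrete, here is a minimal Go sketch (using client-go; the kubeconfig path and the "default" namespace are assumptions on my part) that watches the API server for Pod changes, the same event stream controllers consume to react and reconcile:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials from the kubeconfig pointed at by $KUBECONFIG.
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Watch Pods in the "default" namespace. Every event reflects a change
	// to a resource held by the API server and persisted in etcd.
	w, err := clientset.CoreV1().Pods("default").Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		if pod, ok := event.Object.(*corev1.Pod); ok {
			fmt.Printf("%s: pod %s/%s\n", event.Type, pod.Namespace, pod.Name)
		}
	}
}
```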

A key difference between Kubernetes and traditional systems is that Kubernetes is not self-contained. It interacts with and controls outside resources, which is quite different from traditional systems like applications installed inside a virtual machine (VM). In traditional systems, backing up the VM disks effectively backed up the entire application because there was nothing external connected to it. In other words, the entire application was deployed as a single, monolithic VM without external dependencies. Restoring the disks to a point-in-time meant restoring the application to that point-in-time. To capture an entire Kubernetes cluster, external entities such as worker nodes, load balancers, virtual disks, databases, and many others need to be protected too. This includes the relationship between all your Kubernetes resources and the external entities — it is a distributed system.

Loss of etcd means the loss of all your Kubernetes resources – which means the Kubernetes cluster is no longer functional. Some applications may continue to function with the Kubernetes control plane offline, but you would not be able to make changes or fix problems.

Recovering from the loss of the etcd database is a critical capability for anyone running Kubernetes in production, and you need to have a plan for dealing with that situation. The opening question now evolves to, “how can I back up and restore etcd?” Let us dig into the potential sources of loss and match them to the correct solution to protect your Kubernetes cluster.

How Can the etcd Database/API Server be Lost?

There are four major sources of failure:

  1. Storage issue: The storage that the etcd database runs on is compromised. While etcd is highly resilient to failure because it spreads its data across multiple nodes, you must also ensure that the storage itself is properly distributed. For example, if all the volumes for the etcd database are hosted in the same disk array, that disk array becomes a single point of failure. In a cloud deployment, spreading etcd across multiple availability zones is necessary to avoid complete failure.
  2. Corruption: If the etcd database becomes corrupted, the cluster will become unusable.
  3. Human error: Deletion of critical resources or mistakes during patching can leave the cluster in a non-functional state.
  4. Loss of quorum: etcd requires that a quorum of nodes always be available. If quorum is lost, the etcd database stops serving requests and requires manual intervention to bring it back online (the short sketch after this list shows the quorum arithmetic).
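The quorum arithmetic behind item 4 is plain Raft majority math, nothing vendor-specific. This small Go sketch shows the quorum size and failure tolerance for typical cluster sizes:

```go
package main

import "fmt"

func main() {
	// A cluster of n members needs floor(n/2)+1 members up to serve writes
	// and therefore tolerates floor((n-1)/2) member failures.
	for _, n := range []int{1, 3, 5} {
		quorum := n/2 + 1
		tolerated := (n - 1) / 2
		fmt.Printf("%d members: quorum=%d, tolerates %d failure(s)\n", n, quorum, tolerated)
	}
}
```

This is why three- and five-member etcd clusters are the norm: one member gives you no failure tolerance at all, and even numbers add no tolerance over the next-lower odd count.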

Ways to Recover the Kubernetes State in the API Server

Reapply the Resources

This is the classic Kubernetes answer, sometimes referred to as “GitOps.” If all your resource definitions have been stored properly as YAML files in git or another version control system, you can create a new Kubernetes cluster, re-apply the resources, and Kubernetes can restart the applications in your cluster.
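As a sketch of what “re-apply the resources” amounts to, the following Go program reads manifests and applies them through the API server. The manifests/ directory, one object per file, and the gitops-restore field manager name are all assumptions for illustration; real GitOps tools such as Argo CD or Flux do this far more robustly:

```go
package main

import (
	"context"
	"log"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/restmapper"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/yaml"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	disc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Build a mapper so each manifest's kind can be turned into a REST resource.
	groups, err := restmapper.GetAPIGroupResources(disc)
	if err != nil {
		log.Fatal(err)
	}
	mapper := restmapper.NewDiscoveryRESTMapper(groups)

	// One object per file, for simplicity; real repos hold multi-document YAML.
	files, _ := filepath.Glob("manifests/*.yaml")
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			log.Fatal(err)
		}
		obj := &unstructured.Unstructured{}
		if err := yaml.Unmarshal(data, obj); err != nil {
			log.Fatalf("%s: %v", f, err)
		}
		gvk := obj.GroupVersionKind()
		mapping, err := mapper.RESTMapping(gvk.GroupKind(), gvk.Version)
		if err != nil {
			log.Fatalf("%s: %v", f, err)
		}
		// Server-side apply: create the object or update it to match git.
		_, err = dyn.Resource(mapping.Resource).Namespace(obj.GetNamespace()).
			Apply(context.Background(), obj.GetName(), obj,
				metav1.ApplyOptions{FieldManager: "gitops-restore", Force: true})
		if err != nil {
			log.Fatalf("%s: %v", f, err)
		}
		log.Printf("applied %s %s", gvk.Kind, obj.GetName())
	}
}
```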

However, any state not stored in git will be lost. For example, it is common to have a database or other data service running in your cluster that stores data in dynamically provisioned persistent volumes. When your resource definitions are reapplied, those persistent volumes are re-created empty, and your database along with them. If your application keeps state in the Kubernetes API server and that configuration was not kept in git (e.g., your config maps or custom resources were updated after deployment), that state will also be lost when the resource definitions are reapplied.

As Kubernetes evolves, state has also begun creeping in through ways that are not always obvious. Consider a Kubernetes cluster that uses a cloud database service. Originally, the database was created manually, and the endpoint and credentials were stored in git. However, the system has now outgrown a single database and Kubernetes cluster. A Kubernetes operator was introduced to deploy and manage the cloud database, and the information in git was changed to make the database operator and database custom resources authoritative for database configuration. The operator now handles the creation of your external database and stores the endpoint and credentials in the Kubernetes cluster. These values are unique to each cluster and no longer in git! In this example, GitOps can no longer restore the cluster to its working state. GitOps is still valuable for managing your configuration and rolling out upgrades — it is just not a disaster recovery (DR) solution.

The next challenge is understanding your use cases. Reapplying resource definitions serves two distinct purposes: creating a new cluster and recovering an existing one. When the cluster is truly stateless, there is no real difference between the two. However, once state has been introduced, recovering an existing cluster usually means you want the state stored by the cluster to be recovered as well. Even if all state is external to Kubernetes, you will want the cluster to reattach to the existing external entities instead of creating new, empty ones. This is a vastly different scenario from creating a new cluster, and simply reapplying resource definitions will not give you the results you want.

Backup of etcd via etcd Volume Backup or etcd Backup/Restore Tools

This approach dumps the state of the etcd database so you can bring it back to the state it was in at the time of the backup. All state stored in etcd will be backed up and restored, including Kubernetes state that was not kept in git. While this may seem like thorough protection, the nature of Kubernetes makes it fragile, even for the Kubernetes resources themselves. The cluster can only be recovered if the persistent volumes, worker nodes, and all other external state remain the same and can be readopted properly. Kubernetes is dynamic, and changes can happen at any time; if anything has changed since the backup, recovering the cluster from the etcd backup will result in a broken cluster.
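For reference, this is roughly what such a dump looks like programmatically: a minimal Go sketch using the etcd v3 client (go.etcd.io/etcd/client/v3) to stream a snapshot to a file, the programmatic equivalent of etcdctl snapshot save. The endpoint is an assumption, and a production cluster would also need TLS configured:

```go
package main

import (
	"context"
	"io"
	"log"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"}, // assumed etcd endpoint
		DialTimeout: 5 * time.Second,
		// Production clusters also need client TLS certificates configured here.
	})
	if err != nil {
		log.Fatalf("connect to etcd: %v", err)
	}
	defer cli.Close()

	// Stream a consistent snapshot of the entire keyspace.
	rc, err := cli.Snapshot(context.Background())
	if err != nil {
		log.Fatalf("start snapshot: %v", err)
	}
	defer rc.Close()

	f, err := os.Create("etcd-backup.db")
	if err != nil {
		log.Fatalf("create file: %v", err)
	}
	defer f.Close()

	if _, err := io.Copy(f, rc); err != nil {
		log.Fatalf("write snapshot: %v", err)
	}
	log.Println("snapshot written to etcd-backup.db")
}
```

Restoring such a snapshot rebuilds the database, but as discussed above, nothing guarantees that the rest of the cluster still matches what the snapshot expects.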

So, what kind of changes can happen? Pods can move between worker nodes, and the volumes they use are then mounted on different nodes. The number of volumes may change, and worker nodes can be added to or removed from the cluster.

The etcd approach is limited in other ways as well. You cannot restore a single namespace or other subset of the cluster, and rolling back to an older etcd backup usually will not work. You also cannot duplicate the cluster, even for testing, because restoring the etcd database into another cluster will not work.

Furthermore, many managed Kubernetes offerings do not allow direct access to the etcd database anymore.

Backing up API Server Resources with External State

Most Kubernetes backup products, including Veeam Kasten for Kubernetes, take this approach because it is the most flexible and covers the most scenarios. Resources are read through the API server and backed up. External data stores such as persistent volumes are recognized during the backup process and captured as well, along with the application state stored in the API server. On restore, because the backup application works at the resource level, it understands the resources and can transform them as necessary. This allows DR, rollback, and clone/migration cases to be handled. Pieces of the cluster can also be restored individually: a single application, a namespace, or items selected by a label.
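To give a feel for the resource-level approach, here is a minimal Go sketch that walks the API server with the discovery and dynamic clients and dumps every listable resource in an assumed demo-app namespace as JSON. This is only the resource-capture half; a real product such as Veeam Kasten also snapshots persistent volumes and transforms resources on restore:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"os"
	"slices"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	disc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Every namespaced resource type the API server advertises.
	lists, err := disc.ServerPreferredNamespacedResources()
	if err != nil && len(lists) == 0 {
		log.Fatal(err) // discovery can partially fail; bail only if we got nothing
	}

	enc := json.NewEncoder(os.Stdout)
	for _, list := range lists {
		gv, err := schema.ParseGroupVersion(list.GroupVersion)
		if err != nil {
			continue
		}
		for _, res := range list.APIResources {
			// Skip subresources and anything that cannot be listed.
			if strings.Contains(res.Name, "/") || !slices.Contains(res.Verbs, "list") {
				continue
			}
			items, err := dyn.Resource(gv.WithResource(res.Name)).
				Namespace("demo-app"). // assumed namespace
				List(context.Background(), metav1.ListOptions{})
			if err != nil {
				continue // skip resources this identity cannot read
			}
			for _, item := range items.Items {
				enc.Encode(item.Object) // one JSON document per resource
			}
			fmt.Fprintf(os.Stderr, "captured %d %s\n", len(items.Items), res.Name)
		}
	}
}
```

Because each object is captured independently, a restore tool built this way can filter by namespace or label and transform fields (node names, volume references, endpoints) before re-creating the resources, which is exactly what makes the rollback and clone/migration cases possible.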

Which One Should You Use?

Backing up API Server Resources with External State

This covers most cases: you can survive the loss of datacenters and storage, and recover from ransomware and mistakes. You can recover into an existing cluster or create a new cluster, and roll back to older configurations of the cluster. Plus, all the data stored in your cluster (not just the data in etcd) can be protected. We recommend that you protect your clusters with a purpose-built tool like Veeam Kasten that uses this approach. This method backs up everything an etcd dump would protect and more, effectively enabling application recovery that works with any cluster and in many other scenarios. Plus, it is compatible with GitOps workflows.

GitOps

If your application is truly stateless, GitOps can be your recovery plan. For this to work, you need to be diligent about keeping your application truly stateless and ensuring that your recovery plan will work as expected. Outside of DR, GitOps is a good practice to have and should be considered even if you have a separate backup/restore system in place.

etcd Database Backup

There are very few cases where this is a good idea, because it is not a comprehensive approach. If your only concern is the immediate DR of your cluster, etcd backup/restore can get you back to a working state quickly. However, you will need to be deeply knowledgeable about Kubernetes and able to stitch the restored database back into a working cluster. If you have any state outside of your etcd database, you run a real risk of losing it. And if you do not take backups of your etcd database after every change to the cluster, you can wind up in a broken state; as Kubernetes applications become more sophisticated, clusters change more and more often.

Protect Your Kubernetes Cluster the Right Way

We recommend you use a tool designed for protecting Kubernetes clusters rather than trying to handle it with the etcd dump approach. Remember that it is not the backup but the restore where you will feel the pain. Test your recovery plan and make sure you can recover from a variety of scenarios. Hopefully, you will never have to recover from a disaster, but most of us will, and it never happens at a convenient time. Making sure your recovery process works smoothly will make you much happier when you need it.

To learn more about data protection workflows for Kubernetes, check out this white paper by the Kubernetes Data Protection Working Group, available free on GitHub. After reading the paper, you will be an expert in understanding why Kubernetes data protection is needed, what’s currently available, and what functionalities are missing in Kubernetes to support data protection.
