The Cost of Zero RPO on Kubernetes

In this article, I will show that Portworx Metro Disaster Recovery (DR), based on storage replication, may look like an appealing way to meet a zero RPO requirement, but it is not the best choice. It leads to performance issues, increases cost and complexity, and makes your architecture more fragile. Against some disasters, such as data corruption, ransomware or a regional outage, Portworx Metro DR may not protect you at all. There are two more pragmatic approaches: 1) choosing a platform with durable snapshots to reach a very short RPO, and 2) building the requirement into your application architecture from the start to reach a real zero RPO. I will explore both alternatives in this post.

What are RPO and RTO?

RPO and RTO are two important concepts for a robust disaster recovery strategy. 

They are both expressed as a length of time, but they measure two very different things:

  1. RPO (Recovery Point Objective) is the maximum amount of data loss you can tolerate, measured as the time between the last recoverable copy of your data and the disaster.
  2. RTO (Recovery Time Objective) is the maximum time you can tolerate between the disaster and the moment the service is back up and running.

Why are we seeking zero RPO but not zero RTO?

Achieving zero or near-zero RPO is important for critical services. Imagine buying a gift to be delivered to a friend’s address for his birthday. After a few hours, you come back to the website to check the status of your order, but the website is down because of a disaster. You reload the page and see the image below:

While this situation is inconvenient, you trust that within a few minutes to a few hours, everything will be back up and running and you’ll be able to check the status of your order. If this situation does not happen too often and if this RTO time is not too long, it’s acceptable.

However, suppose that once the website is back, there is no record of your transaction. The platform does not support zero RPO, so your order is simply lost.

Now you’ll need to call customer service, which is likely flooded with similar inquiries, provide proof of payment, wait for reimbursement, and redo your transaction completely. It’s possible the gift will now arrive late.  Talk about a bad customer experience!

This example illustrates what is at stake with RPO. In critical domains such as healthcare, defense or finance, the importance of zero RPO becomes even more obvious.

How do you handle the zero RPO requirement?

There are three possible approaches to handle the zero RPO requirement:

  1. Ask your storage or data service to replicate the data synchronously to a recovery site.
  2. Don’t try to fulfill the requirement completely, but reduce the interval between backups.
  3. Ask your application architecture to implement zero RPO by saving every new record in an external service (like an S3 bucket or a Kafka broker) before processing it.

Let’s study each approach and their pros and cons.

1) Storage Replication or Data Service Replication

Storage Replication

Many storage solutions have a replication feature that is not well integrated with Kubernetes. An exception is Portworx, a solution that runs storage controllers in Kubernetes. 

When Portworx creates a persistent volume, it replicates it across three different nodes, called storage nodes. When a pod needs the volume, the Portworx scheduler extension (Stork) tries to schedule it on a node that holds the volume, or failing that, on the node closest to it.
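
To make this more concrete, here is a minimal sketch of a StorageClass that requests three replicas through the repl parameter. The class name matches the one I use in the tests below, but the parameters are only illustrative, and the provisioner string depends on whether you use the in-tree driver or the Portworx CSI driver:

# Minimal sketch of a Portworx StorageClass with a replication factor of 3.
# Check the provisioner name (in-tree vs. CSI) against your Portworx version.
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: portworx-sc
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "3"
EOF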

With this solution, read performance should be very good, as long as the pod is scheduled on a storage node that holds the volume. Write performance will be poorer, because every write has to be replicated to three nodes across the network.

If pods can’t be scheduled on a storage node that holds the volume, then both read and write performance will suffer. (I’ll come back to this later when we look at the performance tests.)

With this replication pattern, Portworx can implement a storage replication solution that goes beyond the boundary of a single cluster.

Portworx Metro DR 

Note: I won’t cover the Portworx Async DR solution here, because Portworx itself claims a 15-minute RPO at best – far from zero RPO, and an RPO you can already reach easily with a professional backup tool such as Kasten.

Portworx provides a storage replication mechanism called Metro DR, which distributes storage across multiple Kubernetes clusters. In practice, a replica of your volume can live on another cluster, so there is no need to move volumes around during a failover.

The idea is interesting because the storage architecture does not change. However, there are a few caveats:

  1. You have to manage a new component: the common key-value database (KVDB) shown in the figure above. Portworx needs an internal KVDB, such as etcd or Consul. On a single-cluster installation, Portworx recommends reusing the existing Kubernetes etcd database, but that makes no sense here: you would not want to expose that service to the second cluster, and if the primary cluster fails, its etcd database stops responding and the storage becomes unusable on the DR cluster.
  2. You need network latency below 10 ms between the nodes of the two Kubernetes clusters, which has two awkward consequences: 1) write performance suffers badly if the latency is irregular or higher than that, and 2) if your latency really is that good, your clusters are necessarily close to each other, so a disaster that takes out the primary cluster is likely to affect the secondary cluster as well (see the quick latency check after this list).
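
A quick way to sanity-check the latency constraint before enabling Metro DR is to measure the round-trip time between a node of each cluster (the host name below is a placeholder):

# Average round-trip time should stay well below 10 ms for Metro DR.
ping -c 20 node1.dr-cluster.example.com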

Besides these issues, configuration and operations are complex. Enabling Metro DR is not sufficient on its own; you also need to configure several other components.

Cluster Domain 

The two clusters share a single Portworx instance, but you still need to tell Portworx how to replicate data so that every volume exists on both clusters. For each storage node, you define which cluster_domain it belongs to, and when Portworx replicates a volume, you must make sure there are at least two replicas spread across two different cluster domains.

Cluster domains are also used to fail over an application, because you can’t have the main cluster and the DR cluster changing data in the same exposed volume. That would create a replication conflict.

Cloud Credentials, ClusterPair, Policy and Migration Schedule 

To complete the disaster recovery setup, Portworx provides a Migration Schedule, which replicates a list of namespaces (the Kubernetes resources) from the main cluster to the DR cluster without starting the workloads there. Otherwise, both clusters would write to the same exposed volume, which would create a replication conflict.

Attach a Policy to the Migration Schedule to define the frequency of this specific replication.

Next, generate a ClusterPair on the DR cluster and apply it on the main cluster; the ClusterPair is essentially the main cluster’s way of accessing the DR cluster. Finally, reference the ClusterPair in the Migration Schedule, as sketched below.
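
To give you an idea of what this looks like, here is a rough sketch of a SchedulePolicy and a MigrationSchedule as I remember them from the Stork CRDs. The names and the 5-minute interval are illustrative, the ClusterPair generated with storkctl is not shown, and the exact fields should be checked against the Portworx documentation for your version:

# Indicative sketch only: a policy that triggers a migration every 5 minutes,
# and a migration schedule that replicates the postgres namespace to the DR
# cluster without starting the applications there. Verify the CRD fields and
# the namespace placement against your Stork/Portworx version.
cat <<EOF | kubectl apply -f -
apiVersion: stork.libopenstorage.org/v1alpha1
kind: SchedulePolicy
metadata:
  name: migrate-every-5-min
policy:
  interval:
    intervalMinutes: 5
---
apiVersion: stork.libopenstorage.org/v1alpha1
kind: MigrationSchedule
metadata:
  name: postgres-migration
  namespace: postgres
spec:
  template:
    spec:
      clusterPair: remotecluster      # the ClusterPair generated on the DR cluster
      includeResources: true
      includeVolumes: false           # with Metro DR the volumes are already shared
      startApplications: false        # keep the workload stopped on the DR side
      namespaces:
        - postgres
  schedulePolicyName: migrate-every-5-min
EOF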

Two Command Line Interfaces

All of these operations are performed with two command-line tools, pxctl and storkctl, especially when you decide to fail over a namespace.
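
For example, checking the cluster domains and switching them during a failover is done with storkctl. The commands below are from memory and only indicative; verify the exact syntax against the Portworx documentation for your Stork version:

# Indicative only: check which cluster domains are active, then mark the failed
# primary domain as inactive so the DR cluster can take over, and reactivate it
# once the primary site is back. Verify the syntax for your storkctl version.
storkctl get clusterdomainsstatus
storkctl deactivate clusterdomain datacenter1
storkctl activate clusterdomain datacenter1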

The image below illustrates these elements:

Data replication adds another layer of complexity to your architecture. In the end, the result is paradoxical: in the process of providing zero RPO, you actually make the architecture more fragile.

Portworx vs. GP2 Performance Test 

In this scenario, with low latency, the performance of this solution is at best similar to that of a single Portworx installation. So it’s time to run some storage performance tests and compare it with a cloud-native solution such as AWS GP2.

For my tests, I created an EKS cluster following the Portworx instructions, which gave me three storage nodes and four storageless nodes. I installed Portworx 2.8, the default version proposed by PX-Central, on Kubernetes 1.19.

For the storage performance tests themselves, I used my favorite tool on Kubernetes: Kubestr. Kubestr does more than performance tests, but it comes in very handy when you need to quickly check the performance of a storage provisioner on Kubernetes.

Kubestr leverages fio to perform the storage test. At the end of this post, I share the fio file I used.

kubestr fio --storageclass=portworx-sc \
   --size=100Gi \
   --fiofile=default-fio

Kubestr creates a PVC using the storage class and a pod attached to it that runs the fio test on the mount point of the PVC. After each test, Kubestr cleans everything, including the pod and PV.

My tests were simple. I first ran three simultaneous Kubestr commands and collected the results, then repeated the operation with 6, 9 and 12 simultaneous runs. The charts below show the bandwidth (BW) results for random read, random write and sequential write.
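
The simultaneous runs were nothing fancy: a small loop along these lines (a sketch of the procedure, not the exact script I used) launches N Kubestr runs in parallel and waits for all of them:

# Launch N kubestr runs in parallel against the same storage class; each run
# creates its own PVC and fio pod, and the results are collected in files.
N=3
for i in $(seq 1 $N); do
  kubestr fio --storageclass=portworx-sc \
     --size=100Gi \
     --fiofile=default-fio > result-$i.txt &
done
wait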

(You can find the complete collection results in the Appendix.)

Portworx Results

In Graph 1 you can see Portworx read and write performance for 3, 6, 9 and 12 simultaneous workloads.

When running three simultaneous workloads, the read performance of Portworx is really impressive! It easily beats the read performance of GP2. But as soon as we run six simultaneous workloads, read performance is cut in half, and with 12 simultaneous workloads it barely reaches 19 000 KiB/s. Write performance also decreases, but in a fashion similar to GP2.

Due to the Portworx storage architecture, with only three workloads running on this 7-worker-node cluster, it is easy for Portworx to place each workload on the same node as its volume. In doing so, we get excellent read performance and honorable write performance (at least one replica is written very fast).

However, as soon as we increase the number of simultaneous workloads, this placement is no longer possible: it would leave the storageless nodes completely idle. Once the storage nodes are full, Portworx has to schedule workloads on storageless nodes, and the proximity between the workload and its volume is lost. That is why read performance drops.

GP2 Results 

In Graph 2 you can see GP2 read and write performance for 3, 6, 9 and 12 simultaneous workloads. Performance decreases, but in a much more regular fashion for read operations.

If we compare Portworx to GP2 for 3 simultaneous workloads, Portworx outperforms GP2 on read operations (Rand Read BW PX vs. Rand Read BW GP2), but it is the opposite on write operations (Rand Write BW PX vs. Rand Write BW GP2, and Seq Write BW PX vs. Seq Write BW GP2) (see Graph 3).

For 6 simultaneous workloads, Portworx still outperforms GP2 on read operations (Rand Read BW PX vs. Rand Read BW GP2), but the difference is no longer significant, and GP2 remains much better at writing (see Graph 4).

At 12 simultaneous workloads, GP2 outperforms Portworx on both read and write operations (see Graph 5).

I did not push the tests further, because 12 intensive stateful workloads already seems like a reasonable load for a 7-node cluster (see Graph 6).

With 3 simultaneous workloads, Portworx performed better at read operations, but GP2 performed better at write operations. At 12 simultaneous workloads, GP2 performed better at both.

Of course, you can mitigate performance issues by adding more storage nodes, but that increases costs and still may not provide the performance you need. 

Corruption Propagation

A broken cluster isn’t the only possible disaster. Your data can also be corrupted by an application bug or by ransomware, and before you even realize there is a problem, the corrupted data has already been replicated. Replication is not point-in-time, so you cannot roll back to a healthy state. This is a serious limitation of the synchronous replication approach.

Summary: The Pros and Cons of Storage Replication

Pros:

  1. Real zero RPO for infrastructure failures, without changing the application.

Cons:

  1. Write performance suffers, and read performance collapses once workloads can no longer be co-located with their volumes.
  2. Extra components (shared KVDB, cluster domains, ClusterPair, Migration Schedule, two CLIs) add complexity, cost and fragility.
  3. The sub-10 ms latency requirement keeps the clusters geographically close, so a regional disaster can hit both.
  4. No protection against data corruption or ransomware, since the corrupted data is replicated as well.

Data Service Replication

Data service replication is based on the redo logs or write-ahead logs (WAL) generated on the primary database, which are shipped to the database on the DR site.

With this approach, you face the same performance issue as with storage replication.

Several years ago, an article used a pgbench test to demonstrate the performance implications of this configuration on Postgres – and they were significant. With synchronous_commit set to off, the authors observed over twice the throughput measured with synchronous_commit=remote_apply. The most common setting, synchronous_commit=on, landed somewhere in the middle.

The impacts of each configuration will vary, depending on what sort of load you’re producing, the rate at which this load generates WAL files, the speed of the network, and the machine resources for all servers in the cluster.   
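
For reference, the behavior compared in that article is controlled by a couple of settings on the primary. This is only a minimal sketch of a synchronous streaming-replication setup; the standby name is a placeholder, and the rest of the replication configuration (replication slots, standby setup) is not shown:

# Require the standby named "dr_standby" to confirm writes, and choose how far
# the confirmation must go: on, remote_write, remote_apply, or off (asynchronous).
psql -U postgres -c "ALTER SYSTEM SET synchronous_standby_names = 'dr_standby';"
psql -U postgres -c "ALTER SYSTEM SET synchronous_commit = 'remote_apply';"
psql -U postgres -c "SELECT pg_reload_conf();"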

Data service replication vs. storage replication 

2) Increase backup frequency 

Not a Real Zero RPO

We already know that this approach is not going to give us a zero RPO. Even if you increase backup frequency, you will still have an interval, which is your RPO. But it’s interesting to see how far we can go and remain disaster-proof without impacting the storage performance.

Snapshots Are Much Faster than Export. Can You Avoid Exporting to Increase Frequency?

A fundamental rule of backup is that you should keep a local backup (which is often a snapshot) and an external copy (a full or incremental copy as an export). Local copies enable you to restore quickly if they are still available. In case of a complete disaster, the external copy enables you to restore even if your storage is completely broken. 

Is a Snapshot a Durable Backup?

The external copy involves an export, which is time-consuming and the main hurdle on the path to reducing the RPO. You might wonder: is it possible to get rid of the export phase while remaining disaster-proof? Can a snapshot be a durable backup?

Most of the time, the answer is no. For instance, on Google Cloud Platform, as soon as you delete a volume, all the snapshots you took from the volume vanish with it. 

But there are some exceptions. Here are two: 

Let’s Try! 

For this example, I’ll install Kasten K10 on EKS and trigger two 2-hour-long pgbench tests. For the first run, I won’t set any backup policy. For the second run, I’ll use a policy with a backup frequency of once every 5 minutes and see if there is any difference in the number of transactions processed.

(In the Appendix, I describe how I installed Kasten K10 on EKS and how I set up the pgbench test.) 

Running these intensive pgbench tests lets us measure performance under a heavy load of INSERT, SELECT and UPDATE transactions. 

For the run with a policy, I set up a backup every 5 minutes and retain only the last 24 backups, which gives 2 hours of coverage (24 x 5 = 120 min = 2 h).
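
I created the policy through the K10 dashboard, but for the record, here is a rough, from-memory sketch of what the equivalent Policy resource can look like. Treat the field names, in particular the sub-hourly frequency and retention, as assumptions to verify against the Kasten documentation for your version:

# Rough sketch only: back up the postgres-gp2 namespace every 5 minutes and keep
# the last 24 restore points. Verify the field names against the K10 docs.
cat <<EOF | kubectl apply -f -
apiVersion: config.kio.kasten.io/v1alpha1
kind: Policy
metadata:
  name: postgres-every-5-min
  namespace: kasten-io
spec:
  frequency: '@hourly'
  subFrequency:
    minutes: [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
  retention:
    hourly: 24
  actions:
    - action: backup
  selector:
    matchExpressions:
      - key: k10.kasten.io/appNamespace
        operator: In
        values:
          - postgres-gp2
EOF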

Results 

Without backup, I get around 825 transactions per second:

starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 100
number of threads: 1
duration: 7200 s
number of transactions actually processed: 5945865
latency average = 121.119 ms
tps = 825.635690 (including connections establishing)
tps = 825.636295 (excluding connections establishing)

With a backup every 5 minutes, I lose 1.8% of the transactions per second:

starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 100
number of threads: 1
duration: 7200 s
number of transactions actually processed: 5838560
latency average = 123.372 ms
tps = 810.555473 (including connections establishing)
tps = 810.556049 (excluding connections establishing)

Let’s confirm by replacing 5 minutes with 15 minutes.

With a backup every 15 minutes, I’m almost in the same situation as with no backup. 

starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 100
query mode: simple
number of clients: 100
number of threads: 1
duration: 7200 s
number of transactions actually processed: 5932622
latency average = 121.364 ms
tps = 823.966355 (including connections establishing)
tps = 823.966904 (excluding connections establishing)

Summary: Pros and Cons of Frequent Backups

Pros:

  1. Simple to set up with a backup tool, with almost no impact on performance (a 1.8% drop at a 5-minute frequency in my test).
  2. Each backup is a point-in-time restore point, which also protects against corruption and ransomware.

Cons:

  1. Not a real zero RPO: the backup interval remains your RPO.
  2. Very short intervals require snapshot durability (and ideally snapshot replication) from your platform.

As I said at the beginning of this section, I already knew this approach would not reach zero RPO. But when snapshot durability and snapshot replication are available, you can achieve a 5-minute RPO, which is sufficient for many use cases.

3)  Implement Zero RPO in Your Architecture

The third approach is exemplified for AWS by Carl Ward (an AWS architect) in the blog post “A pragmatic approach to RPO zero.”

The general idea is that all user transactions should first be committed to durable external storage (such as an S3 bucket) or a streaming service (such as Kafka) to guarantee zero RPO. Once that write is acknowledged, your application processes this data in batches and fills the internal databases.

If anything goes wrong, you ask your batch to re-process the data and refill all your databases; you can even ask the batch to stop at a precise point in time.

Even better, you can restore the database from your most recent valid backup and ask the batches to restart from the corresponding offset. This way, you drastically reduce the time it takes to rebuild the data, thereby reducing your RTO. And unlike storage or data replication, this approach does not degrade performance inside your applications.
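
As a very rough illustration of the pattern, the flow boils down to “append the raw order to a durable log first, process it later, replay from an offset if you need to rebuild.” The topic name, offset and record below are placeholders, using the standard Kafka console tools rather than a real producer or consumer application:

# 1) Every new order is appended to a durable topic before any processing.
echo '{"orderId": "1234", "item": "gift"}' | \
  kafka-console-producer.sh --bootstrap-server kafka:9092 --topic orders
# 2) In normal operation, a batch consumes the topic and fills the databases.
# 3) After a disaster, restore the databases from the last valid backup, then
#    replay the topic from the offset matching that backup to rebuild the rest.
kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic orders \
  --partition 0 --offset 42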

On paper, this solution looks appealing, but routing every new user entry through Kafka is challenging in practice: you must carefully segregate new data from not-yet-processed data, and data processing must be idempotent.

Imagine a front-end application that needs the database to build a form. Even if the front-end can read the database, it can no longer write to it directly: every write must go through Kafka, and the batch creates the record.

As you can see, the data flow follows a more complex path. It is doable, but it adds another layer of complexity and changes some of your processes.

Transaction Boundaries

Another big advantage of this approach is that it respects transaction boundaries. 

The other approaches are unaware of transaction boundaries that span multiple databases: at a given moment, a transaction may have reached volume 1 and volume 2 but not volume 3, and some data may still sit in memory, not yet flushed to storage. If you snapshot or replicate the data at that moment, the restored application can be inconsistent.

But with this approach, you “replay” the user transactions during restore, so you reduce the risk of inconsistencies.

Summary: Pros and Cons of Implementing Zero RPO in Your Architecture

Pros:

  1. Real zero RPO that does not depend on the storage layer and does not degrade performance through replication.
  2. Transaction boundaries are respected, because user transactions are replayed during restore.
  3. Restoring the last valid backup and replaying from the corresponding offset keeps the RTO short.

Cons:

  1. The pattern must be designed into the application architecture from the beginning.
  2. Data flows become more complex, and all processing must be idempotent.

Conclusion 

Implementing a real zero RPO is challenging, especially if you want to maintain performance, the integrity of your architecture, and the user experience. 

In my opinion, building zero RPO on a data replication solution such as Portworx is a bad idea: it brings more complexity, cost and performance issues than the data-loss risk it removes. A pragmatic approach is often to put your energy into security, monitoring and high availability to reduce the risk of disaster, and to opt for a short RPO rather than a zero RPO.

If you really need a zero RPO, it’s important to identify which component really needs it. For example, for an online shop, the order system may need a real zero RPO but not the catalog system or the delivery management system. For the latter, a short RPO is enough. If after a recovery an order is present in the order system but not in the delivery system, you may be able to reconcile by other means.

Appendix

Fio file used for performance test:

[global]
randrepeat=0
verify=0
ioengine=libaio
direct=1
gtod_reduce=1
[job1]
name=read_iops
bs=4K
iodepth=64
size=2G
readwrite=randread
time_based
ramp_time=2s
runtime=15s
[job2]
name=write_iops
bs=4K
iodepth=64
size=2G
readwrite=randwrite
time_based
ramp_time=2s
runtime=15s
[job3]
name=read_bw
bs=128K
iodepth=64
size=2G
readwrite=randread
time_based
ramp_time=2s
runtime=15s
[job4]
name=write_bw
bs=128k
iodepth=64
size=2G
readwrite=randwrite
time_based
ramp_time=2s
runtime=15s
[job5]
name=seq_write_bw
bs=128k
iodepth=64
size=2G
readwrite=write
time_based
ramp_time=2s
runtime=15s

Portworx performance collection:

GP2 performance collection:

Run pgbench test: 

#install kasten
kubectl create ns kasten-io
helm install k10 kasten/k10 --namespace=kasten-io \
   --set secrets.awsAccessKeyId="${AWS_ACCESS_KEY_ID}" \
   --set secrets.awsSecretAccessKey="${AWS_SECRET_ACCESS_KEY}"

#install postgres
helm repo add bitnami https://charts.bitnami.com/bitnami
kubectl create ns postgres-gp2
helm install postgres bitnami/postgresql \
 --namespace postgres-gp2  \
 --set global.storageClass=gp2

# check you can do a manual snapshot of the application with kasten
kubectl --namespace kasten-io port-forward service/gateway 8080:8000

# If successful launch a pgbench test
export POSTGRES_PASSWORD=$(kubectl get secret postgres-postgresql \
     --namespace postgres-gp2 \
     -o jsonpath="{.data.postgresql-password}" \
     | base64 --decode)

# create the pgbench database
kubectl exec -n postgres-gp2 postgres-postgresql-0 -it -- /bin/sh -c "PGPASSWORD=$POSTGRES_PASSWORD psql -U postgres -h 127.0.0.1 -p 5432"
> create database pgbench;
> exit
# populate pgbench data database with a scaling factor of 100
kubectl -n postgres-gp2 run pgbench-init \
 --env="PGPASSWORD=$POSTGRES_PASSWORD" \
 --restart=Never -it \
 --image=xridge/pgbench \
 -- --host postgres-postgresql -U postgres -p 5432 -i -s 100 pgbench
# Run the test for 2h and 100 clients
kubectl -n postgres-gp2 run pgbench-2h \
 --env="PGPASSWORD=$POSTGRES_PASSWORD" \
 --restart=Never \
 --image=xridge/pgbench \
 -- --host postgres-postgresql -U postgres -p 5432 -c 100 -T 7200 pgbench

# check the result
kubectl -n postgres-gp2 logs pgbench-2h

# create a kasten policy every 5 minutes and relaunch pgbench
kubectl -n postgres-gp2 run pgbench-2h-with-kasten-policy \
 --env="PGPASSWORD=$POSTGRES_PASSWORD" \
 --restart=Never \
 --image=xridge/pgbench \
 -- --host postgres-postgresql -U postgres -p 5432 -c 100 -T 7200 pgbench

# check the result
kubectl -n postgres-gp2 logs pgbench-2h-with-kasten-policy

Interested in investing in a Kubernetes native backup and ransomware data protection solution? Discover Veeam Kasten, the free #1 Kubernetes backup here!
