Backing up your Kubernetes applications is an essential part of the development process. When you run a backup, an import, or a restore manually, the results are instantly available: you simply check your dashboard to see whether the operation succeeded. But when a backup runs as part of a daily policy, it’s not as transparent. Everything runs in the background while you’re busy with other tasks, and if anything goes wrong, you won’t know about it until you finally take the time to check your dashboard. You may think everything worked the way it was supposed to when, in reality, your applications are no longer protected!
Backups fail for a number of reasons. Here are just a few common ones:
- Credentials change
- Storage failure (system full or quota exceeded)
- Pods stuck in an error state without your knowledge
- Backup target moved or deleted
- Network failure
- Misconfiguration
- Timeouts caused by unexpected storage growth
Because backups can fail for so many different reasons, we recommend implementing alerts when you move to production. While developers use different alerting systems and there’s no one generic way to set up alerts end-to-end, with a few configuration changes, you can implement alerts that go directly to your incident system.
In this blog post, I’ll walk you through how to set up alerts to be sent directly to a Slack channel. The same steps can be followed to send alerts via email, or to another system such as ServiceNow, PagerDuty, or Jira.
Implementation
Kasten K10 ships with a Prometheus instance that stores Kasten metrics. We use those metrics to produce graphs of storage usage, as well as to detect whether or not a backup failed. The image below shows how alerts appear in the Slack channel:
In the following example, we’ll configure Alert Manager (a Prometheus component) to send alerts to a Slack channel in four easy steps:
1) Create a New Prometheus Instance and Federate It to the Instance in kasten-io
When you do this, do not modify the Prometheus configuration in the kasten-io namespace, because that instance is managed by Helm and not intended to be modified manually. Instead, use the federation URL of the Kasten Prometheus instance to spin up a Prometheus instance in another namespace:
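Before wiring anything up, you can confirm that the federation endpoint responds. This is a quick sketch, assuming K10 runs in the default kasten-io namespace and its prometheus-server service listens on port 80:

```shell
# Expose the K10 Prometheus service locally (assumes the default
# kasten-io namespace and service name).
kubectl -n kasten-io port-forward svc/prometheus-server 9091:80 &
sleep 2
# Ask the federation endpoint for the catalog metrics; a few lines
# of metric samples means federation is available.
curl -s --get 'http://localhost:9091/k10/prometheus/federate' \
  --data-urlencode 'match[]={__name__=~"catalog.*"}' | head
kill %1
```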
2) Enable WebHooks in Slack
WebHooks in Slack allow you to send messages to a Slack channel, which is what you will configure Alert Manager to do.
First, create a Slack channel, then add the Incoming WebHooks app from the Apps menu. Be sure to take note of the webhook URL; you’ll need it later:
Once that’s done, you’ll see something like this in Slack:
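You can also verify the webhook works before touching Alert Manager. A minimal check, with <slack_api_webhook_url> standing in for the URL you just noted:

```shell
# Post a test message to the channel; Slack replies with "ok" on success.
curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "Alert Manager test message"}' \
  '<slack_api_webhook_url>'
```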
3) Configure Prometheus and Alert Manager
Create the monitoring namespace:
kubectl create ns monitoring
Then add the Prometheus community helm chart:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
Now, let’s configure a Prometheus instance with:
- Targets to scrape the metrics exposed by the Kasten Prometheus instance (the “federate” arrow).
- Alert Manager with a Slack receiver (the “Receive” arrow).
- All of the other metrics that Prometheus usually scrapes in a Kubernetes cluster disabled.
Create kasten_prometheus_values.yaml, replacing <slack_api_webhook_url> with the webhook URL you just obtained:
cat <<EOF > kasten_prometheus_values.yaml
defaultRules:
  create: false
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      # If an alert has been sent successfully, wait 'repeat_interval'
      # before resending it.
      repeat_interval: 30m
      # The default receiver
      receiver: "slack-notification"
      routes:
      - receiver: "slack-notification"
        match:
          severity: kasten
    receivers:
    - name: "slack-notification"
      slack_configs:
      # Configure incoming webhooks in Slack
      # (https://slack.com/intl/en-in/help/articles/115005265063-Incoming-webhooks-for-Slack)
      - api_url: '<slack_api_webhook_url>'
        channel: '#channel' # The channel the alerts should be sent to
        text: "{{ range .Alerts }}<!channel> {{ .Annotations.summary }}\n{{ .Annotations.description }}\n{{ end }}"
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    # Federation configuration for consuming metrics from the K10 prometheus-server
    - job_name: k10
      scrape_interval: 15s
      honor_labels: true
      scheme: http
      metrics_path: '/k10/prometheus/federate'
      params:
        'match[]':
        - '{__name__=~"jobs.*"}'
        - '{__name__=~"catalog.*"}'
      static_configs:
      - targets:
        - 'prometheus-server.kasten-io.svc.cluster.local'
        labels:
          app: "k10"
# The values below disable components that are not required here.
# Adjust them based on your requirements.
grafana:
  enabled: false
kubeApiServer:
  enabled: false
kubelet:
  enabled: false
kubeStateMetrics:
  enabled: false
kubeControllerManager:
  enabled: false
kubeEtcd:
  enabled: false
kubeProxy:
  enabled: false
coreDns:
  enabled: false
kubeScheduler:
  enabled: false
EOF
Finally, install Prometheus with this configuration:
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring -f kasten_prometheus_values.yaml
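Once the release settles, it’s worth checking that the federation target is healthy. A sketch, assuming the default service name created by the chart (the same one used for port forwarding later in this post) and that jq is installed:

```shell
# All pods in the monitoring namespace should reach Running.
kubectl -n monitoring get pods
# Check that the k10 scrape job is up via the Prometheus targets API.
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 &
sleep 2
curl -s 'http://localhost:9090/api/v1/targets?state=active' \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
kill %1
```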
4) Create a Rule
The final step is to create a PrometheusRule CR to configure the alerts:
cat << EOF | kubectl -n monitoring create -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: kube-prometheus-stack
    release: prometheus
  name: prometheus-kube-prometheus-kasten.rules
spec:
  groups:
  - name: kasten_alert
    rules:
    - alert: KastenJobsFailing
      expr: |-
        increase(catalog_actions_count{status="failed"}[10m]) > 0
      for: 1m
      labels:
        severity: kasten
      annotations:
        summary: "More than one failed K10 job for policy {{ \$labels.policy }} in the last 10 min"
        description: "The {{ \$labels.policy }} policy run for the application {{ \$labels.app }} failed in the last 10 min"
EOF
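To confirm the rule was picked up, you can list the loaded rules through the Prometheus API (again assuming the chart’s default service name and that jq is available):

```shell
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 &
sleep 2
# "KastenJobsFailing" should appear among the loaded rule names.
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[].name'
kill %1
```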
All done! Now, it’s time to test the configuration.
Testing the Alert
There are three steps to testing to make sure your alerts are working.
1) Create a Fail Condition
First, change the configuration of the AKS cluster and set a wrong client secret. You can do the same thing with the AWS secret access key, or, if you are using CSI, by removing the Kasten annotation on the VolumeSnapshotClass. (There are other ways to trigger a failure, but this one is really simple):
helm upgrade k10 kasten/k10 -n kasten-io -f azure_val.yaml
Once that’s done, run a backup. It should fail quickly:
2) Check Alert Manager
Port forward the Prometheus dashboard:
kubectl port-forward service/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
Then, navigate to http://localhost:9090/alerts. Under the Alert Manager tab, you should see your alert in a pending or firing state:
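You can also query Alert Manager directly instead of using the UI. A sketch, assuming the default Alertmanager service name created by the chart:

```shell
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-alertmanager 9093:9093 &
sleep 2
# Active alerts, including the severity=kasten label set by our rule.
curl -s http://localhost:9093/api/v2/alerts | jq '.[].labels'
kill %1
```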
3) Check Slack
Shortly after that, you should receive a notification in Slack. So, go check your Slack channel! If you see the alert, it’s working properly:
Conclusion
Now that you know how to receive alerts in Slack, you can safely run backups in the background as part of your daily policy, and be confident your data and applications are protected at all times.
Try Kasten K10 for yourself, for free today.
This article was co-authored by Jaiganesh Karthikeyan, Senior Software Engineer, InfraCloud Technologies.