Creating your own custom alerts
Overview
Alertmanager allows you to define your own alert conditions based on Prometheus expression language expressions.
The aim of this document is to provide you with the necessary information to create and send application-specific alerts to a Slack channel of your choosing.
Prerequisites
This guide assumes the following:
Create a Slack webhook, and amend Alertmanager
- Create a Slack webhook: set up an incoming webhook if you don't already have one for the channel you want to send the alerts to.
- Follow the steps in "Set up incoming webhooks" (starting with "Create an app", then "From scratch").
Create a Kubernetes secret to store the Slack webhook URL.
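A minimal sketch of such a secret is shown below; the secret name, key name and placeholder value are illustrative assumptions, not required values:
apiVersion: v1
kind: Secret
metadata:
  # Illustrative name - use the name you quote in your Alertmanager ticket
  name: slack-webhook-url
  namespace: <namespace>
type: Opaque
stringData:
  # The incoming webhook URL generated by Slack in the previous step
  url: <your-slack-webhook-url>
If your namespace configuration lives in a public repository, avoid committing the real webhook URL; apply a secret like this directly with kubectl instead.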
Create a ticket to request a new alert route in Alertmanager. The Cloud Platform team will need the following information:
- namespace name
- team name
- application name
- Slack channel
- Kubernetes secret name where the webhook URL is stored
- severity level (warning/information)
If you want an event/information type Slack message for monitoring non-problem/non-failure events (e.g. your team wants positive confirmation that something happened, such as a database refresh or an app deployment), specify the severity level as information.
The Cloud Platform team will provide you with a "custom severity level" that will need to be defined in the next step. Please copy it to your clipboard.
Create a PrometheusRule
A PrometheusRule is a custom resource that defines your alert. This file contains the alert name, the PromQL expression and how long the condition must hold before the alert fires.
To create your own custom alert you'll need to fill in the template below and deploy it to your namespace (tip: you can list the rules already in your namespace by running kubectl get prometheusrule -n <namespace>).
- Create a file called prometheus-custom-rules-<application_name>.yaml
- Copy in the template below and replace the bracketed values, specifying the requirements of your alert. The <custom_severity_level> field is the value you were given earlier by the Cloud Platform team.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  namespace: <namespace>
  labels:
    role: alert-rules
  name: prometheus-custom-rules-<application_name>
spec:
  groups:
  - name: application-rules
    rules:
    - alert: <alert_name>
      expr: <alert_query>
      for: <check_time_length>
      labels:
        severity: <custom_severity_level>
      annotations:
        message: <alert_message>
        runbook_url: <http://my-support-docs>
        dashboard_url: <http://my-dashboard>
- Run:
kubectl apply -f prometheus-custom-rules-<application_name>.yaml -n <namespace>
A working example of this would be:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  namespace: test-namespace
  labels:
    role: alert-rules
  name: prometheus-custom-rules-my-application
spec:
  groups:
  - name: node.rules
    rules:
    - alert: Quota-Exceeded
      expr: 100 * kube_resourcequota{job="kube-state-metrics",type="used",namespace="monitoring"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) > 90
      for: 5m
      labels:
        severity: cp-team
      annotations:
        message: Namespace {{ $labels.namespace }} is using {{ printf "%0.0f" $value}}% of its {{ $labels.resource }} quota.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/runbook.md#alert-name-kubequotaexceeded
        dashboard_url: https://grafana.live.cloud-platform.service.justice.gov.uk/d/application-alerts/application-alerts?orgId=1&var-namespace=cloud-platform-reference-app
The alert name and the message, runbook_url and dashboard_url annotations will be sent to the Slack channel when the rule has been triggered.
For information type alerts, you can also add a status_icon: information label to have an information icon on the alert title.
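As a sketch, an information rule entry using this label might look like the following (placeholder values as in the template above; the info- severity prefix is explained in the "Information type alerts" section below):
    - alert: <alert_name>
      expr: <alert_query>
      for: <check_time_length>
      labels:
        severity: info-<custom_severity_level>
        # Optional: shows an information icon on the Slack alert title
        status_icon: information
      annotations:
        message: <alert_message>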
You can view the applied rules with the following command:
kubectl -n <namespace> describe prometheusrules prometheus-custom-rules-<application_name>
Information type alerts
If you want to use the same Slack channel and webhook to send information type alerts, you can do so by adding the label severity: info-<custom_severity_level> to the PrometheusRule.
If the custom severity level provided is cp-team, an example of this would be:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  namespace: test-namespace
  labels:
    role: alert-rules
  name: prometheus-custom-rules-secretsmanager
spec:
  groups:
  - name: application-rules
    rules:
    - alert: SecretsManagerPutSecretValue
      expr: secretsmanager_put_secret_value_sum{exported_job="secretsmanager", secret_id="arn:aws:secretsmanager:eu-west-2:754256621582:secret:<your-secret-arn>"} > 0
      for: 1m
      labels:
        severity: info-cp-team
      annotations:
        message: |
          {{ $labels.secret_id }} has had {{ $value }} PutSecretValue operations recently.
          {{ $labels.user_arn }} has had {{ $value }} PutSecretValue operations recently.
        runbook_url: <runbook_url>
        dashboard_url: <dashboard_url>
PrometheusRule examples
If you're struggling for ideas on which alerts to set up and how, please see some examples here.
Advisory Note 1: If Prometheus gets re-installed
In the event of a serious failure, or for a required upgrade, it may be necessary to re-install Prometheus.
PrometheusRule is a CRD that declaratively defines a desired Prometheus rule. If for any reason Prometheus has to be re-installed, all PrometheusRules are removed with the CRD.
We recommend that all PrometheusRules are added to your namespace folder in the Environments Repo. This will ensure all rules are applied/present at all times.
PrometheusRules can still be tested/amended/applied manually; a PR can then be created to add them to the Environments Repo when ready.
Advisory Note 2: CPUThrottlingHigh Alert
The CPUThrottlingHigh alert is configured as part of the default rules when installing prometheus-operator.
This alert can fire for containers that have low CPU limits and/or spiky workloads with very low average usage, because CPU throttling can kick in during those spikes. CPU usage against limits is measured using the kernel's CFS (Completely Fair Scheduler) quotas.
If you think this may be causing an issue with your application, we recommend raising your CPU limit, whilst keeping the container CPU request as close to the 95th-percentile average usage as possible.
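As a rough sketch (the figures are purely illustrative assumptions, not recommended values), the container resources block might look like this:
    # Part of a Deployment/Pod container spec - values are illustrative only
    resources:
      requests:
        cpu: 100m   # keep close to the container's observed 95th-percentile average usage
      limits:
        cpu: 1000m  # raised to leave headroom for short spikes and reduce throttling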