
Creating your own custom alerts

Overview

Alertmanager allows you to define your own alert conditions based on Prometheus expression language (PromQL) expressions.

The aim of this document is to provide you with the necessary information to create and send application-specific alerts to a Slack channel of your choosing.

Prerequisites

This guide assumes the following:

  • You have created a namespace for your application
  • You have a Slack channel where you want to receive the alerts

Create a Slack webhook and amend Alertmanager

  1. Create a Slack webhook - set up an incoming webhook if you don’t already have one for the channel you want to send the alerts to.
  • Follow the steps in “Set up incoming webhooks” (starting with “Create an app”, then “From scratch”)
  2. Create a Kubernetes secret to store the Slack webhook URL (see the example command below).

  3. Create a ticket to request a new alert route in Alertmanager. The Cloud Platform team will need the following information:

  • namespace name
  • team name
  • application name
  • slack channel
  • kubernetes secret name where the webhook url is stored
  • severity level (warning/information)

If you want an event/information type Slack message for monitoring non-problem/non-failure events - e.g. when a team wants positive confirmation that something happened (such as a database refresh or an app deployment) - specify the severity level as information.

The team will provide you with a “custom severity level” that you will need to use in the next step. Make a note of it.
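As a sketch of step 2, the webhook URL can be stored in a generic Kubernetes secret. The secret name and key below are placeholders; use whatever names you agree with the Cloud Platform team, and quote the secret name in your ticket:

kubectl -n <namespace> create secret generic slack-webhook --from-literal=url=https://hooks.slack.com/services/<your-webhook-path>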

Create a PrometheusRule

A PrometheusRule is a custom resource that defines your alerting rule. This file contains the alert name, the PromQL expression, and how long the condition must hold before the alert fires.

To create your own custom alert you’ll need to fill in the template below and deploy it to your namespace (tip: you can check rules in your namespace by running kubectl get prometheusrule -n <namespace>).

  • Create a file called prometheus-custom-rules-<application_name>.yaml
  • Copy in the template below and replace the bracketed values with the requirements of your alert. The <custom_severity_level> field is the value you were given earlier by the Cloud Platform team.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  namespace: <namespace>
  labels:
    role: alert-rules
  name: prometheus-custom-rules-<application_name>
spec:
  groups:
  - name: application-rules
    rules:
    - alert: <alert_name>
      expr: <alert_query>
      for: <check_time_length>
      labels:
        severity: <custom_severity_level>
      annotations:
        message: <alert_message>
        runbook_url: <http://my-support-docs>
        dashboard_url: <http://my-dashboard>
  • Run:
kubectl apply -f prometheus-custom-rules-<application_name>.yaml -n <namespace>
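If the manifest is valid, kubectl should confirm the resource was created, with output along the lines of:

prometheusrule.monitoring.coreos.com/prometheus-custom-rules-<application_name> created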

A working example of this would be:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  namespace: test-namespace
  labels:
    role: alert-rules
  name: prometheus-custom-rules-my-application
spec:
  groups:
  - name: node.rules
    rules:
    - alert: Quota-Exceeded
      expr: 100 * kube_resourcequota{job="kube-state-metrics",type="used",namespace="monitoring"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) > 90
      for: 5m
      labels:
        severity: cp-team
      annotations:
        message: Namespace {{ $labels.namespace }} is using {{ printf "%0.0f" $value }}% of its {{ $labels.resource }} quota.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/runbook.md#alert-name-kubequotaexceeded
        dashboard_url: https://grafana.live.cloud-platform.service.justice.gov.uk/d/application-alerts/application-alerts?orgId=1&var-namespace=cloud-platform-reference-app

The alert name, message, runbook_url and dashboard_url annotations will be sent to the Slack channel when the rule has been triggered.

For information type alerts, you can also add a label status_icon: information to show an information icon in the alert title.
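For example, the labels block of an information alert might look like this (the info- severity prefix is explained in the next section):

      labels:
        severity: info-<custom_severity_level>
        status_icon: information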

You can view the applied rules with the following command:

kubectl -n <namespace> describe prometheusrules prometheus-custom-rules-<application_name>

Information type alerts

If you want to use the same Slack channel and webhook to send information type alerts, you can do so by adding the label severity: info-<custom_severity_level> to the PrometheusRule.

If the custom severity level provided is cp-team, an example of this would be:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  namespace: test-namespace
  labels:
    role: alert-rules
  name: prometheus-custom-rules-secretsmanager
spec:
  groups:
  - name: application-rules
    rules:
    - alert: SecretsManagerPutSecretValue
      expr: secretsmanager_put_secret_value_sum{exported_job="secretsmanager", secret_id="arn:aws:secretsmanager:eu-west-2:754256621582:secret:<your-secret-arn>"} > 0
      for: 1m
      labels:
        severity: info-cp-team
      annotations:
        message: |
          {{ $labels.secret_id }} has had {{ $value }} PutSecretValue operations recently.
          {{ $labels.user_arn }} has had {{ $value }} PutSecretValue operations recently.
        runbook_url: <runbook_url>
        dashboard_url: <dashboard_url>

PrometheusRule examples

If you’re struggling for ideas on how and which alerts to set up, please see some examples here

Advisory Note 1: If Prometheus gets re-installed

In the event of a serious failure, or for a required upgrade, it may be necessary to re-install Prometheus.

PrometheusRule is a CRD that declaratively defines a desired Prometheus rule.

If for any reason Prometheus has to be re-installed, the PrometheusRule CRD is removed as part of the re-install, and all PrometheusRules in the cluster are deleted along with it.

We recommend that all PrometheusRules are added to your namespace folder in the Environments Repo. This will ensure all rules are applied/present at all times.

PrometheusRules can still be tested/amended/applied manually, and a PR can then be created to add them to the Environments Repo when ready.

Advisory Note 2: CPUThrottlingHigh Alert

The CPUThrottlingHigh alert is configured as part of the default rules when installing prometheus-operator.

This alert can trigger for containers that have low CPU limits and/or spiky workloads with very low average usage; CPU throttling kicks in during those spikes. CPU usage accounting is based on the Linux Completely Fair Scheduler (CFS).

If you think this may be causing an issue with your application, we recommend raising your CPU limit, whilst keeping the container CPU request as close to the 95th-percentile average usage as possible.
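As an illustration only (the figures below are placeholders, not recommendations), this means adjusting the container resources along these lines:

resources:
  requests:
    cpu: 100m   # keep the request close to the observed 95th-percentile average usage
  limits:
    cpu: "1"    # raise the limit so short spikes are not throttled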

