Skip to main content

Using the Cloud Platform Monitoring and Alerting stack

Introduction

This is a brief introduction to the Cloud Platform Monitoring and Alerting stack: Prometheus, AlertManager, Grafana, Thanos and Pushgateway.

Below is a high-level overview

Monitoring stack

Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. The Cloud Platform uses Prometheus Operator from CoreOS which allows a number of Prometheus instances to be installed on a cluster (although we currently use a single Prometheus instance for the whole cluster).

Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally and runs rules over this data to either aggregate and record new time series from existing data, or to generate alerts. Grafana or other API consumers can be used to visualize the collected data.

To export metrics from your application into the Cloud Platform Prometheus, see: Getting application metrics into Prometheus

AlertManager

AlertManager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as PagerDuty. It also takes care of the silencing and inhibition of alerts.

To create and send application-specific alerts to a Slack channel see: Creating your own custom alerts

Thanos

Thanos is an open source project that is capable of integrating with a Prometheus deployment, enabling a long-term, scalable storage. By adding Thanos sidecar to Prometheus, it uploads the data every two hours to storage(an S3 bucket in our case). It also serves real-time metrics that are not uploaded in bucket.

Thanos Components including Thanos querier, Store and Compactor has been installed in Cloud Platform monitoring stack.

Thanos Querier which is similar to Prometheus is able to query a Prometheus instance(through sidecar) and a Thanos Store at the same time. It allows you to query several months worth of Prometheus metrics and create Grafana dashboards based on that.

Thanos Compactor applies the compaction procedure to Prometheus block data stored in an S3 bucket. Thanos also downsampling data that are stored in the s3 bucket. For each resolutions available, here is the retention period

raw - 30 days

5m(five minutes) resolution - 6 months

1h(one hour) resolution - 1 year

Because of the downsampling of data blocks, using Grafana dashboards is not suitable for monitoring the uptime of the application for monthly/quarterly/yearly basis.

Grafana

Grafana allows you to query, visualize, alert on and understand your metrics no matter where they are stored. Create, explore, and share dashboards with your team and foster a data driven culture.

Creating dashboards

Grafana is set up as a stateless app, managed entirely through code. This helps achieve better availability, but it means that dashboards you create will not automatically persist in its database.

To create a dashboard:

  1. Login to Grafana (see the links below) with your GitHub account. All users are able to edit dashboards but cannot save the changes. Find the dashboard titled ‘Blank Dashboard’ and modify it as you see fit. Select “Prometheus” as datasource for querying metrics past 24 hours. If you want to query long-term metrics, select the “Thanos” as datasource.

  2. When you are happy with your dashboard, click the share icon on the top right corner, select the Export tab, check the Export for sharing externally box and click on View JSON. Copy the resulting JSON string into a ConfigMap in your namespace.

    To do this, create a YAML file like this:

    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: <my-dashboard>
      namespace: <my-namespace>
      labels:
        grafana_dashboard: ""
    data:
      my-dashboard.json: |
        {
          [ ...your JSON document goes here, minus the outermost braces... ]
        }
    

    Make sure that the JSON string is properly indented. Also, name of the key in the ConfigMap must end in -dashboard.json (as in my-dashboard.json above). Please note that you can have multiple dashboards exported in a single ConfigMap as well.

  3. Use kubectl to apply the ConfigMap above, and your dashboard should be visible in Grafana shortly.

To build dashboards for your Cloudwatch metrics, see: Using the CloudWatch data source in Grafana

To build dashboards for long-term metrics, select the “Thanos” as datasource.

Pushgateway

Pushgateway allow ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus.

You can install pushgateway into your namespace by adding the module to your resources folder of cloud-platform-environments repo.

How to access monitoring tools

All links provided below require you to authenticate with your Github account.

Prometheus: https://prometheus.cloud-platform.service.justice.gov.uk

AlertManager: https://alertmanager.cloud-platform.service.justice.gov.uk

Grafana: https://grafana.cloud-platform.service.justice.gov.uk

Further documentation

Prometheus querying

This page was last reviewed on 23 March 2021. It needs to be reviewed again on 23 June 2021 .
This page was set to be reviewed before 23 June 2021. This might mean the content is out of date.