Skip to main content

Getting your data onto the Analytical Platform

What is the Analytical Platform?

The Analytical Platform is a data analysis environment, providing modern tools and key datasets for MoJ analysts

Read more at, but basically it gives MoJ’s data science teams access to data created by your service.

How does data from my service get onto the Analytical Platform?

The Analytical Platform requires a periodic (no more frequently than daily) dump of your service’s database.

  1. A cronjob will extract the data from your database, saving it to a specific location in an S3 bucket
  2. An ETL pipeline will load the data from the S3 bucket and load it into the data platform

The data dump can be carried out using the MoJ Data Extraction Tool. When configured, this takes your database and exports it in JSON lines format to a pre-configured S3 bucket.

The ETL pipeline is part of the Register My Data Service. When configured, this will pull the data from the S3 bucket and load it into the Analytical Platform.

The data dump and the pipeline share an AWS role which enables sharing of the database extracts.

How do I setup my service to send data to the Analytical Platform?

1. Create the required AWS resources to allow the Data Extractor to extract the data to an S3 bucket

You’ll need to create a user (and associated access policy) to write to the S3 bucket at the specific location from within your applications cloud platform environment.

Conventially, add an in your environment’s resources folder. Here’s an example file

You’ll need to change the details to match your service, notably the values for the s3 bucket, which should be in the form “moj-reg-/landing/-” (eg for the my-test-service’s preprod environment this should be “moj-reg-preprod/landing/my-test-service-preprod/*”). moj-reg-preprod stands for MoJ Register My Data PreProd.

As part of the deployment terraform creates a k8s secret called analytical-platform-reporting-s3-bucket in your environment. Check it’s been created using

kubectl get secrets -n

Decode the secret and note the value of the arn_user secret. This will be needed for step 3.

cloud-platform decode-secret -n <YOUR SERVICE NAME>-<ENVIRONMENT> -s analytical-platform-reporting-s3-bucket

2. Add the data extractor to your environment

The data extractor is deployed within your service’s namespace - it extracts the data from your service’s database and places it in the S3 bucket specified in the previous step

The data extractor is deployed as a docker image. You configure it using environment variables that are set from a combination of the k8s secrets created in the previous step, and from your applications’ configuration

Images to be deployed into an environment are configured using helm. You’ll need to create a new helm template for the data extractor and provide configuration it to use the secrets for the database access and s3 bucket access

By convention this file is called data-extractor-job.yaml and is saved in the templates directory of your helm_deploy. Here is an example data-extractor-job.yaml.

Don’t forget to change the values to suit your service’s needs - notably the app label and the names of the k8s secrets that are used to configure the data extractor.

For each environment (eg dev, preprod, prod) that you want the data extractor to run in you’ll need to enable it by setting the dataExtractor.enabled value to true. eg

  enabled: true

3. Create the ETL pipeline

The Register My Data service pulls your exported data from the S3 bucket and pushes it into the Analytical Platform. You create a pipeline for your service by editing the config.yaml file in the relevant moj-reg stack (ie moj-reg-dev, moj-reg-preprod, moj-reg-prod).

eg to configure for dev you’d add your service to stacks/moj-reg-dev/config.yaml. You will need the ARN of the user created by terraform in step 1.

This page was last reviewed on 9 August 2023. It needs to be reviewed again on 9 February 2024 by the page owner #cloud-platform .
This page was set to be reviewed before 9 February 2024 by the page owner #cloud-platform. This might mean the content is out of date.