Getting your data onto the Analytical Platform
What is the Analytical Platform?
The Analytical Platform is a data analysis environment providing modern tools and key datasets for MoJ analysts. In short, it gives MoJ’s data science teams access to data created by your service. Read more in the user guidance at https://user-guidance.analytical-platform.service.justice.gov.uk/.
How does data from my service get onto the Analytical Platform?
The Analytical Platform requires a periodic (no more frequently than daily) dump of your service’s database.
- A cron job extracts the data from your database and saves it to a specific location in an S3 bucket
- An ETL pipeline then reads the extracts from the S3 bucket and loads them into the Analytical Platform
The data dump can be carried out using the MoJ Data Extraction Tool. Once configured, this exports your service’s database in JSON lines format to a pre-configured S3 bucket.
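JSON lines means one record per line, each line a standalone JSON object. Purely as an illustration (the actual tables, columns and file layout depend on your service’s schema), an extract of a hypothetical bookings table might contain lines like:

{"id": 1, "created_at": "2023-01-05T10:32:00Z", "status": "CONFIRMED"}
{"id": 2, "created_at": "2023-01-06T09:15:00Z", "status": "CANCELLED"}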
The ETL pipeline is part of the Register My Data Service. When configured, this will pull the data from the S3 bucket and load it into the Analytical Platform.
The data dump and the pipeline share an AWS role, which gives both sides access to the database extracts.
How do I set up my service to send data to the Analytical Platform?
1. Create the required AWS resources to allow the Data Extractor to extract the data to an S3 bucket
You’ll need to create a user (and associated access policy) that can write to the specific location in the S3 bucket, from within your application’s Cloud Platform environment.
Conventionally, add an analytical-platform.tf file in your environment’s resources folder. Here’s an example file:
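The sketch below shows the general shape, using plain AWS and Kubernetes provider resources: an IAM user, a policy scoped to your service’s prefix in the bucket, and a Kubernetes secret exposing the credentials to your namespace. The bucket name, resource names and secret keys are assumptions for illustration; in practice, copy the example from an existing service (or the Cloud Platform terraform modules your team already uses) and adjust it.

# Illustrative sketch only - copy a real example from an existing service.
# The bucket name, user name and secret keys shown here are assumptions.
resource "aws_iam_user" "analytical_platform_reporting" {
  name = "analytical-platform-reporting-<YOUR SERVICE NAME>-<ENVIRONMENT>"
  path = "/system/analytical-platform-reporting/"
}

resource "aws_iam_access_key" "analytical_platform_reporting" {
  user = aws_iam_user.analytical_platform_reporting.name
}

resource "aws_iam_user_policy" "analytical_platform_reporting" {
  name = "analytical-platform-reporting-policy"
  user = aws_iam_user.analytical_platform_reporting.name

  # Allow access only to your service's prefix in the shared bucket
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["s3:PutObject", "s3:GetObject", "s3:ListBucket"]
      Resource = [
        "arn:aws:s3:::moj-reg-<ENVIRONMENT>",
        "arn:aws:s3:::moj-reg-<ENVIRONMENT>/<YOUR SERVICE NAME>/*",
      ]
    }]
  })
}

# Exposes the credentials to your namespace as the k8s secret referenced below
resource "kubernetes_secret" "analytical_platform_reporting_s3_bucket" {
  metadata {
    name      = "analytical-platform-reporting-s3-bucket"
    namespace = "<YOUR SERVICE NAME>-<ENVIRONMENT>"
  }

  data = {
    bucket_name       = "moj-reg-<ENVIRONMENT>"
    arn_user          = aws_iam_user.analytical_platform_reporting.arn
    access_key_id     = aws_iam_access_key.analytical_platform_reporting.id
    secret_access_key = aws_iam_access_key.analytical_platform_reporting.secret
  }
}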
You’ll need to change the details to match your service, notably the values for the S3 bucket, which should be in the form “moj-reg-…”.
As part of the deployment, terraform creates a k8s secret called analytical-platform-reporting-s3-bucket in your environment. Check it’s been created using:
kubectl get secrets -n <YOUR SERVICE NAME>-<ENVIRONMENT>
Decode the secret and note the value of arn_user; this will be needed for step 3:
cloud-platform decode-secret -n <YOUR SERVICE NAME>-<ENVIRONMENT> -s analytical-platform-reporting-s3-bucket
2. Add the data extractor to your environment
The data extractor is deployed within your service’s namespace. It extracts the data from your service’s database and places it in the S3 bucket specified in the previous step.
The data extractor is deployed as a Docker image. You configure it using environment variables, set from a combination of the k8s secrets created in the previous step and your application’s existing configuration.
Images to be deployed into an environment are configured using helm. You’ll need to create a new helm template for the data extractor and configure it to use the secrets for database access and S3 bucket access.
By convention this file is called data-extractor-job.yaml and is saved in the templates directory of your helm_deploy. Here is an example data-extractor-job.yaml:
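As a sketch of the general shape only: the template below runs the extractor as a Kubernetes CronJob, gated on dataExtractor.enabled, with the database connection and S3 credentials injected from secrets. The image name, schedule, environment variable names and secret names are assumptions; take the real values from an existing service’s example and the data extractor’s own documentation.

{{- if .Values.dataExtractor.enabled }}
apiVersion: batch/v1
kind: CronJob
metadata:
  name: data-extractor
  labels:
    app: <YOUR SERVICE NAME>
spec:
  schedule: "0 2 * * *"  # daily extract; must be no more frequent than daily
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: data-extractor
              image: <DATA EXTRACTOR IMAGE>  # placeholder for the published MoJ data extractor image
              env:
                # Database connection, from your service's existing database secret
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: <YOUR DATABASE SECRET>
                      key: url
                # Destination bucket and credentials, from the secret created in step 1
                - name: S3_BUCKET_NAME
                  valueFrom:
                    secretKeyRef:
                      name: analytical-platform-reporting-s3-bucket
                      key: bucket_name
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: analytical-platform-reporting-s3-bucket
                      key: access_key_id
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: analytical-platform-reporting-s3-bucket
                      key: secret_access_key
{{- end }}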
Don’t forget to change the values to suit your service’s needs - notably the app label and the names of the k8s secrets that are used to configure the data extractor.
For each environment (e.g. dev, preprod, prod) in which you want the data extractor to run, you’ll need to enable it by setting the dataExtractor.enabled value to true, e.g.
dataExtractor:
  enabled: true
3. Create the ETL pipeline
The Register My Data service pulls your exported data from the S3 bucket and pushes it into the Analytical Platform. You create a pipeline for your service by editing the config.yaml file in the relevant moj-reg stack (i.e. moj-reg-dev, moj-reg-preprod or moj-reg-prod).
For example, to configure for dev you’d add your service to stacks/moj-reg-dev/config.yaml. You will need the ARN of the user created by terraform in step 1.
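The exact schema of config.yaml is owned by the Register My Data service, so copy an existing entry from the file rather than relying on the sketch below; the field names here are assumptions, illustrating only that an entry ties together your service’s name, the S3 location of its extracts and the user ARN from step 1.

# Illustrative entry only; field names are assumptions - copy a real entry from config.yaml
<your-service-name>:
  s3_bucket: moj-reg-dev
  s3_prefix: <your-service-name>/
  user_arn: <ARN OF USER FROM STEP 1>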