
Cloud Platform Operational Processes

This is a record of the operational processes that we will use to support our users and to maintain the cloud platform.

Hours of support

Our support hours are 10am - 5pm. During this time we will work on support requests from teams, make sure someone is available to answer questions in the #ask-cloud-platform Slack channel, and look at pull request notifications for the cloud-platform-environments repo in #cloud-platform-notify.

Outside of these hours we will respond to high priority incidents as per our on call process.

Working hours

A member of the support team will be available online from 9am until 6pm. This team member can work remotely. We do this to cover incident response while the person on call is travelling to the office or going home at the end of the day.

This support team member should be ready to respond to any high priority incidents.

The support team

The key activities for the day are:

On starting the day (9AM) At least one engineer to:

During support hours (10AM - 5PM) At least one engineer to:

  • Actively participate in #ask-cloud-platform to field support requests, triage, prioritise and fix
  • Monitor #high-priority-alarms and #low-priority-alarms for incidents, triage, prioritise and fix
  • Monitor #cloud-platform-notify for cloud-platform-environments repo pull request notifications and cloud platform build notifications
  • Monitor user support requests being added to the board and triage them for priority

The whole support team to:

  • Manage incident process for incidents in progress
  • Work on user requests for larger work
  • Work on support backlog stories

Before ending the day (6PM) At least one engineer to:

  • Hand over information about any planned work or in-progress high priority incidents to the on call engineer

Escalation Process

The Cloud Platform Team endeavours to respond in a timely manner to all questions and requests for support raised in the #ask-cloud-platform channel. However, support can sometimes be busy, with engineers working on or triaging a number of different requests from users.

If you have raised a service impacting incident in the #ask-cloud-platform channel but have not had a response, you can escalate it by emailing Contact-on-call-cloud-platform@digital.justice.gov.uk. This will trigger the critical incident process and alert the Cloud Platform Team via the high-priority alerts channel.

Please note this process should NOT be used to:

  • get updates on standard pull requests
  • check on the status of issues
  • ask for support on anything other than service impacting incidents

If you require support for a high criticality incident out of hours please see the out of hours Critical Incident Process.

#ask-cloud-platform slack channel

The #ask-cloud-platform channel is our main entry point for support. We encourage people to ask questions and report problems in this channel and we’ll do the best we can to help. We also welcome comments, opinions and feedback on any of the services and the platform as a whole. You can also use this channel to share topics that you think might be useful to other cloud platform users.

One of the Cloud Platform Team will be available to answer questions throughout the hours of support (10AM - 5PM).

What we communicate

The main purpose of #ask-cloud-platform is to discuss the problems that people are having and help them to solve them.

User support tickets

If someone is asking for help that will be quick to do (less than 15 mins) or is mainly advice then keep the interaction in channel and get it done.

If someone needs something that takes longer, is more challenging to complete or you find you’ve spent longer than 15 mins on it, then continue to talk in the channel but also ask the person asking for help to raise a ticket using the GitHub issue link.
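The rule of thumb above can be sketched as a small decision function. This is only an illustration: the name `needs_ticket`, its parameters and the choice to treat exactly 15 minutes as needing a ticket are assumptions, not part of the team's tooling. It also does not model the "more challenging to complete" case, which remains a judgment call for the engineer.

```python
TICKET_THRESHOLD_MINUTES = 15  # threshold taken from the guidance above


def needs_ticket(estimated_minutes: int,
                 minutes_spent: int = 0,
                 advice_only: bool = False) -> bool:
    """Return True when the requester should be asked to raise a ticket.

    Hypothetical helper: the parameter names are illustrative only.
    """
    # Already spent longer than the threshold: ask for a ticket regardless.
    if minutes_spent > TICKET_THRESHOLD_MINUTES:
        return True
    # Requests that are mainly advice stay in the channel.
    if advice_only:
        return False
    # Quick jobs (less than 15 mins) stay in the channel (assumption:
    # exactly 15 minutes counts as needing a ticket).
    return estimated_minutes >= TICKET_THRESHOLD_MINUTES
```

For example, `needs_ticket(5)` keeps a quick request in the channel, while `needs_ticket(5, minutes_spent=20)` asks for a ticket because the work has overrun.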

The purpose of the ticket is to keep a record of the work we are doing and how it is progressing. It is not the primary communication channel with the person who raised the problem. That should remain Slack, where you can provide a richer and more human interaction, answering questions if necessary. The engineer working on the ticket should update the ticket for our records as work progresses.

In many cases it can be more helpful for the engineer on support to create the ticket rather than the person reporting the problem as you will be able to provide more relevant context.

Once they have created the ticket it will appear in the Support New Issues column of our sprint board. It can then be moved into the relevant columns as you work on it.

#cloud-platform-update slack channel

#cloud-platform-update is a low-volume channel for broadcast communications. Examples include:

  • to announce future scheduled work (e.g. “we will be upgrading kubernetes on … no downtime is expected”)
  • when there is planned work on a service (e.g. “we are doing some maintenance work on an RDS instance in prod”)
  • when there is an incident in progress (e.g. “there is a problem with Jenkins, no one can access it right now”)

In these situations:

  • try not to use notifiers like @here and @channel unless absolutely necessary
  • consider using visual cues in your message to show it is an announcement
  • commit to a time when you will update on the situation and keep that commitment

So for example:

ANNOUNCEMENT: As agreed with the relevant teams, the cloud platform team will be upgrading Postgres from version 9.2 (out of support) to version 9.6 on CLA, Moving People Safely and Prison Visits Booking RDS instances today.

We will be doing this work between 10AM and 3PM. We will post updates at 1PM and 3PM to say how it is going.

Our on call process

Team members who are on call manage an on call rota in PagerDuty. The on call rota consists of a primary engineer and a secondary engineer.

Team members that are new to the rota should ensure that they have a secondary who has enough knowledge and experience of the process to support them.

The hours of on call are 7AM–10AM and 5PM–10PM on weekdays, and 7AM–10PM on weekends and bank holidays. The periods 9AM–10AM and 5PM–6PM are additionally covered by a team member from support who is online during those times.
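As a rough sketch, the support and on call hours described in this document can be expressed as a lookup for a given moment. The function name `coverage`, the returned labels, the `bank_holidays` parameter and the half-open interval boundaries (for example, 10AM exactly counts as support hours) are all assumptions made for illustration. The sketch also assumes support hours apply on weekdays only.

```python
from datetime import date, datetime, time


def coverage(moment: datetime, bank_holidays: frozenset = frozenset()) -> str:
    """Return which cover applies at `moment` (hypothetical helper).

    `bank_holidays` is an illustrative set of `date` objects supplied
    by the caller; this sketch does not look them up anywhere.
    """
    t = moment.time()
    # Nothing is covered before 7AM or from 10PM onwards.
    if not time(7) <= t < time(22):
        return "none"
    # Weekends and bank holidays: on call only, 7AM-10PM.
    if moment.weekday() >= 5 or moment.date() in bank_holidays:
        return "on-call"
    # Weekday support hours, 10AM-5PM.
    if time(10) <= t < time(17):
        return "support"
    # 9AM-10AM and 5PM-6PM: on call plus a support team member online.
    if time(9) <= t < time(10) or time(17) <= t < time(18):
        return "on-call+support"
    # Remaining weekday on call hours: 7AM-9AM and 6PM-10PM.
    return "on-call"
```

A caller could pass a set of dates, such as `frozenset({date(2024, 5, 27)})`, to have those days treated like weekends.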

On call team members will respond to high priority incidents during on call hours and will work on those incidents for up to 1 hour. If the engineer is not able to resolve the issue within that time, they will:

  • inform the affected team
  • put together a plan to resolve the issue during normal working hours

Critical incidents only - contact the out-of-hours on-call person

In the event of a ‘Critical incident’ affecting your application (out of support hours), please email Contact-on-call-cloud-platform@digital.justice.gov.uk, stating:

  • the known facts surrounding the incident
  • the ‘Slack’ channel on which further communication can take place

The on-call person will get back to you on ‘Slack’ as soon as possible.
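For teams that want to prepare this email in advance, the following sketch builds a message containing the two pieces of information requested above, using Python's standard `email.message.EmailMessage`. The function name, the subject line and the body layout are assumptions; only the recipient address comes from this document. The sketch builds the message but does not send it.

```python
from email.message import EmailMessage

# Address taken from this document.
ON_CALL_ADDRESS = "Contact-on-call-cloud-platform@digital.justice.gov.uk"


def build_critical_incident_email(sender: str,
                                  known_facts: list,
                                  slack_channel: str) -> EmailMessage:
    """Build (but do not send) a critical incident email.

    Hypothetical helper: the subject and body format are illustrative.
    """
    if not known_facts:
        raise ValueError("state the known facts surrounding the incident")
    message = EmailMessage()
    message["From"] = sender
    message["To"] = ON_CALL_ADDRESS
    message["Subject"] = "Critical incident"  # assumed wording
    facts = "\n".join(f"- {fact}" for fact in known_facts)
    message.set_content(
        f"Known facts:\n{facts}\n\n"
        f"Slack channel for further communication: {slack_channel}\n"
    )
    return message
```

The resulting message can then be sent with whatever mail client or library your team already uses.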

It must be emphasised that this is ONLY for ‘Critical Incidents’ and is not for out-of-hours application or developmental support.

Support given will be on a best endeavours basis.

Our documentation

All Cloud Platform documentation is openly available in Git repos stored on GitHub. The starting point for this documentation is the Cloud Platform repo.

For users of the Cloud Platform there is the Cloud Platform User Guide. This guide is the “front door” to the Cloud Platform and contains concepts, tasks, walkthroughs and other information to help teams use the Cloud Platform.

This page was last reviewed on 23 February 2024. It needs to be reviewed again on 23 May 2024 by the page owner #cloud-platform.