Cloud Platform Operational Processes
This is a record of the operational processes that we will use to support our users and to maintain the cloud platform.
Hours of support
Our support hours are 10am - 5pm. During this time we will work on support requests from teams, make sure someone is available to answer questions in the Slack channel #ask-cloud-platform, and look at pull request notifications for the cloud-platform-environments repo.
Outside of these hours we will respond to high priority incidents as per our on call process.
A member of the support team will be available online from 9am until 6pm. This team member can be working remotely. We do this to cover incident response while the person on call is travelling into the office or going home at the end of the day.
This support team member should be ready to respond to any high priority incidents.
The support team
The key activities for the day are:
On starting the day (9AM) At least one engineer to:
- Check the #high-priority-alarms slack channel for any issues to investigate
- Check the board for any user requests that have been raised and not assigned
- Get a handover from the on call engineer about any issues out of hours
- Read through
During support hours (10AM - 5PM) At least one engineer to:
- Actively participate in:
  - #ask-cloud-platform to field support requests, triage, prioritise and fix
  - #low-priority-alarms for incidents, triage, prioritise and fix
  - #cloud-platform-notify for cloud-platform-environments repo PR notifications and cloud platform build notifications
- Monitor user support requests being added to the board and triage them for priority
The whole support team to:
- Manage incident process for incidents in progress
- Work on user requests for larger work
- Work on support backlog stories
Before ending the day (6PM) At least one engineer to:
- Hand over information about any planned work or in-progress high priority incidents to the on call engineer
#ask-cloud-platform slack channel
The #ask-cloud-platform channel is our main entry point for support. We encourage people to ask questions and report problems in this channel and we’ll do the best we can to help. We also welcome comments, opinions and any feedback regarding any of the services and the platform as a whole. This channel can also be used to share topics with other cloud platform users that you think they might find useful.
One of the Cloud Platform Team will be available to answer questions throughout the hours of support (10AM - 5PM).
What we communicate
The main purpose of #ask-cloud-platform is to discuss the problems that people are having and help to solve them.
User support tickets
If someone is asking for help that will be quick to do (less than 15 mins) or is mainly advice then keep the interaction in channel and get it done.
If someone needs something that takes longer, is more challenging to complete or you find you’ve spent longer than 15 mins on it, then continue to talk in the channel but also ask the person asking for help to raise a ticket using the GitHub issue link.
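The 15 minute rule above can be written down as a small decision helper. This is only an illustrative sketch: the `SupportRequest` record, its field names and the `needs_ticket` function are hypothetical and not part of any Cloud Platform tooling. Only the 15 minute threshold comes from the text.

```python
from dataclasses import dataclass

# Threshold from the text: anything longer than this should get a ticket.
QUICK_REQUEST_MINUTES = 15


@dataclass
class SupportRequest:
    """Hypothetical record of a request raised in #ask-cloud-platform."""
    summary: str
    estimated_minutes: int = 0   # rough guess at the total effort
    minutes_spent: int = 0       # time already spent helping
    advice_only: bool = False    # True if the person mainly needs advice


def needs_ticket(request: SupportRequest) -> bool:
    """Return True if the requester should be asked to raise a GitHub issue."""
    # If more than 15 minutes has already been spent, always ask for a ticket.
    if request.minutes_spent > QUICK_REQUEST_MINUTES:
        return True
    # Requests that are mainly advice stay in the channel.
    if request.advice_only:
        return False
    # Otherwise decide on the estimated effort.
    return request.estimated_minutes >= QUICK_REQUEST_MINUTES
```

Either way the conversation continues in the channel. The helper only decides whether a ticket should exist alongside it.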
The purpose of the ticket is to keep a record of the work we are doing and how it is progressing. It is not the primary communication channel with the person who raised the problem - this should remain in slack where you can provide a richer and more human interaction, answering questions if necessary. The engineer working on the ticket should update the ticket for our records as work progresses.
In many cases it can be more helpful for the engineer on support to create the ticket rather than the person reporting the problem as you will be able to provide more relevant context.
Once the ticket has been created it will appear in the Support New Issues column of our sprint board. It can then be moved into the relevant columns as you work on it.
#cloud-platform-update slack channel
#cloud-platform-update is a low-volume channel for broadcast communications. Examples include:
- to announce future scheduled work (e.g. “we will be upgrading kubernetes on … no downtime is expected”)
- when there is planned work on a service (e.g. “we are doing some maintenance work on an RDS instance in prod”)
- when there is an incident in progress (e.g. “there is a problem with Jenkins, no one can access it right now”)
In these situations:
- try not to use notifiers like @channel unless absolutely necessary
- consider using visual cues in your message to show it is an announcement
- commit to a time when you will update on the situation and keep that commitment
So for example:
ANNOUNCEMENT: As agreed with the relevant teams, the cloud platform team will be upgrading Postgres from version 9.2 (out of support) to version 9.6 on CLA, Moving People Safely and Prison Visits Booking RDS instances today.
We will be doing this work between 10AM and 3PM, and we will update at 1PM and 3PM to say how it is going.
Our on call process
Team members who are on call manage an on call rota in PagerDuty. The on call rota consists of a primary engineer and a secondary engineer.
Team members that are new to the rota should ensure that they have a secondary who has enough knowledge and experience of the process to support them.
The on call hours are 7AM–10AM and 5PM–10PM on weekdays, and 7AM–10PM on weekends and bank holidays. The periods 9AM–10AM and 5PM–6PM are additionally covered by a team member from support who is online during those times.
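As a rough illustration of the hours above, the sketch below checks whether a given moment falls inside on call hours. It makes several assumptions that are not in this document: the `BANK_HOLIDAYS` set contains placeholder dates, each window includes its start time and excludes its end time, times are naive local times, and the extra support cover applies on weekdays only.

```python
from datetime import date, datetime, time

# Placeholder dates for illustration; a real list would come from a maintained source.
BANK_HOLIDAYS = {date(2024, 12, 25), date(2024, 12, 26)}

# Windows are (start, end) pairs, start inclusive and end exclusive (an assumption).
WEEKDAY_WINDOWS = [(time(7), time(10)), (time(17), time(22))]
FULL_DAY_WINDOWS = [(time(7), time(22))]
SUPPORT_COVER_WINDOWS = [(time(9), time(10)), (time(17), time(18))]


def _in_windows(t: time, windows) -> bool:
    """Return True if t falls inside any of the given windows."""
    return any(start <= t < end for start, end in windows)


def is_full_day(day: date) -> bool:
    """Weekends and bank holidays use the single 7AM-10PM window."""
    return day.weekday() >= 5 or day in BANK_HOLIDAYS


def is_on_call(moment: datetime) -> bool:
    """Return True if moment falls inside on call hours."""
    windows = FULL_DAY_WINDOWS if is_full_day(moment.date()) else WEEKDAY_WINDOWS
    return _in_windows(moment.time(), windows)


def has_support_cover(moment: datetime) -> bool:
    """Return True if a support team member is also online (assumed weekdays only)."""
    if is_full_day(moment.date()):
        return False
    return _in_windows(moment.time(), SUPPORT_COVER_WINDOWS)
```

For example, `is_on_call(datetime(2024, 12, 2, 8, 0))` covers a Monday at 8AM, which is inside the morning window, while midday on the same day falls within normal support hours instead.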
On call team members will respond to high priority incidents during on call hours and will work on those incidents for up to 1 hour. If the engineer is not able to resolve the issue within that timeframe they will:
- inform the affected team
- put together a plan to resolve the issue during normal working hours
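The one hour limit described above could be expressed as a simple timebox check. This is a hedged sketch only: the function name `next_steps` and its parameters are hypothetical, and the returned strings restate the steps from the list above.

```python
from datetime import datetime, timedelta

# From the text: on call engineers work on an incident for up to 1 hour.
INCIDENT_TIMEBOX = timedelta(hours=1)


def next_steps(started_at: datetime, now: datetime, resolved: bool) -> list[str]:
    """Return what the on call engineer should do next for an incident."""
    if resolved:
        return []
    # Still inside the one hour timebox: keep working on the incident.
    if now - started_at < INCIDENT_TIMEBOX:
        return ["continue working on the incident"]
    # Timebox used up without a fix: follow the two steps from the process.
    return [
        "inform the affected team",
        "put together a plan to resolve the issue during normal working hours",
    ]
```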
Critical incidents only - contact the out-of-hours on-call person
In the event of a ‘Critical incident’ affecting your application (out of support hours), please email Contactemail@example.com, stating:
- the known facts surrounding the incident
- the ‘Slack’ channel on which further communication can take place
The on-call person will get back to you on ‘Slack’ as soon as possible.
It must be emphasised that this is ONLY for ‘Critical Incidents’ and is not for out-of-hours application or developmental support.
Support given will be on a best endeavours basis.
All Cloud Platform documentation is openly available in Git repos stored on GitHub. The starting point for this documentation is the Cloud Platform repo.
For users of the Cloud Platform there is the Cloud Platform User Guide. This guide is the “front door” to the Cloud Platform and contains concepts, tasks, walkthroughs and other information to help teams use the Cloud Platform.
For our legacy systems based around Template Deploy, documentation lives in Confluence. This documentation includes architecture, runbooks and common issues.