
Migrate to the “live” cluster


This is a guide to the EKS migration, for users of the Cloud Platform.

The Cloud Platform team is asking all teams with services running on the platform to migrate their services to a new cluster in the platform, during Autumn 2021. This migration is needed to improve the platform’s maintainability, so that the team can deliver more user-facing functionality.

The expected time for a migration is two hours. Some services may take longer, depending on the features used. Support is on hand throughout the process.

Migration overview

The migration is between Kubernetes clusters:

  • from the “live-1” cluster, on which all namespaces have been running during 2018-2021
  • to “live”, a new cluster based on AWS’s EKS managed service

There are only minor differences between the old and new clusters, so in most cases no changes are needed to the service.

Migrate each environment, or ‘namespace’, separately. Initially we will invite teams to migrate their ‘development’ namespace, so that any issues can be ironed out before affecting the production service.

During migration, you can continue to deploy your application to the namespace; however, you can’t change your namespace configuration or terraform. More info: see “Can I make changes to my service during migration?” below.

If you have any trouble during the migration, then it is easy to roll back at any point.


  • Pilot 20/9/21 - 30/9/21 - A small invited group will migrate their dev namespaces, to iron out teething problems with the process
  • Stage 1 27/10/21 - 30/11/21 - Widespread migration of dev namespaces - to confirm workloads run fine on the new cluster, and any challenges for teams migrating
  • Stage 2 4/1/22 - 31/1/22 - Widespread migration of production (and other) namespaces - e.g. staging, pre-prod and production
  • live-1 deprecated 31/1/22 - live-1 enters a wind-down period, during which the CP team will provide only a minimal level of service for this cluster. It will not be updated, and any remaining services run on it at risk

What is EKS and why is Cloud Platform adopting it?

Elastic Kubernetes Service (EKS) is a managed Kubernetes service offered by AWS. Back when we started building the Cloud Platform we made the decision to build our clusters using kOps. This was a self-managed service that enabled us to start building the platform.

Since then EKS has matured as a product and we are now in a position to switch to this solution which will enable us to better meet user needs, and continue to grow the platform. More info: ADR#22

Why should we migrate our services to the new cluster?

Benefits to users of the Cloud Platform:

  • Reduced operational overhead of EKS compared to kOps means the Cloud Platform team can spend more time working on the needs of our users, and improving the platform.
  • Smoother upgrades of Kubernetes reduces the potential for downtime.
  • The move to EKS is an enabler to adopting a multi-cluster architecture, which will enable the platform scale, offer better separation between non-production and production services, and reduce the “blast radius” of unforeseen outages.
  • It also mitigates a number of risks that will improve resilience and security on the platform.

Why call the new cluster “live”?

We didn’t call it “live-2” because we plan to make this the last visible name change. We’d rather abstract the identity of the cluster away from users, to allow the platform team to shift workloads onto a new cluster when doing a blue-green upgrade or disaster recovery. Inevitably we will change clusters again some time in the future, but in URLs the new cluster should be called “live” as well.

We continue to use the word “live” to indicate that it is suitable for environments from dev to production, and that it is live to the public, not just internally accessible.

What are we migrating?

You will be migrating your Cloud Platform namespace of Kubernetes resources. This includes all permissions and resources defined in the .yaml and .tf files in your namespace directory. The new Kubernetes cluster is in the same AWS account and VPC, so that already running AWS resources such as RDS and ECR don’t change during this migration.

Why are we doing the extra work of a ‘gradual’ cut-over?

Users of modern digital services expect government services to have no planned downtime, and good reliability. The Cloud Platform has the facility to move traffic to the new cluster gradually (‘canarying’), using weighted DNS, so service teams are expected to make use of this. This is achieved by adding two lines of annotations to the ingress on each cluster and changing the weights. By initially directing only 1% of traffic onto the new cluster and building up, any issues with load, or edge case usage, can be caught early, and the migration can be rolled back. So this small change avoids planned downtime and reduces risk of unplanned downtime.
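The weighted-DNS arithmetic is simple to sketch: each ingress receives a share of requests equal to its aws-weight divided by the sum of the weights for that hostname. A minimal sketch, with illustrative weights of 95 and 5:

```shell
# Illustrative only: Route 53 sends each weighted record a share of requests
# equal to its weight divided by the sum of the weights for the same hostname.
live1_weight=95
live_weight=5
total=$((live1_weight + live_weight))
echo "live-1 gets $((100 * live1_weight / total))% of requests"
echo "live gets $((100 * live_weight / total))% of requests"
```

Starting at, say, 99/1 and moving in a few increments to 0/100 gives the gradual cut-over described above.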

This guide describes a gradual process for all environments, not just production ones. This is because it is best to test the process works for the non-prod environments, and to familiarize yourself with the steps, before you run them on the production environment.

Can I make changes to my service during migration?

Your application can be deployed as normal. This is because in Step 6 you will configure your application’s CI/CD to deploy your application to both clusters, so they are the same.

However you shouldn’t change your namespace configuration, including your terraform i.e. the .yaml and .tf files in your environment’s folders under the live-1 folder and live folder. This is because the Apply Pipeline can only apply changes to one cluster or the other, not both (because they share the terraform state, which includes some ServiceAccount objects in the cluster, so applying to both clusters would recreate these k8s resources every time). So for the 2 hours (expected) of the migration you should not change this config. If you urgently need to change your namespace config, then you can roll back the migration.

Prerequisites for migration

There are just a few things you need to ensure before you start the migration.

  • Ensure your containers are “stateless”. During the transition between clusters, your pods will be running simultaneously on both “live” and “live-1”. Any state that is in the containers (or inside the pod), such as the ephemeral disk or EBS, will not be copied to the new cluster. For example if you store user session data on disk, then moving it to Elasticache will avoid web users being logged off when traffic gets routed to the other cluster.

  • Check that multiple copies of your app are ok to communicate simultaneously with AWS resources. These AWS resources will be shared between clusters. For example, your RDS instance will accept connections from pods in “live” and “live-1”.

  • If you use Kubernetes CronJobs or Jobs with the ttlSecondsAfterFinished option, refer to this guide: ttlSecondsAfterFinished. You can check whether you do by running the commands below in your namespace.

  kubectl -n <namespace-name> get cronjobs -ojson | jq -r '.items[] | select(.spec.jobTemplate.spec.ttlSecondsAfterFinished != null) | .metadata.name'

  kubectl -n <namespace-name> get jobs -ojson | jq -r '.items[] | select(.spec.ttlSecondsAfterFinished != null) | .metadata.name'

  • Ensure you have an up-to-date version of the cloud-platform CLI. Check your installed version with:

  cloud-platform version

(Old versions try to auto-update, but this can fail. In this case, to upgrade, run brew update && brew upgrade cloud-platform-cli or download the latest release)

If any of these are a concern, or you’d like to check in advance, please contact the Cloud Platform team.

How to migrate

These steps describe how to migrate. We’d love to make these as clear and simple as possible, so please leave feedback if you’ve either struggled to understand a step or feel it could be clarified further.

Step 1 - Agree a time to perform the migration

During the process of migration you can still deploy your app to both clusters together. However you will not be able to change your environment configuration, as defined in environments repo, which may be an issue. You should speak to your team and agree a time and date to attempt to migrate your namespace. It’s also strongly recommended that you get familiar with migrating a non-production namespace before attempting prod.

Step 2 - Amend and apply your “live-1” ingress resource

You must add two new annotations to your existing ingress resource in “live-1”, needed for load balancing using DNS weighting. We’ve designed this process to allow for moving traffic gradually to the new copy of your app on the new cluster, which also provides easy roll-back, if needed.

This looks as follows:

kind: Ingress
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/set-identifier: <ingress-name>-<namespace-name>-blue
    external-dns.alpha.kubernetes.io/aws-weight: "100"

NOTE: Please change the <ingress-name> and <namespace-name> values in the above example. The colour blue is used for the ingress in “live-1”.

Once you have added the above two external-dns annotations and saved your changes to the “live-1” ingress.yaml file, you need to ‘apply’ it to the cluster. Ideally you have CI/CD set up, so merging the PR will apply it. Alternatively, you can apply the ingress manually using this command:

kubectl -n <namespace-name> apply -f ingress.yaml

The idea behind the annotations is to control where traffic is directed. The addition of aws-weight: "100" on this ingress in the “live-1” cluster changes nothing initially - all of the internet traffic to this domain still goes to your service on the existing cluster. But in a later step we’ll add a second ingress for your domain, on the new “live” cluster. When you increase the aws-weight on that one from 0 upwards, a proportion of the traffic will be directed to your service running on that cluster too.

Step 3 - Migrate your namespace environment to “live”

To migrate your namespace from “live-1” to “live” you’ll need to copy your namespace directory to a new location in the cloud-platform-environments repository.

  • If you’ve not already got a clone of the cloud-platform-environments repository, clone it now, and create a branch.

  • Run the cloud-platform cli command from your namespace directory to migrate your namespace:

  cd namespaces/<namespace-name>
  cloud-platform environment migrate
  • Commit changes to your branch, and create a pull request. This will create your namespace in “live”.

The migrate command copies the folder. In some cases it makes minor changes to the terraform, or alerts you to changes you need to make manually, as described in these points:

  • If you have a ServiceAccount module for GitHub Actions: please follow this guide

  • If you have an Elasticsearch module, it adds an IRSA change, needed for EKS.

  • If your team uses kiam for cross-account IAM roles (for example, Money to Prisoners, or Pathfinder to Analytical Platform teams): you’ll need to use IRSA in EKS. These roles need to be defined inside the environments repo using the guidance here.

  • If your namespace has dependencies on another namespace (e.g. terraform in one namespace creates k8s resources in another namespace), please ask the CP team for advice about resolving it for your circumstance.

If you want to skip any warnings and continue anyway, the cli has a --skip-warnings flag you can enable.

Step 4 - Authenticate to the “live” cluster

Grab a new set of credentials for the “live” cluster from its login page.

You’ll download a new kubecfg.yaml. Copy that to your .kube folder:

    mv ~/Downloads/kubecfg.yaml ~/.kube/config-live

Copy your existing live-1 credentials into a separate file:

    cp ~/.kube/config ~/.kube/config-live-1

To talk to the live cluster copy it into the default location that kubectl looks for credentials:

    cp ~/.kube/config-live ~/.kube/config

To use multiple kubeconfigs at once, follow this guide here

Step 5 - Add a new ingress resource in “live” cluster

Create a new Ingress for “live”, which is almost the same as your existing “live-1” Ingress.

How you do this depends on how you manage your app’s kubernetes config:

Case 1: You have plain YAML files, one for each environment e.g. dev, staging, prod, which you apply using kubectl apply (either with CI/CD or manually on the command-line). Create a duplicate of your folder of YAML files as shown here. e.g. cp -R deploy-dev deploy-dev-live. Now change the annotations in the new live ingress.yaml, as described below.

Case 2: You have a Helm chart, with templated YAML, with a different values.yaml file for each environment. In the ingress.yaml ensure you’ve included the annotations below, and use variables for their values. Create a duplicate of the values.yaml for the environment you’re migrating as shown here. In all of your values.yaml files (one for each environment), provide values for these annotations, as described below. Every environment and cluster will have a unique value for set-identifier.
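For Case 2, here is a hedged sketch of what the per-environment values might look like. The variable names (setIdentifier, awsWeight) and the ingress/namespace names are illustrative; they depend on how your chart templates its ingress annotations:

```yaml
# Illustrative sketch; variable names depend on your chart's templates.
# values-dev.yaml (the environment's existing deployment to "live-1"):
ingress:
  setIdentifier: helloworld-ing-myapp-dev-blue   # <ingress>-<namespace>-blue
  awsWeight: "100"
---
# values-dev-live.yaml (the same environment's new deployment to "live"):
ingress:
  setIdentifier: helloworld-ing-myapp-dev-green  # <ingress>-<namespace>-green
  awsWeight: "0"
```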

Set the annotations in the new (live) ingress to have these two values:

kind: Ingress
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/set-identifier: <ingress-name>-<namespace-name>-green
    external-dns.alpha.kubernetes.io/aws-weight: "0"

NOTE: The colour green is used for the ingress in “live”.


  • set-identifier is set to a string that is unique - no two ingresses, namespaces or clusters can share the same value. This is achieved by substituting in the ingress name, the namespace name and a colour referring to the cluster.
  • aws-weight is set to 0, so that initially no traffic will be sent to the copy of the app running on the “live” cluster. In the later step, we start sending real traffic by increasing this value. (The value for the equivalent environment on the live-1 cluster will be 100, as set in Step 2, so the existing app will continue to get all the traffic for now.)
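As a concrete, hypothetical example with the values substituted in, for an ingress named helloworld-ing in a namespace called myapp-dev, the “live” ingress annotations would look like:

```yaml
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/set-identifier: helloworld-ing-myapp-dev-green
    external-dns.alpha.kubernetes.io/aws-weight: "0"
```

The matching ingress on “live-1” keeps the -blue identifier and weight "100", so all traffic stays there until you raise this weight.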

Once you have created a new live-ingress.yaml with the above two external-dns annotations, you can apply the ingress in the “live” cluster using this command:

  kubectl --context <live-context> -n <namespace-name> apply -f live-ingress.yaml

NOTE: You will see the below warning when you create the new ingress in the “live” cluster; please ignore this warning for now.

Warning: extensions/v1beta1 Ingress is deprecated in v1.19+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress

The “live-1” cluster is running Kubernetes version 1.18, while the “live” cluster is upgraded to Kubernetes version 1.20. The networking.k8s.io/v1 API version of Ingress has been available since v1.19, and the older extensions/v1beta1 API version is no longer served as of v1.22. We will make sure everyone has migrated over to the new API version before we move to v1.22. There are quite a few changes to a number of different APIs in v1.22; we will communicate clear instructions on what needs to change, to what, and by when.

Step 6 - Create a new step to your deployment pipeline

Your application’s continuous deployment pipeline needs a couple of changes:

  • deploy the app to both clusters, not just “live-1”
  • store a second set of secrets, containing Kubernetes credentials for “live”, and use them when deploying to that cluster

Below is guidance for making these changes, if you’ve followed Cloud Platform CI/CD setup guides for CircleCI or GitHub Actions. However teams have been free to make other choices, so they should adapt these accordingly.
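Whatever CI/CD system you use, the shape of the change is the same: run the deploy step once per cluster, with that cluster’s credentials. A generic, hedged sketch (the kubeconfig paths and the deploy command are illustrative placeholders, not the names your pipeline uses):

```shell
# Generic sketch: loop over both clusters, selecting the right credentials
# for each. The kubectl line is commented out because the paths and manifests
# are placeholders for your own setup.
for cluster in live-1 live; do
  echo "deploying to ${cluster} using ~/.kube/config-${cluster}"
  # kubectl --kubeconfig "$HOME/.kube/config-${cluster}" -n <namespace-name> apply -f deploy/
done
```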

Deploy with CircleCI

For teams who use CircleCI for continuous deployment: in Step 3, when you migrated your namespace environment to “live”, a new ServiceAccount was created in the “live” cluster for you to use with CircleCI. This ServiceAccount is created by the serviceaccount.yaml (or equivalent terraform) file in your namespace folder.

For CircleCI to authenticate with the “live” cluster, retrieve the ca.crt and token from the new ServiceAccount created in the “live” cluster, and add them as environment variables in the CircleCI console, using different names for “live”. Make sure kubectl is talking to the ‘live’ cluster when you retrieve them.

NOTE: The cluster_name Environment Variable for the “live” cluster is

To deploy to both “live” and “live-1” you can either:

  • create a separate pipeline to deploy in “live”
  • amend the existing pipeline to add additional steps to authenticate and deploy to “live”. An example config.yml file is available here. It builds an image, authenticates and deploys an application to both the “live-1” and “live” clusters in the same workflow.

NOTE: Make sure you use current versions of the tools image and Circle config, as mentioned in Using CircleCI for Continuous Deployment - setup-kube-auth is deprecated.

Deploy with GitHub Actions

Earlier on in this migration you will have migrated the ServiceAccount module, leading to GitHub Action secrets differentiated by the cluster e.g. KUBE_CLUSTER_LIVE_1.

An example of GitHub Actions config is provided in this repo: It deploys the app to both the live-1 and live clusters.

Step 7 - Copy secrets that are manual

If you have kubernetes secrets that were manually created (not deployed automatically by your application CI/CD or terraform) in live-1, you’ll now need to copy the secrets from live-1 to live, as follows.

It is hard to tell if a secret was created manually, but you can see all the kubernetes secrets in your namespaces on both clusters, in case this helps to identify them:

kubectl --context <live-1-context> -n <namespace> get secrets
kubectl --context <live-context> -n <namespace> get secrets

To copy a secret (including its value) into the new cluster:

kubectl --context <live-1-context> -n <namespace> get secret <secret name> -o json | jq -r '. | {apiVersion, kind, metadata, data, type} | del(.metadata.annotations."kubectl.kubernetes.io/last-applied-configuration", .metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.selfLink, .metadata.uid)' | kubectl --context <live-context> -n <namespace> create -f -
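To see what the jq filter is doing, here is a self-contained sketch using a made-up secret object and a slightly simplified version of the filter. It keeps only the fields needed to recreate the secret, and deletes cluster-specific metadata:

```shell
# Illustrative sketch (the sample secret is made up): only apiVersion, kind,
# metadata, data and type survive, and cluster-specific metadata is deleted.
cat <<'EOF' | jq '{apiVersion, kind, metadata, data, type} | del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid)'
{
  "apiVersion": "v1",
  "kind": "Secret",
  "type": "Opaque",
  "metadata": {
    "name": "example-secret",
    "namespace": "myapp-dev",
    "uid": "0000-1111",
    "resourceVersion": "99",
    "creationTimestamp": "2021-01-01T00:00:00Z"
  },
  "data": { "password": "aHVudGVyMg==" }
}
EOF
```

The result keeps only the secret’s name under metadata, ready to be piped into kubectl create against the other cluster.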

Step 8 - Trigger your pipeline

Trigger your application pipeline. This is normally done manually or by a push to branch.

Step 9 - Test your application

Test your application is working on the new cluster.

This could include:

  • pingdom
  • automated smoke tests - requests should be sent to the new load balancer IP, with a “Host” header set to your service’s hostname.
  • manual browser testing - duplicate your ingress with different host domain and test it.

    To create this ingress, set your KUBECONFIG environment variable to point to your kubeconfig file (if your config is not in ~/.kube/config) and run the cloud-platform cli command:

    kubectl config use-context <live-context>
    cloud-platform duplicate ingress <ingress resource name> -n <namespace>

    If your ingress name is helloworld-ing, then this command will create a duplicate ingress helloworld-ing-duplicate, with a duplicate host, in the given namespace.

Step 10 - Start sending real traffic

Traffic flow is controlled by tweaking the external-dns ingress annotation aws-weight, which determines the proportion of traffic sent to that ingress. Initially no traffic is expected on the new cluster, because live’s ingresses have aws-weight "0". For example, by setting this value to 5 on “live” and 95 on “live-1”, route53 will send 5% of real traffic to the application on “live”.

To apply the changes to the ingress, as a quick alternative to running your application’s CI/CD pipeline, you can edit them directly:

    kubectl edit ingress <ingress-name> -n <namespace-name>

It is advised to send between 1% and 10% of the traffic to the live cluster initially. Once traffic flows into the live cluster and the application behaves as expected, you can send 100% of the traffic into the live cluster and 0% into the live-1 cluster.

Verify 100% of traffic sent to the service on “live” cluster

After you have set the value to “100” for the ingress in “live” and “0” for the ingress in “live-1”, you can verify that all of the traffic is sent to the “live” cluster using Kibana and Grafana.

Get the “hostname” used in your ingress for your namespace

  kubectl get ingress <ingress-name> -n <namespace-name> -o json | jq -r '.spec.rules[].host'

Using the “hostname” from the above command, make a cURL call every 5 seconds to send traffic to your service

  while sleep 5; do curl -I https://<host-name>; done

To view your ingress traffic, log in to grafana-live. In the Kubernetes-ingress traffic dashboard, select your ingress-name and namespace-name from the drop-downs and search: you will see traffic for your ingress on the “live” cluster. Now log in to grafana-live-1 and do the same search; you should not see any traffic for your ingress on the “live-1” cluster.

To view your ingress logs, log in to kibana, select the index live_kubernetes_ingress*, and search for your "host-name": you will see new logs for the host from the “live” cluster. Now select the index live-1_kubernetes_ingress* and do the same search; you should not see any new logs for the host from the “live-1” cluster.

Your namespace is successfully migrated to the “live” cluster once all of the traffic is sent to the “live” cluster.

NOTE: If results are not successful within the live cluster, roll back the annotations to their initial weights (‘100’ on “live-1”, ‘0’ on “live”).

Step 11 - Cleanup old “live-1” namespace

Once you have successfully migrated to the “live” cluster, you can tidy up by scaling down your deployment replicas to 0 and stopping your CI pipeline/job that deploys to the live-1 cluster.

Then remove your (now unused) namespace from the “live-1” directory:

rm -rf ./namespaces/<my-namespace-name>

As usual, commit this change to a branch, create a PR with title Deletion of NS due to migration to live cluster and once approved merge to main.






Cluster differences

Aside from the main change, from a kOps to an EKS managed Kubernetes cluster, we’ve minimized the differences between the two clusters, to smooth migrations. In case the minor differences do affect services in some way, we’ve summarized them in this list.

The migration process takes account of differences that will affect services:

  • Pods that need to make calls to AWS will need to authenticate differently - they now need to use IRSA, rather than kiam. Terraform modules have been adapted to take account of this.

  • Kubernetes Jobs don’t get cleaned up after a TTL. The TTL controller for finished resources (ttlSecondsAfterFinished) is not available in EKS yet. Users who rely on ttlSecondsAfterFinished to clear completed Jobs should instead set up Jobs history limits using failedJobsHistoryLimit and successfulJobsHistoryLimit.
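For the second point, here is a hedged sketch of a CronJob that caps retained Jobs with history limits instead of ttlSecondsAfterFinished. The name, schedule, image and command are illustrative:

```yaml
apiVersion: batch/v1beta1         # CronJob API version served on these cluster versions
kind: CronJob
metadata:
  name: example-report            # illustrative name
spec:
  schedule: "0 6 * * *"
  successfulJobsHistoryLimit: 3   # keep at most 3 completed Jobs
  failedJobsHistoryLimit: 1       # keep at most 1 failed Job
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: alpine       # illustrative image
              command: ["echo", "done"]
```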

Other differences which we think will not affect services:

  • EKS manages the control plane (master) nodes instead of kOps.

  • Kubernetes version is 1.19 instead of 1.18. No YAML needs to change because no k8s APIs have been removed in this upgrade.

  • Pod networking is implemented by Amazon’s VPC Container Network Interface (CNI) plugin for Kubernetes. Pods are assigned IP addresses associated with Elastic Network Interfaces on worker nodes. The IPs come from the VPC network pool and therefore do not require NAT to access resources outside the Kubernetes cluster. This replaces the platform’s use of Calico CNI, which uses an overlay network.

How to roll back

Easy roll back is a key feature of this migration process. At any point you can revert to using your service on the original “live-1” cluster. This may be necessary if you encounter problems running the namespace on the new cluster and need to postpone the migration until they are resolved.

To roll back, go through the process in reverse, depending on how far you got:

  • Step 11 “Clean-up old ‘live-1’ namespace” - Revert the pull request that deleted your live-1 folder. It doesn’t get applied yet, because of the skip file.

  • Step 10 - “Start sending real traffic” - Change the ingress weightings to 0 on “live”, and 100 on “live-1”. This directs all traffic to the live-1 cluster. There may be a lag, as users’ DNS caches update.

At this point your application is running as it was pre-migration

  • Step 9 - Test your application - It’s a good idea to check it works fine.

  • (Step 8 - Trigger your pipeline - Not needed - your app is still deployed on the old cluster)

  • Step 7 - Copy manually created secrets - Not needed - the secrets still exist in live-1. Revert any PRs raised for secrets mentioned in other namespaces in the live-1 folder

  • Step 6 - “Create a new step to your deployment pipeline” - You need to ensure your application’s CI/CD deploys:

    • just to the old cluster “live-1”
    • using the old names of credentials (not those with the “_live1” suffix)
  • (Step 5 - Add a new ingress resource - Not needed - you can leave this ingress in place)

  • (Step 4 - Authenticate to the “live” cluster - Not needed - you can keep your kubectl context for the new cluster)

  • Step 3 - “Migrate your namespace environment to ‘live’” - Move the ‘skip file’ from your namespace under the “live-1” folder into your namespace under the “live” folder:

  cd namespaces

Once merged to main, the Apply Pipeline will be configured to apply the namespace settings in your “live-1” folder, rather than your “live” folder. (You must have the skip file in one of the folders; otherwise there will be errors on every terraform apply, because the terraform state is shared between the Apply Pipelines for live and live-1, and applying both would conflict on every run, since the terraform differs slightly - IRSA etc.)

Important: The ServiceAccount credentials used by your application CI/CD will have been rotated by the Apply Pipeline within a couple of minutes of merging the skip file change, so your application’s CI/CD will need its secrets updated. If you use GitHub Actions, the ServiceAccount terraform module will automatically update the GHA secrets. However, if you use CircleCI or any other way of deploying, you’ll need to obtain the fresh token from “live-1” and provide it to your CI, replacing the existing token secret with this new value.

At this point your application can be re-deployed again normally

At this point you are also free to make changes to your environment under the “live-1” folder

  • (Step 2 - Amend your ingress resource - Not needed - you can leave the annotation in - it won’t change anything)

  • (Step 1 - Agree a time to perform the migration - You can let your team know that they are no longer restricted from changing the environment config)

Frequently asked questions

How to ensure the alerts are migrated?

When you migrate your namespace to the live cluster in Step 3, the namespace files related to monitoring and alerting (such as ServiceMonitors, NetworkPolicies and PrometheusRules) in the environments repo are copied and applied to the live cluster.

You can confirm this using kubectl command. For example, to check the prometheusrules, do kubectl get prometheusrules -n <yournamespace> in the live cluster.

After you deploy your application to the live cluster, in the Cloud Platform Prometheus targets, Rules and Service Discovery pages, ensure your application is listed. In the Cloud Platform Alertmanager -> Status, ensure your application severity, receiver and Channel information is added to the list.

How to check the set-identifier and aws-weight annotations are set correctly

Once you have the set-identifier and aws-weight annotations added to your live-1 ingress and applied to live-1 cluster, you can check this reports page and your ingress name should not be on the list.

Do I need to update my ingress spec hosts for live?

No - your current ingress hostnames (the generated hostname and your custom hostname) can remain in your deployment to the live cluster.

Does the new EKS cluster have different external IP addresses?

  • Inbound traffic - Yes. Traffic to the new EKS cluster comes into live via ingress NLBs which are dedicated to the cluster. So the domain name for your service will point at these new IP addresses when you start to redirect traffic during: ‘Step 10 - Start sending real traffic’.

  • Outbound traffic - No. The new EKS cluster is in the same VPC of the existing live-1 cluster. Requests leave the VPC using the same three NAT instances. Hence there is no change in the IP addresses from which the traffic will be sent.


If you have questions, please ask us on Slack:

  • #cloud-platform-eks-migration-plan-testing if you’re in this group
  • #ask-cloud-platform

Feedback welcome

We’d love to receive feedback on this process, to make it as low friction as possible. Please drop us a message on Slack:

  • #cloud-platform-eks-migration-plan-testing if you’re in this group
  • DM to @antony.bishop
This page was last reviewed on 27 October 2021. It needs to be reviewed again on 27 January 2022 .