Upgrading a Cloud instance
SOC2/CI-108
First, create the upgrade tracking issue. Follow the instructions provided by the command below.
mi2 env create-tracking-issue -e prod $TARGET_VERSION
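The command reads $TARGET_VERSION from your shell, so export it first; for example (4.1.0 mirrors the example value used later on this page):
export TARGET_VERSION=4.1.0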
Option 1 - GitHub Action
An automated upgrade can be initiated via the Upgrade Instance GitHub action workflow.
A pull request will be created and linked to the upgrade tracking issue automatically upon successful completion of the job run.
Option 2 - Batch automated upgrade
To upgrade a specific instance
export SLUG=company
export ENVIRONMENT=prod
export TARGET_VERSION=4.1.0
gh workflow run mi_upgrade.yml -f customer=$SLUG -f environment=$ENVIRONMENT -f target_src_version=$TARGET_VERSION
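If you want to follow the dispatched run from the terminal instead of the Actions tab, something like the following should work (a sketch using standard gh subcommands; adjust the filters to taste):
# List recent runs of the upgrade workflow, then attach to one interactively
gh run list --workflow=mi_upgrade.yml --limit 5
gh run watch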
To upgrade all instances within an environment
Using mi2, you can generate commands to trigger automated upgrades for all instances:
for internal instances:
mi2 env workflow run -e prod -filter '.metadata.labels."instance-type" == "internal" and .spec.sourcegraphVersion != "'$TARGET_VERSION'"' -format text upgrade-instance
for production instances:
mi2 env workflow run -e prod -filter '.metadata.labels."instance-type" == "production" and .spec.sourcegraphVersion != "'$TARGET_VERSION'"' -format text upgrade-instance
for trial instances:
mi2 env workflow run -e prod -filter '.metadata.labels."instance-type" == "trial" and .spec.sourcegraphVersion != "'$TARGET_VERSION'"' -format text upgrade-instance
for all instances:
mi2 env workflow run -e prod -filter '.spec.sourcegraphVersion != "'$TARGET_VERSION'"' -format text upgrade-instance
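These invocations only generate the commands; they do not run them. A minimal sketch of reviewing and then executing the generated output, assuming the -format text output is one shell command per line as the surrounding prose suggests:
# Save the generated commands, review them, then execute
mi2 env workflow run -e prod -filter '.spec.sourcegraphVersion != "'$TARGET_VERSION'"' -format text upgrade-instance > upgrade-commands.sh
less upgrade-commands.sh   # sanity check before running
bash upgrade-commands.sh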
This automated workflow will generate a pull request for each instance to represent the upgrade that:
- Links to full logs of the automated upgrade process (retained for 90 days)
- Embeds a summary of an automated full instance healthcheck (retained permanently)
- Links to the tracking issue associated with the upgrade
To review currently open PRs for successful instance upgrades using gh:
# Sanity check to see that the PRs correspond to instances you have upgraded
gh pr list --label 'mi-upgrade'
# Save the list of PRs you are going to work with
gh pr list --label 'mi-upgrade' --json number --jq '.[].number' > upgrade-prs.txt
# Review the test plan of each PR (press 'q' to review the next one)
cat upgrade-prs.txt | xargs -n1 gh pr view
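If you also want to confirm that CI is green on each PR before approving, gh can report check status (an optional addition to the review loop above):
# Show check status for each upgrade PR
cat upgrade-prs.txt | xargs -n1 gh pr checks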
If an upgrade fails, follow the logs to figure out which step went wrong, then follow the manual upgrade steps to finish the upgrade.
If all is well, approve and merge each instance upgrade:
# Approve each PR
cat upgrade-prs.txt | xargs -n1 gh pr review --approve
# Merge each PR
cat upgrade-prs.txt | xargs -n1 gh pr merge --squash
Finally, update the tracking issue.
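How you update the tracking issue is up to you; one possible sketch with gh, using a hypothetical issue number (substitute the tracking issue created at the top of this page):
# 1234 is a placeholder for the actual tracking issue number
gh issue comment 1234 --body "All instances upgraded to $TARGET_VERSION; upgrade PRs merged."
gh issue close 1234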
Option 3 - Manual upgrade
Prereq
Follow https://github.com/sourcegraph/controller#installation to install mi2
Set environment variables
Sharing TF_TOKEN_app_terraform_io is only temporary; this is expected to change in the future.
TARGET_VERSION could be one of: a release version, a release candidate version, or latest. When latest is provided, mi2 will automatically figure out the latest main branch tag and persist it in config.yaml.
Bash
export SLUG=company
export ENVIRONMENT=dev
export TF_TOKEN_app_terraform_io=$(gcloud secrets versions access latest --project=sourcegraph-secrets --secret=TFC_TEAM_TOKEN)
export TARGET_VERSION=4.1.0
Fish
set -x SLUG company
set -x ENVIRONMENT dev
set -x TF_TOKEN_app_terraform_io (gcloud secrets versions access latest --project=sourcegraph-secrets --secret=TFC_TEAM_TOKEN)
set -x TARGET_VERSION 4.1.0
Check out a new branch
git checkout -b $SLUG/upgrade-instance
All commands below are executed under the instance root and use auto inference to locate the instance, hence the -e and -slug flags are omitted.
export INSTANCE_ID=$(mi2 instance get -e $ENVIRONMENT --slug $SLUG | jq -r '.metadata.name')
cd environments/$ENVIRONMENT/deployments/$INSTANCE_ID/
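Optionally, confirm the instance's current version before upgrading; the .spec.sourcegraphVersion path below mirrors the filter expressions used in Option 2:
# Print the currently deployed version recorded for this instance
mi2 instance get -e $ENVIRONMENT --slug $SLUG | jq -r '.spec.sourcegraphVersion'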
Upgrade instance - database migration
This will perform:
- health check
- bump version in config.yaml and regenerate deployment artifacts (e.g. kustomize, cdktf)
- backup both the GKE cluster state (e.g. manifest, disks) and Cloud SQL to create a checkpoint
- deploy migrator as a k8s Job to perform database migration
mi2 instance upgrade -target-version $TARGET_VERSION
Confirm workloads are healthy
mi2 instance check pods-health
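If the health check fails at this point, inspecting the migration Job directly can help. This is only a sketch: it assumes your kubectl context already points at the instance cluster (see mi2 instance workon -exec in the next step) and that the Job is named migrator, per the step description above.
# Look for the database migration Job and read its logs
kubectl get jobs
kubectl logs job/migrator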
Upgrade instance - workload update
Switch to the right GKE cluster and configure the kubectl context
mi2 instance workon -exec
Deploy the new workload
cd kubernetes
kustomize build --load-restrictor LoadRestrictionsNone --enable-helm . | kubectl apply -f -
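You can optionally watch the rollout before moving on to the health check. The deployment name below is illustrative; sourcegraph-frontend is an assumption, not something this page specifies.
# Watch pods converge; Ctrl-C once everything is Running
kubectl get pods --watch
# Or wait on a specific deployment (name is a placeholder)
kubectl rollout status deploy/sourcegraph-frontend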
Upgrade instance - health check
Confirm workloads are healthy and the remote version matches the target version
mi2 instance check pods-health src-application-version
Open a pull request
You should expect changes to be made to:
- config.yaml
- kubernetes/*
- terraform/stacks/*
- charts/*
Commit all changes
git add .
git commit -m "$SLUG: upgrade to $TARGET_VERSION"
gh pr create --title "$SLUG: upgrade to $TARGET_VERSION"
In the PR, include the output of
mi2 instance check pods-health
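One way to capture that output and attach it to the PR you just opened (a sketch; gh pr comment targets the PR of the current branch):
# Capture the health check output and post it as a PR comment
mi2 instance check pods-health | tee /tmp/pods-health.txt
gh pr comment --body-file /tmp/pods-health.txt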
Merge the pull request
Make sure to merge the pull request in time so we can roll out changes to the executors terraform module. The module is configured to auto-apply, hence no further action is required.
Rollback
Depending on the failure scenario, we have different fallback strategies.
Database migration failed
Our database schema is supposed to be at least n-2 compatible (where n is the current minor version).
A failing database migration usually does not have a direct impact on customers, and we do not roll out workloads until the database migration goes through.
For now, follow how to troubleshoot dirty databases to resolve the issue.
In the future, we will add more commands to the mi2 CLI to interact with migrator directly, or even provide an auto-fix solution in the mi2 instance upgrade command.
New workload failed to start
We use rolling updates, and usually the older pods will still be serving traffic until the new pods are healthy, with the exception of StatefulSets.
If it is causing breakage to the services (e.g. uptime check is failing), we should aim to rollback as soon as possible.
It could be as simple as reverting changes to the generated k8s manifests:
git restore --source=origin/main --worktree kubernetes
kustomize build --load-restrictor LoadRestrictionsNone --enable-helm . | kubectl apply -f -
Or you can roll back an individual deploy/sts:
kubectl rollout undo deploy <>
kubectl rollout undo sts <>
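For example, with a hypothetical workload name (sourcegraph-frontend is a placeholder; substitute the deployment or statefulset that is actually unhealthy):
# Roll back to the previous revision of a placeholder deployment
kubectl rollout undo deploy sourcegraph-frontend
# Or target a specific revision if the previous one is also bad
kubectl rollout undo deploy sourcegraph-frontend --to-revision=2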
If there is a data corruption or some other more complicated cases, follow the restore process.