Disaster recovery process of a Cloud instance

SOC2/CI-82

Report from failover test on 28th of November 2022
Report from disaster recovery test on 30th-31st of January 2024

GKE cluster zone failover

  • export environment variables
export ENVIRONMENT=[dev|prod]
export SLUG=<SLUG>
export GKE_NAME=$(mi2 instance get -e $ENVIRONMENT --slug $SLUG | jq -r '.status.gcp.gkeClusters[0].name')
export GKE_REGION=$(mi2 instance get -e $ENVIRONMENT --slug $SLUG | jq -r '.status.gcp.region')
export GCP_PROJECT=$(mi2 instance get -e $ENVIRONMENT --slug $SLUG | jq -r '.status.gcp.projectId')
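The lookups above call mi2 instance get three times; an optional equivalent is to fetch the instance JSON once and extract the same jq paths locally, for example:

# single mi2 call, then reuse the JSON for every field
INSTANCE_JSON=$(mi2 instance get -e $ENVIRONMENT --slug $SLUG)
export GKE_NAME=$(echo "$INSTANCE_JSON" | jq -r '.status.gcp.gkeClusters[0].name')
export GKE_REGION=$(echo "$INSTANCE_JSON" | jq -r '.status.gcp.region')
export GCP_PROJECT=$(echo "$INSTANCE_JSON" | jq -r '.status.gcp.projectId')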
  • extract the instance from Control Plane if cloud.sourcegraph.com/control-plane-mode=true is in config.yaml

Follow the Extract instance from control plane (break glass) section from the Ops Dashboard of the instance, go/cloud-ops

  • check instance is healthy
mi2 instance check --slug $SLUG -e $ENVIRONMENT pods-health
curl -sSL --fail https://$SLUG.sourcegraphcloud.com/sign-in -i
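If the checks are run immediately after extraction and come back flaky, an optional retry wrapper around the same two commands (plain shell, nothing beyond what is shown above) can be used:

# retry the health checks a few times before treating the instance as unhealthy
for attempt in 1 2 3; do
  if mi2 instance check --slug $SLUG -e $ENVIRONMENT pods-health \
    && curl -sSL --fail https://$SLUG.sourcegraphcloud.com/sign-in -o /dev/null; then
    echo "instance is healthy"
    break
  fi
  echo "health check attempt $attempt failed, retrying in 30s"
  sleep 30
done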
  • connect to cluster
mi2 instance workon -e $ENVIRONMENT --slug $SLUG -exec
  • verify node zone
kubectl get nodes
kubectl describe node <NODE_FROM_CLUSTER> | grep zone
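Alternatively, the zone of every node can be listed in one command via the standard topology label (older clusters may expose it as failure-domain.beta.kubernetes.io/zone instead):

# show each node's zone as an extra column
kubectl get nodes -L topology.kubernetes.io/zone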
  • perform zone failover (remove node zone from GKE node locations)

NOTE ON TARGET ZONES: gcloud container node-pools describe returns the list of zones into which the node pool can be deployed. The output of the kubectl describe node command above shows which of those zones is actually in use.

The TARGET_ZONE placeholder takes a list of zones into which the node pool should be deployed. You should remove the failed zone from this list (and add new zones as needed). For instance, if the current node pool spans us-central1-a and us-central1-c and the active node is provisioned in us-central1-a, you can fail over to us-central1-c by removing us-central1-a from the list; see the worked example after the commands below.

gcloud container node-pools list --cluster $GKE_NAME --region $GKE_REGION --project $GCP_PROJECT
gcloud container node-pools describe primary --cluster $GKE_NAME --region $GKE_REGION --project $GCP_PROJECT --format json | jq '.locations'
gcloud container node-pools update primary --cluster $GKE_NAME --region $GKE_REGION --project $GCP_PROJECT --node-locations <TARGET_ZONE> --async
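As a concrete illustration of the note above (zone names are hypothetical), failing over from us-central1-a to us-central1-c would look like:

# hypothetical example: drop the failed us-central1-a, keep us-central1-c
gcloud container node-pools update primary --cluster $GKE_NAME --region $GKE_REGION --project $GCP_PROJECT --node-locations us-central1-c --async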
  • verify pods were terminated
kubectl get pods # should show failing pods because the node was drained
mi2 instance check --slug $SLUG -e $ENVIRONMENT pods-health # should fail
  • wait for new node to be ready
kubectl get nodes # waiting for new node
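The replacement node can take a few minutes to join; instead of re-running the command by hand, you can watch until it appears, reusing the zone label column from the earlier step:

# watch the node list until the replacement node shows up and reports Ready
kubectl get nodes -L topology.kubernetes.io/zone --watch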
  • verify new node zone
kubectl describe node <NEW_NODE> | grep zone # should be different from previous node
  • check instance is healthy
mi2 instance check --slug $SLUG -e $ENVIRONMENT pods-health
curl -sSL --fail https://$SLUG.sourcegraphcloud.com/sign-in -i
  • backfill the instance into Control Plane if cloud.sourcegraph.com/control-plane-mode=true is in config.yaml

Follow the Backfill instance into control plane section from the Ops Dashboard of the instance, go/cloud-ops

CloudSQL zone failover

  • export environment variables
export ENVIRONMENT=[dev|prod]
export SLUG=<SLUG>
export CLOUDSQL_INSTANCE_NAME=$(mi2 instance get -e $ENVIRONMENT --slug $SLUG | jq -r '.status.gcp.cloudSQL[0].name')
export GCP_PROJECT=$(mi2 instance get -e $ENVIRONMENT --slug $SLUG | jq -r '.status.gcp.projectId')
export INSTANCE_ID=$(mi2 instance get -e $ENVIRONMENT --slug $SLUG | jq -r '.metadata.name')
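Before continuing, it can be worth confirming that each lookup actually resolved (jq prints the literal string null when a field is missing); a small bash sanity check:

# fail fast if any of the exported values is empty or the literal string "null"
for var in CLOUDSQL_INSTANCE_NAME GCP_PROJECT INSTANCE_ID; do
  if [ -z "${!var}" ] || [ "${!var}" = "null" ]; then
    echo "ERROR: $var did not resolve, check SLUG/ENVIRONMENT" >&2
  fi
done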
  • extract the instance from Control Plane if cloud.sourcegraph.com/control-plane-mode=true is in config.yaml

Follow the Extract instance from control plane (break glass) section from the Ops Dashboard of the instance, go/cloud-ops

  • check instance is healthy
mi2 instance check --slug $SLUG -e $ENVIRONMENT pods-health
curl -sSL --fail https://$SLUG.sourcegraphcloud.com/sign-in -i
  • export environment variables
export FAILOVER_ZONE=<new target zone>
  • patch CloudSQL instance to use different zone
gcloud sql instances describe $CLOUDSQL_INSTANCE_NAME --project $GCP_PROJECT | grep zone
# returns actual CloudSQL zone
mi2 instance edit --jq '.spec.infrastructure.gcp.zone = "'$FAILOVER_ZONE'"' --slug $SLUG -e $ENVIRONMENT
mi2 generate cdktf -e $ENVIRONMENT --slug $SLUG
cd environments/$ENVIRONMENT/deployments/$INSTANCE_ID/terraform/stacks/sql
terraform init && terraform apply -auto-approve

gcloud sql instances describe $CLOUDSQL_INSTANCE_NAME --project $GCP_PROJECT | grep zone
# should return <FAILOVER_ZONE>
cd -
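If the describe call still shows the old zone right after the apply, an optional polling sketch (using the gceZone field from gcloud sql instances describe) waits for the move to complete:

# poll until CloudSQL reports the failover zone
until [ "$(gcloud sql instances describe $CLOUDSQL_INSTANCE_NAME --project $GCP_PROJECT --format 'value(gceZone)')" = "$FAILOVER_ZONE" ]; do
  echo "waiting for CloudSQL instance to move to $FAILOVER_ZONE"
  sleep 30
done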
  • check instance is healthy
mi2 instance check --slug $SLUG -e $ENVIRONMENT pods-health
curl -sSL --fail https://$SLUG.sourcegraphcloud.com/sign-in -i

The steps below are optional; perform them only if the CloudSQL disk was lost.

  • restore backup in different zone
mi2 instance sql-backup list --slug $SLUG -e $ENVIRONMENT
# set SQL_BACKUP_ID to the ID of the chosen backup from the list above
mi2 instance sql-restore create --backup-id $SQL_BACKUP_ID --slug $SLUG -e $ENVIRONMENT
# wait until the restore is ready
# check status with: mi2 instance sql-restore list --slug $SLUG -e $ENVIRONMENT
gcloud sql instances describe $CLOUDSQL_INSTANCE_NAME --project $GCP_PROJECT
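The restore can take a while and the database is not usable until CloudSQL reports the instance as RUNNABLE again; an optional polling sketch on the instance state:

# wait for the restored CloudSQL instance to return to the RUNNABLE state
until [ "$(gcloud sql instances describe $CLOUDSQL_INSTANCE_NAME --project $GCP_PROJECT --format 'value(state)')" = "RUNNABLE" ]; do
  echo "CloudSQL instance not RUNNABLE yet, waiting 30s"
  sleep 30
done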
  • check instance is healthy
mi2 instance check --slug $SLUG -e $ENVIRONMENT pods-health
curl -sSL --fail https://$SLUG.sourcegraphcloud.com/sign-in -i
  • backfill the instance into Control Plane if cloud.sourcegraph.com/control-plane-mode=true is in config.yaml

Follow the Backfill instance into control plane section from the Ops Dashboard of the instance, go/cloud-ops