Disaster recovery process for a Cloud instance

SOC2/CI-82

Report from failover test on 28th of November 2022
Report from disaster recovery test on 30th-31st of January 2024

GKE cluster zone failover

  • export environment variables
export ENVIRONMENT=[dev|prod]
export SLUG=<SLUG>
export GKE_NAME=$(mi2 instance get -e $ENVIRONMENT --slug $SLUG | jq -r '.status.gcp.gkeClusters[0].name')
export GKE_REGION=$(mi2 instance get -e $ENVIRONMENT --slug $SLUG | jq -r '.status.gcp.region')
export GCP_PROJECT=$(mi2 instance get -e $ENVIRONMENT --slug $SLUG | jq -r '.status.gcp.projectId')
  • extract the instance from Control Plane if cloud.sourcegraph.com/control-plane-mode=true is in config.yaml

Follow the Extract instance from control plane (break glass) section from the Ops Dashboard of the instance, go/cloud-ops

  • check instance is healthy
mi2 instance check --slug $SLUG -e $ENVIRONMENT pods-health
curl -sSL --fail https://$SLUG.sourcegraph.com/sign-in -i
  • connect to cluster
mi2 instance workon -e $ENVIRONMENT --slug $SLUG -exec
  • verify node zone
kubectl get nodes
kubectl describe node <NODE_FROM_CLUSTER> | grep zone
  • perform zone failover (remove node zone from GKE node locations)
gcloud container node-pools list --cluster $GKE_NAME --region $GKE_REGION --project $GCP_PROJECT
gcloud container node-pools describe primary --cluster $GKE_NAME --region $GKE_REGION --project $GCP_PROJECT --format json | jq '.locations'
gcloud container node-pools update primary --cluster $GKE_NAME --region $GKE_REGION --project $GCP_PROJECT --node-locations <DIFFERENT_ZONE_THAN_EXISTING_NODE> --async
  • verify pods were terminated
kubectl get pods # should show failing pods, b/c node was drained
mi2 instance check --slug $SLUG -e $ENVIRONMENT pods-health # should fail
  • wait for new node to be ready
kubectl get nodes # waiting for new node
  • verify new node zone
kubectl describe node <NEW_NODE> | grep zone # should be different from previous node
  • check instance is healthy
mi2 instance check --slug $SLUG -e $ENVIRONMENT pods-health
curl -sSL --fail https://$SLUG.sourcegraph.com/sign-in -i
  • backfill the instance into Control Plane if cloud.sourcegraph.com/control-plane-mode=true is in config.yaml

Follow the Backfill instance into control plane section from the Ops Dashboard of the instance, go/cloud-ops
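
The "wait for new node to be ready" and "check instance is healthy" steps above are polled by hand. The following is a minimal sketch of a retry helper for those steps, assuming a bash-compatible shell. The name wait_until and its arguments are hypothetical; it is not part of mi2, kubectl, or gcloud.

```shell
# wait_until ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it exits 0 or ATTEMPTS runs have failed.
# Returns 0 on the first success, 1 if the command never succeeded.
wait_until() {
  local attempts=$1
  local sleep_seconds=$2
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$sleep_seconds"
    fi
  done
  return 1
}
```

For example, wait_until 30 10 mi2 instance check --slug $SLUG -e $ENVIRONMENT pods-health would retry the health check every 10 seconds for up to 30 attempts. The attempt count and interval are illustrative, not values taken from the failover tests.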

CloudSQL zone failover

  • export environment variables
export ENVIRONMENT=[dev|prod]
export SLUG=<SLUG>
export CLOUDSQL_INSTANCE_NAME=$(mi2 instance get -e $ENVIRONMENT --slug $SLUG | jq -r '.status.gcp.cloudSQL[0].name')
export GCP_PROJECT=$(mi2 instance get -e $ENVIRONMENT --slug $SLUG | jq -r '.status.gcp.projectId')
export INSTANCE_ID=$(mi2 instance get -e $ENVIRONMENT --slug $SLUG | jq -r '.metadata.name')
  • extract the instance from Control Plane if cloud.sourcegraph.com/control-plane-mode=true is in config.yaml

Follow the Extract instance from control plane (break glass) section from the Ops Dashboard of the instance, go/cloud-ops

  • check instance is healthy
mi2 instance check --slug $SLUG -e $ENVIRONMENT pods-health
curl -sSL --fail https://$SLUG.sourcegraph.com/sign-in -i
  • patch CloudSQL instance to use different zone
gcloud sql instances describe $CLOUDSQL_INSTANCE_NAME --project $GCP_PROJECT | grep zone
# returns the current CloudSQL zone; set FAILOVER_ZONE to a different zone in the same region before continuing
mi2 instance edit --jq '.spec.infrastructure.gcp.zone = "'$FAILOVER_ZONE'"' --slug $SLUG -e $ENVIRONMENT
mi2 generate cdktf -e $ENVIRONMENT --slug $SLUG
cd environments/$ENVIRONMENT/deployments/$INSTANCE_ID/terraform/stacks/sql
terraform init && terraform apply -auto-approve

gcloud sql instances describe $CLOUDSQL_INSTANCE_NAME --project $GCP_PROJECT | grep zone
# should return <FAILOVER_ZONE>
cd -
  • check instance is healthy
mi2 instance check --slug $SLUG -e $ENVIRONMENT pods-health
curl -sSL --fail https://$SLUG.sourcegraph.com/sign-in -i
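
The patch step above expects FAILOVER_ZONE to name a zone other than the one CloudSQL currently runs in. The following is a minimal sketch of a helper that picks such a zone from a list supplied by the operator, assuming a bash-compatible shell. The name pick_failover_zone is hypothetical, and the candidate zones must be chosen from the instance's own region.

```shell
# pick_failover_zone CURRENT_ZONE CANDIDATE_ZONE...
# Prints the first candidate that differs from CURRENT_ZONE.
# Returns 1 (and prints nothing) if every candidate equals CURRENT_ZONE.
pick_failover_zone() {
  local current=$1
  shift
  local zone
  for zone in "$@"; do
    if [ "$zone" != "$current" ]; then
      printf '%s\n' "$zone"
      return 0
    fi
  done
  return 1
}
```

For example, export FAILOVER_ZONE=$(pick_failover_zone us-central1-a us-central1-a us-central1-b us-central1-f) sets FAILOVER_ZONE to us-central1-b. The zone names here are examples only.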

The steps below are optional; perform them only if the CloudSQL disk was lost.

  • restore backup in different zone
mi2 instance sql-backup list --slug $SLUG -e $ENVIRONMENT
# set SQL_BACKUP_ID to the ID of the backup to restore, taken from the output of the list command
mi2 instance sql-restore create --backup-id $SQL_BACKUP_ID --slug $SLUG -e $ENVIRONMENT
# wait until ready
# can check status with command: mi2 instance sql-restore list --slug $SLUG -e $ENVIRONMENT
gcloud sql instances describe $CLOUDSQL_INSTANCE_NAME --project $GCP_PROJECT
  • check instance is healthy
mi2 instance check --slug $SLUG -e $ENVIRONMENT pods-health
curl -sSL --fail https://$SLUG.sourcegraph.com/sign-in -i
  • backfill the instance into Control Plane if cloud.sourcegraph.com/control-plane-mode=true is in config.yaml

Follow the Backfill instance into control plane section from the Ops Dashboard of the instance, go/cloud-ops
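
Several commands in this procedure depend on variables that are either derived with jq or set by hand (for example FAILOVER_ZONE and SQL_BACKUP_ID). The following is a minimal sketch of a guard that can be run before the destructive steps, assuming a bash-compatible shell. The name require_vars is hypothetical. It treats the literal string "null" as missing, because jq -r prints null when a field is absent.

```shell
# require_vars NAME...
# Reports every variable that is unset, empty, or the literal string "null"
# (what `jq -r` prints for an absent field). Returns 1 if any are missing.
require_vars() {
  local missing=0 name value
  for name in "$@"; do
    # Indirect lookup of the variable called $name; names are trusted input.
    eval "value=\${$name:-}"
    if [ -z "$value" ] || [ "$value" = "null" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, require_vars ENVIRONMENT SLUG CLOUDSQL_INSTANCE_NAME GCP_PROJECT INSTANCE_ID FAILOVER_ZONE could be run before the CloudSQL patch step, so that an empty value is caught before terraform apply.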