How to resolve a “ is deleted entirely” incident

Assess what has been deleted

  • Navigate to the sourcegraph-dev project and look at the existing Kubernetes clusters. Does the cloud cluster still exist?
    • No, the cloud cluster is gone:
      • Do the disks for the now-deleted cluster nodes still exist? Check by navigating to Compute -> Disks and searching for pgsql-prod---cloud.
        • Yes, the disks still exist: Go to Recreating GKE cluster and follow the "With existing disks" steps.
        • No, the disks are gone: Go to Recreating GKE cluster and follow the "From snapshots" steps.
    • Yes, the cloud cluster exists: Go to Recreating Kubernetes objects
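
The decision tree above can be written down as a small function, which may help when turning this runbook into a checklist or script. This is only a sketch: the names NextStep and choose_next_step are hypothetical and not part of any existing tooling, and the two boolean inputs are assumed to come from the manual checks described above.

```python
from enum import Enum


class NextStep(Enum):
    # Values mirror the section and step names used in this runbook.
    RECREATE_CLUSTER_WITH_EXISTING_DISKS = "Recreating GKE cluster: with existing disks"
    RECREATE_CLUSTER_FROM_SNAPSHOTS = "Recreating GKE cluster: from snapshots"
    RECREATE_KUBERNETES_OBJECTS = "Recreating Kubernetes objects"


def choose_next_step(cluster_exists: bool, disks_exist: bool) -> NextStep:
    """Map the two manual checks onto the section to follow next."""
    if cluster_exists:
        # The cluster survived, so only the Kubernetes objects are missing.
        return NextStep.RECREATE_KUBERNETES_OBJECTS
    if disks_exist:
        # Cluster is gone but the statically-named disks can be reattached.
        return NextStep.RECREATE_CLUSTER_WITH_EXISTING_DISKS
    # Neither cluster nor disks exist: data must come from snapshots.
    return NextStep.RECREATE_CLUSTER_FROM_SNAPSHOTS
```

For example, choose_next_step(cluster_exists=False, disks_exist=True) returns NextStep.RECREATE_CLUSTER_WITH_EXISTING_DISKS.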

Recreating GKE cluster

We use Terraform to manage our deployments.

  1. Navigate to the cloud repo
  2. Follow the instructions there to run terraform plan to see if the infrastructure has drifted from what is specified there.
  3. With existing disks, recreate the Kubernetes objects:
     a. Do NOT run as it will try to recreate the statically-named disks.
     b. Run and most things should come online but will still be inaccessible.
     c. Run configure/ingress-nginx/ to install the nginx ingress.
     d. Expose the cluster by running ONLY the kubectl expose commands found in
     e. Go to Confirm health of
  4. From snapshots, recreate the Kubernetes objects:
     a. Since nothing exists, run and it will recreate everything including disks.
     b. should now be accessible, but with no postgres, redis-store, or precise-code-intel-bundle-manager data present.
     c. Restore pgsql disks from the latest pgsql-prod---cloud compute snapshot, redis-store---cloud snapshot, and bundle-manager---cloud snapshot:
        • TODO: this section should be more explicit about what needs to be done.
     d. Go to Confirm health of
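
For the drift check in step 2, one possible way to make the result of terraform plan unambiguous is the -detailed-exitcode flag, which reports through the exit code whether the plan contains changes. The sketch below is an assumption layered on top of this runbook, not the procedure documented in the cloud repo, which takes precedence. The function names and the terraform_dir argument are hypothetical.

```python
import subprocess

# Exit codes that `terraform plan -detailed-exitcode` uses.
PLAN_EXIT_MEANINGS = {
    0: "no drift: infrastructure matches the configuration",
    1: "error: terraform plan failed",
    2: "drift: the plan contains changes",
}


def interpret_plan_exit_code(code: int) -> str:
    """Translate a `terraform plan -detailed-exitcode` exit code into words."""
    return PLAN_EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def check_drift(terraform_dir: str) -> str:
    """Run a read-only plan in terraform_dir (hypothetical path) and report drift.

    This only plans; it never applies anything.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=terraform_dir,
        check=False,  # exit code 2 is an expected outcome, not a failure
    )
    return interpret_plan_exit_code(result.returncode)
```

Exit code 2 is expected after a deletion incident, so the wrapper deliberately does not raise on a non-zero exit code.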

Recreating Kubernetes objects

  1. Navigate to the cloud cluster on the Google Cloud console, click Connect, and run the gcloud command it gives you.
  2. Run kubectl -n prod get deployments. It should show partial or no Kubernetes deployments, but a successful response confirms that you are connected to the right cluster.
  3. In the repository’s latest release branch, run which will recreate all Kubernetes objects.
  • uses static disk attachments, so the volumes should still be valid and no data should have been lost.
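
Once the objects have been recreated, the same kubectl -n prod get deployments command with -o json added can be used to list deployments that are not yet fully ready. The helper below is a minimal sketch for that purpose. The function name unready_deployments is hypothetical, and it assumes the standard Deployment fields spec.replicas and status.readyReplicas.

```python
import json
from typing import List


def unready_deployments(kubectl_json: str) -> List[str]:
    """Return names of deployments whose ready replicas are below the desired count.

    kubectl_json is the text printed by:
        kubectl -n prod get deployments -o json
    """
    data = json.loads(kubectl_json)
    unready = []
    for item in data.get("items", []):
        name = item["metadata"]["name"]
        # spec.replicas defaults to 1 when it is not set explicitly.
        desired = item.get("spec", {}).get("replicas", 1)
        # status.readyReplicas is omitted while no replica is ready.
        ready = item.get("status", {}).get("readyReplicas", 0)
        if ready < desired:
            unready.append(name)
    return unready
```

An empty list means every deployment in the namespace reports as many ready replicas as it requests. It does not by itself confirm that the data on the attached disks is intact.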

Go to Confirm health of

Confirm health of

Follow the documented regular incident follow-up procedures.