Continuous integration playbook

Maintainers: DevInfra Team.
Audience: any software engineer, no prior infrastructure knowledge required.
TL;DR This document sums up what to do in various scenarios that can block the CI.

Sourcegraph’s continuous integration (CI) is what enables us to feel confident when delivering our changes to our users, and is one of the key components enabling Sourcegraph to deliver quality software. While the DevInfra team is in charge of managing the CI as a tool, it is essential for every engineer to be able to unblock themselves when there is a problem, in order to be autonomous.

This page lists common failure scenarios and provides a step-by-step guide to get the CI back to an operational state.

Prerequisites

In order to handle problems with the CI, the following elements are necessary:

Have access to the sourcegraph-ci project on Google Cloud Platform.
Ask #it-tech-ops for access if you do not have access.
Have the gcloud CLI installed.
Have the kubectl CLI installed.
Gain access to the CI cluster by authenticating against it with gcloud and kubectl.
Request access to the DevX Day2Day entitle bundle by typing /access_request in Slack.
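
As a quick sanity check before working through a scenario, the tool prerequisites above can be verified from a shell. The following is a minimal sketch, not an official script: the function name check_ci_tools is hypothetical, and it only checks that the CLIs are on your PATH, not that you are authenticated against the cluster.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given CLIs are missing from PATH.
# Prints one "missing: <tool>" line per absent tool and returns non-zero
# if at least one tool is missing.
check_ci_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage (not executed here):
#   check_ci_tools gcloud kubectl
```

If the function reports a missing tool, install it before continuing; access to the GCP project and the entitle bundle still has to be requested as described above.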

Scenarios

`buildchecker` has locked the `main` branch

Severity: major
Impact:
- No pull requests may be merged except by the authors of the last few failed builds and the @dev-infra team
- Pull request builds may be failing as well
Possible causes:
- buildchecker will lock/restrict push access to the main branch if a series of failed builds is detected - this can indicate that a regression has been merged into main or that critical build infrastructure is failing.

Actions

buildchecker will still allow the authors of the last few failed builds, as well as the @dev-infra team, to push to the main branch so as to make any changes necessary to restore the pipeline to a healthy state.

Follow the “Build has failed on the main branch” guide.
If the issue has been resolved, wait for buildchecker to unlock the branch or manually trigger a run (click “Run workflow”).

Build has failed on the `main` branch

Severity: minor
Impact: that commit won’t be deployed on k8s.sgdev.org and sourcegraph.com until a later build passes.
Possible causes:
- The main branch runs additional checks compared to Pull Request builds, so it’s possible that one of those checks failed.
  - 💡 The checks are dynamically generated by our pipeline generation tool. The main branch notably has much more exhaustive checks than other branches.
- The main branch has changes that weren’t in the Pull Request branch and those changes are causing a failure.
- The main branch is failing due to a previous build.

Actions

Check your build on Buildkite.
- Find its link directly in the #buildkite-main channel.
- 💡 Or run sg ci status in your shell, with the main branch checked out.
Search for the failing steps, and browse the logs (💡 run sg ci logs in your shell, with the main branch checked out).
- Look for a failure explanation: it can be a test that failed or a command that returned a non-zero exit code.
Check the previous builds on the main branch on Buildkite
1. Are they failing with the same exact error?
  - Yes: see the “Builds are all failing on the main branch with the same error” scenario.
  - No: see next point.
Is that a real failure or a flake?
1. Restart that step. Maybe it will fail again, but if it doesn’t it’ll save you time.
  - 💡 You can go to 3. while it runs.
2. See the “Is this a failure or a flake?” scenario.
3. Did restarting it fix the problem?
  - Yes: that’s a flake. See the “Spotted a flake” scenario.
  - No: see next point.
4. Does the failure point to a problem with the code that was shipped on that commit?
  1. Yes, and it’s a very quick fix that can get merged promptly:
    1. Write a short message on #buildkite-main and tell others that you’re fixing it.
    2. Submit the fix with another PR and get it merged as soon as possible.
  2. Yes, but it’s not easily and/or quickly fixed:
    1. Revert the offending Pull Request.
    2. Open a GitHub issue mentioning the build and the context to explain to the team owning that test what happened.
    3. Check out the PR branch.
    4. Rebase it so it includes the changes that broke it when merged in the main branch.
    5. Create a build using sg ci build main-dry-run in order to get the CI to run the exact same checks it does on the main branch.
  3. No, but it seems to fail in a step or code owned by another team:
    1. Reach out to a member of the team responsible for that test.
    2. Follow option 1 or 2 from the previous points.
5. No, and there is suspicion of a flake.
  - That’s a flake. See the “Spotted a flake” scenario.
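
The last steps of the revert path above (check out the PR branch, rebase it, and create a main-dry-run build) could be sketched as follows. This is an illustration under assumptions: the helper names run and rerun_as_main are hypothetical, the remote is assumed to be called origin, and DRY_RUN=1 is a convention of this sketch that only prints the commands instead of running them. Only sg ci build main-dry-run comes from this page.

```shell
#!/bin/sh
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: rebase a PR branch onto the latest main, then ask the
# CI for a build that runs the same checks as the main branch.
# Usage: rerun_as_main <pr-branch>
rerun_as_main() {
  branch="$1"
  run git fetch origin main &&
  run git checkout "$branch" &&
  run git rebase origin/main &&
  run sg ci build main-dry-run
}
```

Running it with DRY_RUN=1 first shows the exact commands before anything touches your working tree.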

Builds are all failing on the `main` branch with the same error

Severity: major
Impact: no commits are being deployed on DogFood and sourcegraph.com until the problem is resolved. Cutting a release is impossible.
Possible causes:
- A previous Pull Request introduced a change that causes a test to fail.
- A previous Pull Request introduced a change that modified state in an unexpected way and broke the CI.
- An external dependency is not available anymore and is causing builds to fail.
- Some rate limiting API is throttling us and causing builds to fail.

Actions

Identify the error that the recent builds on Buildkite have in common.
Find the build where the problem appeared for the first time.
- 💡 Often it’s the first build that became red, but check that the error is the same to be sure.
Is this an external failure or an internal one?
- 💡 External failures are about downloading a dependency, like a package in a script or in a Dockerfile. Often they’ll manifest in the form of an HTTP error.
- 💡 If unsure, ask for help on #dev-chat.
- Yes, it’s an external failure:
  1. See the SSH into an agent scenario
  2. Try to reproduce the faulty HTTP request so you can observe what the problem is. Is it the same failure?
    - Yes: Do you know how to fix it? If not, escalate by creating an incident (/incident on Slack).
    - No: escalate by creating an incident (/incident on Slack).
- No, it’s an internal failure:
  1. Does it involve a faulty build environment in the agents? (a given tool is not found where it should have been present, or has an incorrect version)
    - See the SSH into an agent scenario
  2. Try to find an agent that recently successfully ran the faulty step (look for a green build on the main branch)
    1. Can you see a difference? If yes, take note.
  3. Do you know how to fix it?
    - Yes: apply the fix.
    - No: escalate by creating an incident (/incident on Slack).

Builds are failing on the `main` branch with different errors

Severity: major
Impact: no commits are being deployed on DogFood and sourcegraph.com until the problem is resolved. Cutting a release is impossible.
Possible causes:
- A previous Pull Request introduced a change that causes a test to fail.
- An external dependency is not available anymore and is causing builds to fail under certain conditions.
- Some rate limiting API is throttling us and causing builds to fail.

Actions

Escalate by creating an incident (/incident on Slack).
Get some help by pinging @dev-infra-support on Slack in the #buildkite-main or #discuss-dev-infra channels.

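Builds are all failing in my branch, on Bazel jobs, with many timeouts or cache/disk related errors or container errors.

Severity: major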
Impact: no commits are being deployed on DogFood and sourcegraph.com until the problem is resolved. Cutting a release is impossible.
Possible causes:
- A previous Pull Request introduced a change that causes a test to fail. If that’s the case you should see the problem on the main build corresponding to the commit you branched out from.
- A previous Pull Request introduced a change that modified state in an unexpected way and broke the CI. If that’s the case you should see the problem on the main build corresponding to the commit you branched out from.
- A previous build did not properly tear down containers used in e2e test suites.
- Agents are in a corrupted state due to a previous build.
- Agents ran out of disk space.

Actions

Escalate by creating an incident (/incident on Slack).
Get some help by pinging @dev-infra-support on Slack in the #buildkite-main or #discuss-dev-infra channels.
Request access to the DevX Day2Day entitle bundle by typing /access_request in Slack.
Restart the agents by scaling the corresponding deployment to 0 then to 2 again.
- kubectl scale --replicas=0 -n buildkite-bazel deployments/buildkite-agent-bazel
- Observe the pods count going down.
- kubectl scale --replicas=2 -n buildkite-bazel deployments/buildkite-agent-bazel
- The agent autoscaler will adjust the final replicas count on its own.
If you saw cache-related errors in the job logs, restart the remote-cache by scaling the corresponding deployment to 0 then to 1 again.
- kubectl scale --replicas=0 -n buildkite-bazel deployments/ci-bazel-remote-cache
- Observe the pods count going down.
- kubectl scale --replicas=1 -n buildkite-bazel deployments/ci-bazel-remote-cache
- Do not scale it above 1 instance: it uses a persistent disk that can only be accessed by a single instance.
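
The two restart procedures above follow the same pattern (scale to 0, look at the pods, scale back up), so they can be sketched as one small helper. This is a sketch under assumptions, not a supported script: the function names and the DRY_RUN=1 convention are hypothetical, while the namespace, deployment names and replica counts are the ones listed in the steps above. The pod listing is a one-shot call, so you may need to repeat it until the pods are gone.

```shell
#!/bin/sh
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

# Scale a deployment to 0, list the pods once, then scale it back up.
# Usage: restart_deployment <namespace> <deployment> <replicas>
restart_deployment() {
  namespace="$1"
  deployment="$2"
  replicas="$3"
  run kubectl scale --replicas=0 -n "$namespace" "deployments/$deployment" &&
  run kubectl get pods -n "$namespace" &&
  run kubectl scale --replicas="$replicas" -n "$namespace" "deployments/$deployment"
}

# Bazel agents: the autoscaler adjusts the final replica count afterwards.
restart_agents() {
  restart_deployment buildkite-bazel buildkite-agent-bazel 2
}

# Remote cache: never more than 1 replica (single-writer persistent disk).
restart_remote_cache() {
  restart_deployment buildkite-bazel ci-bazel-remote-cache 1
}
```

Keeping the replica count as an explicit argument makes the remote cache limit of 1 visible at the call site rather than buried in the command.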

Spotted a flake

Severity: minor
Impact: Some builds will fail randomly, creating noise and slowing down the engineering team
Possible causes:
- Tests relying on timing.
- Race conditions.
- End to end tests are delicate by nature and can fail randomly due to the complexity of all involved components.

Actions

What kind of step is failing?

Is this an end-to-end test?
- 💡 E2E tests are fragile by nature, there is no way around it.
- Take note.
Is this a Docker image build step?
- 💡 This should really not be happening.
- Is the error about the Docker daemon?
  - Yes, this is a CI infrastructure flake. Ping @dev-infra-support on Slack in the #buildkite-main or #discuss-dev-infra channels.
  - No: reach out to the team owning that Docker image immediately.
Anything else
- Take note of the failing step and go to next point.

Is that flake related to the CI infrastructure?

The CI infrastructure often involves:
- Docker daemon not being reachable.
- Missing tools that we use to run the steps, such as go, node, comby, …
- Errors from asdf, which is used to manage the above tools.
Yes: ping @dev-infra-support on Slack in the #buildkite-main or #discuss-dev-infra channels.
- If nobody is online to help:
  - Reach out for help in #dev-chat

Is that flake related to the code?

See the process described in the flaky tests page.

Is this a failure or a flake?

Severity: minor
Impact: Some builds will fail randomly, creating noise and slowing down the engineering team
Possible causes:
- Tests relying on timing.
- Race conditions.
- End to end tests are delicate by nature and can fail randomly due to the complexity of all involved components.

Actions

Immediately restart the faulty step.
- 💡 It will save you time while you’re looking at the logs.
- Is the step passing now?
  - Yes: See the “Spotted a flake” scenario.
  - No: Give it another try, and see next point.
Check on Grafana if there are any occurrences of the failures that were previously observed:
- Go to the “Explore” section.
- Make sure to select grafanacloud-sourcegraph-logs in the dropdown at the top of the page.
- Scope the time window to 7 days to make sure to find previous occurrences, if there are any.
- Enter a query such as `{app="buildkite"} |= "your error message"`, where “your error message” is a string that approximately identifies the failure cause observed in the failing step.
Is there a build that failed exactly like this?
- Yes:
  1. 💡 Double-check that you’re looking at the same step by inspecting the labels of the message (click on the line to make them visible).
  2. That’s a flake. See the “Spotted a flake” scenario.
- No: it’s not a flake, reach out to the team owning those tests.

You can also refer to the Loom walkthrough “how to find out if a CI failure is a recurring flake”.
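
Error messages copied from job logs often contain double quotes, which would break the Grafana query shown in the steps above if pasted as-is. The following is a minimal sketch of building that query safely; the function name buildkite_logql is hypothetical, and it assumes the log stream label is app="buildkite" as in the example query.

```shell
#!/bin/sh
# Hypothetical helper: build a LogQL line-filter query for Buildkite logs.
# Backslashes and double quotes in the message are escaped so the result
# remains a valid double-quoted LogQL string.
buildkite_logql() {
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{app="buildkite"} |= "%s"\n' "$escaped"
}

# Example usage (not executed here):
#   buildkite_logql 'dial tcp: connection refused'
```

Paste the printed query into the “Explore” query field. A short, distinctive fragment of the error usually matches better than the full line, since timestamps and IDs differ between builds.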

Builds are not being created on Buildkite

Severity: major
Impact: It’s possible to merge a PR without going through CI. No builds are produced and it’s impossible to deploy the new commits.
Possible causes:
- GitHub is experiencing some outage that is affecting webhooks.
- Buildkite is experiencing some outage.
- Webhooks that trigger the builds have been deleted.

Actions

Inspect the webhook status in the sourcegraph/sourcegraph repository settings.
If you’re not authorized to see this page, ping @dev-infra-support or escalate to @github-owners.
Check the status of the webhook: if it’s not green, something is wrong. However, a green status is no guarantee that the webhook is operating as usual. If GitHub Webhooks is experiencing degraded performance, it might no longer be emitting events to the endpoint at all, and the green status may simply reflect the last delivery before the outage started. See the next step to verify the status of Webhooks.
Check GitHub Status
Check Buildkite Status
A possible way to mitigate a GitHub outage is to recreate the webhook.
Delete the old buildkite webhook.
Create a new one by following these instructions.

SSH into an agent

Severity: none
Impact: none (unless a destructive action is performed)
Possible cause:
- Need to investigate a problem and suspect the agent is at fault

Actions

Identify whether you want to look at a Bazel agent or a stateless one. Bazel agents are under the buildkite-bazel namespace, and stateless agents are under the buildkite namespace.
Request access to the DevX Day2Day entitle bundle by typing /access_request in Slack.
Find the pod you want to SSH into with one of the following methods:
1. Use `kubectl get pods -n $NAMESPACE -w` to observe the currently running agents and get the pod name (k9s works here too).
2. From a Buildkite build page, click the “Timeline” tab of a job and see the entry for “Accepted Job”. The “Host name” in the entry is also the name of the pod that the job was assigned to.
Use `kubectl exec -n $NAMESPACE -it buildkite-agent-xxxxxxxxxx-yyyyy -- bash` to open a shell on the Buildkite agent.
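
The namespace choice and the exec command above can be combined into one helper so that the agent kind is the only thing to remember. This is a sketch under assumptions: the function names, the agent kind keywords bazel and stateless, and the DRY_RUN=1 convention are hypothetical, and the pod name is whatever you found in the previous step.

```shell
#!/bin/sh
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

# Map an agent kind to its Kubernetes namespace, as described above.
agent_namespace() {
  case "$1" in
    bazel) echo "buildkite-bazel" ;;
    stateless) echo "buildkite" ;;
    *) echo "unknown agent kind: $1" >&2; return 1 ;;
  esac
}

# Open a shell on an agent pod.
# Usage: agent_shell <bazel|stateless> <pod-name>
agent_shell() {
  namespace=$(agent_namespace "$1") || return 1
  run kubectl exec -n "$namespace" -it "$2" -- bash
}
```

Rejecting unknown agent kinds up front avoids running kubectl exec against the wrong namespace and getting a confusing “pod not found” error.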

Replacing Agents

Severity: minor
Impact: May fail ongoing builds, but that’s fine.
Possible causes:
- A newer version of the agents needs to be deployed.

Actions

Refer to the instructions here to remove currently deployed agents. The buildkite-job-dispatcher will deploy jobs with any updated config.

Agent availability issues

Severity: major
Impact: Builds stuck in “waiting for agent”
Possible cause:
- Agent dispatch malfunction or GCP infrastructure outage

Actions

Check dispatcher dashboard for health metrics
Check dispatched agents for availability issues
Check dispatcher logs for details

For more details, see the source: buildkite-job-dispatcher
