Continuous integration infrastructure

This page consolidates resources regarding our CI infrastructure, namely our fleet of Buildkite agents. This infrastructure is maintained by the DevInfra team.

Related resources:

Buildkite agents

We maintain a shared fleet of Buildkite agents for continuous integration across all repositories.

Buildkite agent queues

We have several different types of agents available. We recommend explicitly declaring which type of agent your jobs should run on by setting the agents: { queue: "standard" } field in your pipeline configuration.
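
As a rough illustration of the recommendation above, the sketch below builds a single command step that targets an explicit queue and prints it as JSON, which Buildkite accepts for pipeline definitions alongside YAML. The helper name `command_step`, the step label, and the script path are hypothetical and only serve the example.

```python
import json


def command_step(label, command, queue="standard"):
    """Build a Buildkite command step that targets an explicit agent queue."""
    return {
        "label": label,
        "command": command,
        # Explicitly declare the queue instead of relying on a default.
        "agents": {"queue": queue},
    }


# Hypothetical step: the label and script path are placeholders.
pipeline = {"steps": [command_step("Unit tests", "./run-tests.sh")]}

if __name__ == "__main__":
    print(json.dumps(pipeline, indent=2))
```

A generated definition like this could be piped to `buildkite-agent pipeline upload`, assuming your pipeline is set up for dynamic uploads.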

The currently available queues:

  • standard: our default Buildkite agents. These are stateless and currently run as Docker-in-Docker agents in Kubernetes.
    • Use these for any non-Bazel task: because the agents are stateless, a state leak can’t affect later builds by design.
  • bazel: our Bazel Buildkite agents. These are stateful and currently run as Docker-in-Docker agents in Kubernetes.
    • Use these for any Bazel task: Bazel guarantees hermeticity, meaning that a given build won’t affect subsequent builds on the same agent.
  • macos: a stateful agent currently backed by a single host running macOS. GCP does not provide macOS instances, which is why the host for this agent lives in the AWS Ohio region (us-east-2).
  • vagrant: special Buildkite agents designed to run resource-intensive tests on Docker deployments.
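
The guidance in this list can be summarised as a small selection helper. This is only a sketch: the function name, its flags, and the order in which the checks are applied are assumptions made for illustration, not part of our tooling.

```python
# Queue names as listed above.
QUEUES = ("standard", "bazel", "macos", "vagrant")


def choose_queue(uses_bazel=False, needs_macos=False, needs_vagrant=False):
    """Pick an agent queue for a job based on what the job needs.

    The precedence (macos, then vagrant, then bazel) is an assumption
    for this example; adjust it to match how your jobs are structured.
    """
    if needs_macos:
        return "macos"
    if needs_vagrant:
        return "vagrant"
    if uses_bazel:
        return "bazel"
    # Default: stateless agents for any non-Bazel task.
    return "standard"


if __name__ == "__main__":
    print(choose_queue(uses_bazel=True))
```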

buildkite-job-dispatcher

Our Buildkite agents are stateless and are deployed in batches as Kubernetes Jobs. Each batch is sized according to the Buildkite backlog, and each agent runs its workload and then exits. This is managed by the buildkite-job-dispatcher.
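
To make the sizing idea concrete, here is a minimal sketch of how a batch size could be derived from the backlog. It assumes one job per agent and a hypothetical `max_agents` cap; the actual dispatcher may weigh these inputs differently.

```python
def required_agents(queued_jobs, total_agents, max_agents=50):
    """Return how many new agents to dispatch for a queue.

    Assumptions (not taken from the dispatcher itself): each agent runs
    one job, and max_agents is a hypothetical upper bound on the fleet.
    """
    if queued_jobs < 0 or total_agents < 0:
        raise ValueError("counts must be non-negative")
    # Agents still missing to cover the backlog.
    shortfall = queued_jobs - total_agents
    # Room left before hitting the cap (may be negative if over the cap).
    headroom = max_agents - total_agents
    return max(0, min(shortfall, headroom))


if __name__ == "__main__":
    print(required_agents(queued_jobs=10, total_agents=4))
```

Clamping at zero means a queue with more agents than queued jobs dispatches nothing rather than a negative batch.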

Another potentially fragile component of this system is buildkite-git-references, a cron job and a set of GCP disks that speed up pipelines by reducing the amount of cloning required.

Relevant runbooks:

A diagram overview of how the buildkite-job-dispatcher works (diagram adapted from here):

sequenceDiagram
    participant ba as buildkite-job-dispatcher
    participant k8s as CI Kubernetes cluster
    participant bk as Buildkite.com
    participant gh as GitHub.com

    loop
      gh->>bk: enqueue jobs
      activate bk

      ba->>bk: list queued jobs and total agents
      bk-->>ba: queued jobs, total agents

      activate ba
      ba->>ba: determine required agents
      alt queue needs agents
        ba->>k8s: get template Job
        activate k8s
        k8s-->>ba: template Job
        deactivate k8s

        ba->>k8s: get buildkite-git-references volume
        activate k8s
        k8s-->>ba: volume
        deactivate k8s

        ba->>ba: modify Job template

        ba->>k8s: dispatch new Job
        activate k8s
        k8s->>bk: register agents
        bk-->>k8s: assign jobs to agents

        loop while % of Pods not online or completed
          par deployed agents process jobs
            k8s-->>bk: report completed jobs
            bk-->>gh: report pipeline status
            deactivate bk
          and check previous dispatch
            ba->>k8s: list Pods from dispatched Job
            k8s-->>ba: Pods states
          end
        end
      end
      deactivate ba

      k8s->>k8s: Clean up completed Jobs

      deactivate k8s
    end
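
The inner loop of the diagram waits until a share of the Pods from the previous dispatch are online or completed. A minimal sketch of that check is shown below. It assumes Pod states are reported as Kubernetes Pod phases, treats `Failed` as completed, and uses a hypothetical threshold of 90%; none of these details are confirmed by the dispatcher itself.

```python
from collections import Counter

# Kubernetes Pod phases treated here as "online or completed" (an assumption).
ONLINE_OR_DONE = {"Running", "Succeeded", "Failed"}


def dispatch_settled(pod_phases, threshold=0.9):
    """Return True once enough Pods of a dispatched Job are online or done.

    pod_phases is a list of phase strings, one per Pod. The threshold is
    a hypothetical value chosen for this example.
    """
    if not pod_phases:
        # Nothing scheduled yet, so the dispatch cannot be considered settled.
        return False
    counts = Counter(pod_phases)
    settled = sum(counts[phase] for phase in ONLINE_OR_DONE)
    return settled / len(pod_phases) >= threshold


if __name__ == "__main__":
    print(dispatch_settled(["Running", "Running", "Pending"]))
```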