How gitserver works

- Purpose
- Overview
  - Scheduling repository updates
- Miscellaneous

Purpose

Sourcegraph mirrors repositories from code hosts. Code hosts may be SaaS products, such as GitHub or AWS CodeCommit, or local installations that are private to a customer’s environment. The gitserver service is responsible for providing HTTP-based access to (and management of) all repositories that are made accessible via configured code hosts.

Overview

Each gitserver instance exposes an HTTP server as its primary interface. This interface allows clients to clone a repository onto a particular instance, and then direct subsequent Git commands to that repository on that instance.

Generic Git commands are processed using an exec endpoint, which avoids having to implement every possible Git operation in the HTTP server. Sourcegraph-specific commands and queries, such as whether a repository is cloneable, are routed through purpose-built HTTP handlers.
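As a rough illustration of the exec approach, the sketch below accepts a repository name and a list of Git arguments, then builds a single git invocation from them. The request shape (execRequest), the on-disk root (/data/repos), and the function names are assumptions made for this example, not gitserver's actual API.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"os/exec"
	"path/filepath"
	"strings"
)

// execRequest is a hypothetical request body: which repository to target
// and which arguments to pass to git.
type execRequest struct {
	Repo string   `json:"repo"`
	Args []string `json:"args"`
}

// reposRoot is an assumed on-disk location for cloned repositories.
const reposRoot = "/data/repos"

// repoDir maps a repository name to its directory under root.
func repoDir(root, name string) string {
	return filepath.Join(root, filepath.FromSlash(name), ".git")
}

// gitCommand validates the request and builds the git invocation
// without running it.
func gitCommand(root string, req execRequest) (*exec.Cmd, error) {
	if req.Repo == "" || strings.Contains(req.Repo, "..") {
		return nil, errors.New("invalid repository name")
	}
	if len(req.Args) == 0 {
		return nil, errors.New("no git arguments given")
	}
	cmd := exec.Command("git", req.Args...)
	cmd.Dir = repoDir(root, req.Repo)
	return cmd, nil
}

// handleExec decodes a request, runs the command, and returns its stdout.
// It could be registered with http.HandleFunc("/exec", handleExec).
func handleExec(w http.ResponseWriter, r *http.Request) {
	var req execRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	cmd, err := gitCommand(reposRoot, req)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	out, err := cmd.Output()
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Write(out)
}

func main() {
	// Build (but do not run) the command for a sample request.
	cmd, err := gitCommand(reposRoot, execRequest{
		Repo: "github.com/example/repo",
		Args: []string{"rev-parse", "HEAD"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cmd.Dir, cmd.Args)
}
```

One generic handler of this kind covers any Git subcommand, which is why the server does not need a dedicated handler per operation. A real service would presumably also restrict which subcommands and flags it accepts.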

The service supports several version control systems (VCS) that are compatible with Git. Because the steps differ between systems (for example, cloning a Git repository differs from cloning a Perforce repository), these implementation details are abstracted behind a VCSSyncer interface.
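The abstraction might look something like the following sketch. The method set is a simplified, hypothetical stand-in for the real VCSSyncer interface, and the argument lists (especially the git p4 one) are illustrative rather than the commands gitserver actually runs.

```go
package main

import (
	"fmt"
	"strings"
)

// VCSSyncer is a simplified, hypothetical version of the interface:
// each implementation knows how to clone one kind of repository.
type VCSSyncer interface {
	// Type names the version control system.
	Type() string
	// CloneCommand returns the command line that clones remote into dir.
	CloneCommand(remote, dir string) []string
}

type gitSyncer struct{}

func (gitSyncer) Type() string { return "git" }

func (gitSyncer) CloneCommand(remote, dir string) []string {
	return []string{"git", "clone", "--mirror", remote, dir}
}

type perforceSyncer struct{}

func (perforceSyncer) Type() string { return "perforce" }

// For Perforce, remote is a depot path. The flags are an assumption.
func (perforceSyncer) CloneCommand(remote, dir string) []string {
	return []string{"git", "p4", "clone", "--bare", "--destination", dir, remote}
}

// syncerFor picks an implementation by VCS type.
func syncerFor(kind string) (VCSSyncer, error) {
	switch kind {
	case "git":
		return gitSyncer{}, nil
	case "perforce":
		return perforceSyncer{}, nil
	default:
		return nil, fmt.Errorf("unsupported VCS type %q", kind)
	}
}

func main() {
	for _, kind := range []string{"git", "perforce"} {
		s, err := syncerFor(kind)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		cmd := s.CloneCommand("REMOTE", "/data/repos/example/.git")
		fmt.Printf("%s: %s\n", s.Type(), strings.Join(cmd, " "))
	}
}
```

Callers only depend on the interface, so supporting another system means adding one more implementation rather than touching the clone logic itself.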

Scheduling repository updates

One repository-related responsibility not handled by gitserver is the scheduling of syncs and updates. To clone repositories, clients interact with repo-updater; that service provides a priority queue and schedulers that determine when clones and fetches will take place. Other Git commands are sent directly from clients to gitserver.

See “How repo-updater works: Overview” for more information.

Miscellaneous

Production instances

There are currently 10 gitserver instances in production on sourcegraph.com.

At the moment, we shard repositories across gitserver instances using a modular hashing strategy based on the repository name. This is the responsibility of the gitserver client.

A modular hashing strategy has two important implications: each repository resides on exactly one gitserver instance, and a substantial number of repositories must be relocated to a different instance whenever the membership list changes.
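Both implications can be seen in a small sketch of modular hashing. It assumes an FNV-1a hash and instance names of the form gitserver-N purely for illustration; the hash function and addressing used in production are not described here.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a repository name to exactly one instance address:
// hash the name, then take it modulo the number of instances.
func shardFor(repo string, addrs []string) string {
	if len(addrs) == 0 {
		return ""
	}
	h := fnv.New32a() // assumed hash function, chosen for brevity
	h.Write([]byte(repo))
	return addrs[h.Sum32()%uint32(len(addrs))]
}

// instances returns n hypothetical instance addresses.
func instances(n int) []string {
	addrs := make([]string, n)
	for i := range addrs {
		addrs[i] = fmt.Sprintf("gitserver-%d", i)
	}
	return addrs
}

// sampleRepos returns n made-up repository names.
func sampleRepos(n int) []string {
	repos := make([]string, n)
	for i := range repos {
		repos[i] = fmt.Sprintf("github.com/example/repo-%d", i)
	}
	return repos
}

// countMoved reports how many repositories map to a different instance
// after the membership list changes from before to after.
func countMoved(repos, before, after []string) int {
	moved := 0
	for _, r := range repos {
		if shardFor(r, before) != shardFor(r, after) {
			moved++
		}
	}
	return moved
}

func main() {
	repos := sampleRepos(1000)
	moved := countMoved(repos, instances(10), instances(11))
	fmt.Printf("%d of %d repositories move when growing from 10 to 11 instances\n",
		moved, len(repos))
}
```

With a uniform hash, a repository keeps its instance only when its hash gives the same remainder modulo 10 and modulo 11, which happens for roughly 1 in 11 names. Adding a single instance therefore relocates most repositories.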

In the future, we will shift to a consistent hashing strategy that will provide high availability for repositories and minimize expensive moves. That will also allow us to run a more dynamic set of gitserver instances.
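For comparison, here is a minimal sketch of one common form of consistent hashing: a hash ring with virtual nodes. The planned design is not described in this document, so the ring layout, the SHA-256 hash, the virtual node count, and all names below are assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

// ring places several virtual points per instance on a 64-bit hash ring.
type ring struct {
	points []uint64          // sorted positions on the ring
	owner  map[uint64]string // position -> instance address
}

func hashKey(s string) uint64 {
	sum := sha256.Sum256([]byte(s))
	return binary.BigEndian.Uint64(sum[:8])
}

func newRing(addrs []string, vnodes int) *ring {
	r := &ring{owner: make(map[uint64]string)}
	for _, a := range addrs {
		for v := 0; v < vnodes; v++ {
			p := hashKey(fmt.Sprintf("%s#%d", a, v))
			r.points = append(r.points, p)
			r.owner[p] = a
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// lookup returns the instance owning the first point at or after the
// repository's hash, wrapping around at the end of the ring.
func (r *ring) lookup(repo string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(repo)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

// instances returns n hypothetical instance addresses.
func instances(n int) []string {
	addrs := make([]string, n)
	for i := range addrs {
		addrs[i] = fmt.Sprintf("gitserver-%d", i)
	}
	return addrs
}

// sampleRepos returns n made-up repository names.
func sampleRepos(n int) []string {
	repos := make([]string, n)
	for i := range repos {
		repos[i] = fmt.Sprintf("github.com/example/repo-%d", i)
	}
	return repos
}

func main() {
	before, after := newRing(instances(10), 100), newRing(instances(11), 100)
	moved := 0
	repos := sampleRepos(1000)
	for _, r := range repos {
		if before.lookup(r) != after.lookup(r) {
			moved++
		}
	}
	fmt.Printf("%d of %d repositories move when growing from 10 to 11 instances\n",
		moved, len(repos))
}
```

When an instance joins the ring, the only repositories that move are those whose nearest point now belongs to the new instance, so the relocated share is roughly the new instance's share of the ring. This sketch does not cover replication, which high availability would additionally require.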

Command timeouts

There are two different command timeout checks in gitserver: the short timeout and the long timeout. Note that some commands are expected to take a considerable amount of time, especially when executed against large monorepos (some of which can be multiple TB in size).
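A timeout check of this kind is commonly built on a context deadline, as in the sketch below. The two durations, the error value, and the runGit helper are hypothetical; this document does not state the actual timeout values or which commands use which timeout.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// Assumed values for illustration only.
const (
	shortTimeout = time.Minute
	longTimeout  = time.Hour
)

// errCommandTimeout is a hypothetical sentinel for a timed-out command.
var errCommandTimeout = errors.New("git command timed out")

// runGit runs git with the given arguments in dir and kills the process
// if it is still running when the timeout expires.
func runGit(timeout time.Duration, dir string, args ...string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, "git", args...)
	cmd.Dir = dir
	out, err := cmd.Output()
	if err != nil && ctx.Err() != nil {
		// The command failed because the deadline passed.
		return out, errCommandTimeout
	}
	return out, err
}

func main() {
	// A quick command gets the short timeout; an expensive operation
	// against a large monorepo would be given longTimeout instead.
	out, err := runGit(shortTimeout, ".", "version")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(string(out))
}
```

Keeping the timeout as a parameter lets callers choose the longer limit for commands that are expected to be slow.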

Concurrency control

Repositories can be locked during sensitive operations to prevent concurrent activities from taking place, such as two clones of the same repository on the same instance. Locks operate at the directory level on a single instance, since repositories are not replicated across instances today.
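Directory-level locking on a single instance can be sketched with an in-memory map guarded by a mutex. The type and method names are hypothetical, not gitserver's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// repoLocker tracks which repository directories are currently locked
// on this instance.
type repoLocker struct {
	mu     sync.Mutex
	status map[string]string // directory -> operation holding the lock
}

func newRepoLocker() *repoLocker {
	return &repoLocker{status: make(map[string]string)}
}

// tryAcquire locks dir without blocking. It returns a release function
// and true on success, or nil and false if dir is already locked.
func (l *repoLocker) tryAcquire(dir, status string) (release func(), ok bool) {
	l.mu.Lock()
	defer l.mu.Unlock()

	if _, held := l.status[dir]; held {
		return nil, false
	}
	l.status[dir] = status

	var once sync.Once // makes release safe to call more than once
	return func() {
		once.Do(func() {
			l.mu.Lock()
			defer l.mu.Unlock()
			delete(l.status, dir)
		})
	}, true
}

func main() {
	locker := newRepoLocker()
	dir := "/data/repos/github.com/example/repo/.git"

	release, ok := locker.tryAcquire(dir, "cloning")
	fmt.Println("first clone acquired lock:", ok)

	_, ok = locker.tryAcquire(dir, "cloning")
	fmt.Println("second clone acquired lock:", ok)

	release()
	_, ok = locker.tryAcquire(dir, "fetching")
	fmt.Println("fetch after release acquired lock:", ok)
}
```

Because the map lives in one process, this only prevents concurrent operations on the same instance, which matches the single-instance scope described above.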

Cleanup tasks

Each gitserver instance performs various background cleanup tasks to ensure that repositories remain healthy, or are removed if they are found to be corrupt.

Additionally, gitserver may remove repositories if the instance’s disk is under heavy pressure. The least recently used repositories are removed first, until a sufficient amount of space has been reclaimed. This cleanup triggers only when free disk space falls below the threshold set by the SRC_REPOS_DESIRED_PERCENT_FREE environment variable (10% by default).
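The selection logic can be sketched as follows. Only the environment variable name and its 10% default come from the text above; the repo struct, the function names, and the way sizes and last-use times are obtained are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
)

// repo is a hypothetical record of a repository on disk.
type repo struct {
	Name         string
	SizeBytes    int64
	LastUsedUnix int64 // last access time, in Unix seconds
}

// desiredPercentFree reads SRC_REPOS_DESIRED_PERCENT_FREE and falls
// back to 10 when the variable is unset or invalid.
func desiredPercentFree() int {
	v, err := strconv.Atoi(os.Getenv("SRC_REPOS_DESIRED_PERCENT_FREE"))
	if err != nil || v < 0 || v > 100 {
		return 10
	}
	return v
}

// bytesToFree returns how much space must be reclaimed to reach the
// desired free percentage, or 0 if the disk is not under pressure.
func bytesToFree(diskSize, free int64, percentFree int) int64 {
	want := diskSize * int64(percentFree) / 100
	if free >= want {
		return 0
	}
	return want - free
}

// selectForRemoval picks the least recently used repositories until
// their combined size reaches howMuch bytes.
func selectForRemoval(repos []repo, howMuch int64) []string {
	sorted := append([]repo(nil), repos...) // leave the input untouched
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].LastUsedUnix < sorted[j].LastUsedUnix
	})
	var names []string
	var reclaimed int64
	for _, r := range sorted {
		if reclaimed >= howMuch {
			break
		}
		names = append(names, r.Name)
		reclaimed += r.SizeBytes
	}
	return names
}

func main() {
	repos := []repo{
		{Name: "a", SizeBytes: 30, LastUsedUnix: 300},
		{Name: "b", SizeBytes: 50, LastUsedUnix: 100},
		{Name: "c", SizeBytes: 20, LastUsedUnix: 200},
	}
	need := bytesToFree(1000, 40, desiredPercentFree())
	fmt.Println("bytes to free:", need)
	fmt.Println("remove:", selectForRemoval(repos, need))
}
```

When free space is already at or above the threshold, bytesToFree returns 0 and nothing is selected, so the eviction only runs under disk pressure.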

Useful metrics

We track a variety of metrics in gitserver that you’ll want to familiarize yourself with. For example: