Managed Instance technical documentation

architecture

Deployment type and scaling

Managed instances are Docker Compose deployments today - we do not currently offer Kubernetes (or multi-node) managed instances. See known limitations for more context on scalability of Cloud.

Environments

SOC2/CI-100

Internal instances

For each type of Managed Instances (v.1.1), Sourcegraph maintains separate test environments:

Internal instances are created for various testing purposes:

  • testing changes prior to the monthly upgrade on customer instances. upon a new release is made available, Cloud team will follow managed instances upgrade tracker (this is created prior to monthly upgrade) to proceed with upgrade process.
  • testing significant operational changes prior to applying to customer instances
  • short-lived instances for product teams to test important product changes. Notes: any teammate may request a managed instance through our request process
dev instance

This is a shared instance for the engineering organization to test unreleased versions, locally built images, or anything they would like to experiement with Managed Instances.

#cloud is responsible for the maintenance of infrastructure, including Cloud SQL and underlying VM. The team that is running the experiment is responsbile for keeping everyone updated on the experiement in #cloud and ensuring the application is working as intented. The team should consult the operation guide when interacting with the dev instance. (please backup the database and VM before doing anyting destructive)

tpgi instance

Learn more from sourcegraph/customer#958

s2 instance

This is the internal Cloud dogfood instance for the entire company. #dev-experience is responsible for rolling out nightly builds on this instance and #cloud is responsible for the maintenance of infrastructure, including Cloud SQL and underlying VM.

research.sourcegraph.com

Learn more from sourcegraph/customer#1221

Customer instances

All customer instances are considered part of the production environment and all changes applied to these customers should be well-tested in the test environment.

Upgrade process to new Sourcegraph version is also preceded with upgrading test instances - upgrade to v3.40.1.

Release process

SOC2/CI-100

Sourcegraph upgrades every test and customer instances according to SLA.

The release process is performed in steps:

  1. New version is released via release guild
  2. GitHub issue in Sourcegraph repository is open based on the managed instances upgrade template.
  3. Github issue is labeled with team/cloud and Cloud Team is automatically notified to perform Managed Instances upgrade. Label is part of the template.
  4. Cloud team performs upgrade of all instances in given order:
  • for Instances with version v1.1
    StageWorking days since releaseActionCondition not met?
    10-2Upgrade internal instances by Cloud Team (incl. demo and rctest)
    23-4Time for verification by Sourcegraph teamsNew patch created -> start from 1st stage
    35-6Upgrade: 30% trials 10% customersNew patch created -> upgrade internal in 1 working day and start from 2nd stage
    47-8Upgrade: 100% trials 40% customersNew patch created -> upgrade internal in 1 working day and start from 3rd stage
    59-10Upgrade: 100% customersNew patch created -> upgrade internal in 1 working day and start from 3rd stage

After upgrade of every single instances Uptime checks are verified. This includes automated monitoring

Sample upgrade:

Known limitations of managed instances

Sourcegraph managed instances are single-machine Docker-Compose deployments only. We do not offer Kubernetes managed instances, or multi-machine deployments, today. The main limitation of the current model is that the underlying GCP infrastructure outage could result in downtime, i.e. is it not a highly-available deployment.

Current Cloud architecture has been tested to support a workload of >100000 repositories (440GB Git storage) and 10000 simulated users on a n2-standard-32 VM.

Security

  • Isolation: Each managed instance is created in an isolated GCP project with heavy gcloud access ACLs and network ACLs for security reasons.
  • Admin access: Both the customer and Sourcegraph personnel will have access to an application-level admin account. Learn more about how we ensure secure access to your instance.
  • VM/SSH access: Only Sourcegraph personnel will have access to the actual GCP VM, this is done securely through GCP IAP TCP proxy access only. Sourcegraph personnel can make changes or provide data from the VM upon request by the customer.
  • Inbound network access: The customer may choose between having the deployment be accessible via the public internet and protected by their SSO provider, or for additional security have the deployment restricted to an allowlist of IP addresses only (such as their corporate VPN, etc.). Filtering of the IP allowlist is performed by our WAF provider, Cloudflare. Notes, in addition to the customer provided IP allowlist, traffic from well-known public code hosts (e.g. GitHub.com) is also permitted to access selected Sourcegraph endpoints to ensure functionality of certain features.
  • Outbound network access: The Sourcegraph deployment will have unfettered egress TCP/ICMP access, and customers will need to allow the Sourcegraph deployment to contact their code host. This can be done by having their code-host be publicly accessible, or by allowing the static IP of the Sourcegraph deployment to access their code host.
  • Web Application Firewall (WAF) protections: All managed instances are proxied through Cloudflare and leverage security features such as rate limiting and the Cloudflare WAF.

Access can be requested in #it-tech-ops WITH manager approval.

Monitoring and alerting

SOC2/CI-86 SOC2/CI-25

Each managed instance is created in an isolated GCP project. System performance metrics are configured and collected in scoped project. All metrics can be seen in scoped projects dashboard.

Every customer managed instance has alerts configured:

Alerting flow:

  1. When alert is triggered, it is sent to Opsgenie channel:
  1. From Opsgenie, alert is sent to on-call Cloud and Slack channels (#opsgenie, #cloud-internal).

  2. On-call Cloud engineer has to decide, what is the alert type and if incident should be opened and follow the procedure to perform the incident. On-call Cloud engineer should use managed instances operations to check, assess and repair broken managed instance.

  3. When alert is closed via incident resolution, post-mortem actions has to be assigned and performed.

Opsgenie alerts Sample managed instance incident - customer XXX is down.

Configuration management

Terraform is used to maintain all managed instances. You can find this configuration here: https://github.com/sourcegraph/deploy-sourcegraph-managed

All customer credentials, secrets, site configuration, app and user configuration—is stored in Postgres only (i.e. on the encrypted GCP disk). This allows customers to enter their access tokens, secrets, etc. directly into the app through the web UI without transferring them to us elsewhere.

Operations

Please review the Managed Instances operations guide for instructions.

Managed Instances v1.1 documentation can be found here

Managed Instances v2.0 documentation can be found here

List Trials

To list all current active Cloud trials, navigate to the Trial List GitHub Actions page and execute the workflow by clicking the Run Workflow button in the top right and filling out the prompt. The workflow may take a minute or two to execute. Refresh the page and select the latest execution, then click trial-list. The results should be printed under the list trials step.