Managed Instance technical documentation

Operations

Please review the Managed Instances operations guide for instructions.

Release process

SOC2/CI-100

Sourcegraph upgrades every test and customer instance according to the SLA.

The release process is performed in steps:

  1. A new version is released by the release team.
  2. A GitHub issue is created in the Sourcegraph Customer repository with the mi2 env create-tracking-issue -e prod $TARGET_VERSION command.
  3. The GitHub issue is labeled with team/cloud, and the Cloud Team is automatically notified to perform the Managed Instances upgrade. The label is part of the template.
  4. The Cloud Team upgrades all instances in the following order:
| Stage | Working days since release | Action | Condition not met? |
|-------|----------------------------|--------|--------------------|
| 1 | 0-2 | Upgrade internal instances by Cloud Team (incl. demo and rctest) | |
| 2 | 3-4 | Time for verification by Sourcegraph teams | New patch created -> start from 1st stage |
| 3 | 5-6 | Upgrade: 30% trials, 10% customers | New patch created -> upgrade internal in 1 working day and start from 2nd stage |
| 4 | 7-8 | Upgrade: 100% trials, 40% customers | New patch created -> upgrade internal in 1 working day and start from 3rd stage |
| 5 | 9-10 | Upgrade: 100% customers | New patch created -> upgrade internal in 1 working day and start from 3rd stage |
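
As an illustration only, the rollout schedule above can be written as a small lookup that maps the number of working days since a release to the stage that should be in progress. This is a minimal sketch, not part of the Cloud Team tooling: the names Stage, STAGES, RESTART_STAGE_ON_PATCH, and stage_for_day are hypothetical, and the day ranges and restart rules are transcribed from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    number: int
    first_day: int  # working days since release, inclusive
    last_day: int   # working days since release, inclusive
    action: str


# Transcribed from the rollout table; all names here are hypothetical.
STAGES = [
    Stage(1, 0, 2, "Upgrade internal instances (incl. demo and rctest)"),
    Stage(2, 3, 4, "Verification by Sourcegraph teams"),
    Stage(3, 5, 6, "Upgrade 30% of trials and 10% of customers"),
    Stage(4, 7, 8, "Upgrade 100% of trials and 40% of customers"),
    Stage(5, 9, 10, "Upgrade 100% of customers"),
]

# "Condition not met?" column: the stage to restart from when a new patch
# is created during the given stage. Stage 1 has no entry in the table.
RESTART_STAGE_ON_PATCH = {2: 1, 3: 2, 4: 3, 5: 3}


def stage_for_day(working_days_since_release: int) -> Optional[Stage]:
    """Return the stage whose window contains the day, or None if outside the schedule."""
    for stage in STAGES:
        if stage.first_day <= working_days_since_release <= stage.last_day:
            return stage
    return None
```

For example, stage_for_day(6) returns stage 3, and a new patch created during stage 5 means restarting from stage 3 after the internal instances are upgraded.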

After each instance is upgraded, its uptime checks are verified. This includes automated monitoring.

Sample upgrade:

Release process for patch releases

With the bi-weekly patch release schedule, the Cloud Team uses a simplified release process so that Cloud customers receive patches as soon as possible.

| Stage | Working days since release | Action |
|-------|----------------------------|--------|
| 1 | 0-2 | Patch internal instances by Cloud Team (incl. demo, clouddev and rctest) |
| 2 | 3-5 | Patch trials and customer instances. Follow 10%, 40%, 100% in each group respectively |
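
The 10%, 40%, 100% progression can be read as cumulative targets within each group (trials, customers). The sketch below shows one way to split a group into successive batches under that reading. It is an assumption-laden illustration: the function name rollout_batches is hypothetical, the instance names are placeholders, and rounding each target up to a whole instance is a choice made here, not something the process specifies.

```python
from typing import List, Sequence


def rollout_batches(
    instances: Sequence[str],
    cumulative_percentages: Sequence[int] = (10, 40, 100),
) -> List[List[str]]:
    """Split instances into batches so that, after each batch, the given
    cumulative percentage of the group has been patched.

    Targets are rounded up to a whole instance (an assumption of this sketch).
    """
    batches: List[List[str]] = []
    done = 0
    total = len(instances)
    for pct in cumulative_percentages:
        target = (total * pct + 99) // 100  # integer ceiling of total * pct / 100
        batches.append(list(instances[done:target]))
        done = max(done, target)
    return batches
```

With ten instances in a group, this yields batches of 1, 3, and 6 instances, which together cover the whole group exactly once.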

Known limitations of managed instances

Sourcegraph managed instances currently run on Kubernetes, specifically GKE.

The current Cloud architecture has been tested to support a workload of more than 100,000 repositories (440 GB of Git storage) and 10,000 simulated users on an n2-standard-32 VM.

Security

  • Isolation: Each managed instance is created in an isolated GCP project with restrictive gcloud access ACLs and network ACLs for security reasons.
  • Admin access: Both the customer and Sourcegraph personnel will have access to an application-level admin account. Learn more about how we ensure secure access to your instance.
  • VM/SSH access: Only Sourcegraph personnel will have access to the actual GCP environment. This is done securely through GCP IAP TCP proxy access only. Sourcegraph personnel can make changes or provide data from the environment upon request by the customer.
  • Inbound network access: The customer may choose between having the deployment accessible via the public internet and protected by their SSO provider or, for additional security, having the deployment restricted to an allowlist of IP addresses only (such as their corporate VPN). Filtering against the IP allowlist is performed by our WAF provider, Cloudflare. Note that, in addition to the customer-provided IP allowlist, traffic from well-known public code hosts (e.g. GitHub.com) is also permitted to access selected Sourcegraph endpoints to ensure certain features keep working.
  • Outbound network access: The Sourcegraph deployment will have unfettered egress TCP/ICMP access, and customers will need to allow the Sourcegraph deployment to contact their code host. This can be done by making their code host publicly accessible, or by allowing the static IP of the Sourcegraph deployment to access their code host.
  • Web Application Firewall (WAF) protections: All managed instances are proxied through Cloudflare and leverage security features such as rate limiting and the Cloudflare WAF.
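
To make the inbound allowlist behavior concrete, the sketch below shows the kind of check an IP allowlist implies: a request is admitted only if its source address falls inside one of the allowed addresses or CIDR ranges. This is purely illustrative. The actual filtering is performed by Cloudflare, not by code like this; the function name is_allowed is hypothetical, and the addresses come from reserved documentation ranges.

```python
import ipaddress
from typing import Iterable


def is_allowed(client_ip: str, allowlist: Iterable[str]) -> bool:
    """Return True if client_ip matches any allowlist entry.

    Entries may be single addresses ("198.51.100.7") or CIDR ranges
    ("203.0.113.0/24"). Illustrative only; not the Cloudflare rule engine.
    """
    addr = ipaddress.ip_address(client_ip)
    for entry in allowlist:
        # strict=False accepts entries with host bits set, e.g. "203.0.113.5/24".
        network = ipaddress.ip_network(entry, strict=False)
        if addr.version == network.version and addr in network:
            return True
    return False


# Hypothetical customer allowlist, e.g. a corporate VPN range plus one host.
customer_allowlist = ["203.0.113.0/24", "198.51.100.7"]
```

Here is_allowed("203.0.113.42", customer_allowlist) is true, while an address outside both entries, such as 192.0.2.1, is rejected.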

Access can be requested in #it-tech-ops WITH manager approval.

Monitoring and alerting

SOC2/CI-86 SOC2/CI-25

Each managed instance is created in an isolated GCP project, with exclusive resources.

Metrics are visible to Sourcegraph employees with access in a centralized GCP metrics scoping project. All metrics can be seen in the scoped projects dashboard.

Every customer managed instance has alerts configured:

Alerting flow:

  1. When an alert is triggered, it is sent to the Opsgenie channel.

  2. From Opsgenie, the alert is sent to the on-call Cloud engineer and to Slack channels (#opsgenie, #alerts-managed-instances).

  3. The on-call Cloud engineer decides what type of alert it is and whether an incident should be opened, then follows the incident procedure. The on-call Cloud engineer should use the generated managed instances operations to check, assess, and repair the broken managed instance.

  4. When the alert is closed via incident resolution, post-mortem actions have to be assigned and performed.
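
The alerting flow above can be summarized as a small model of an alert's lifecycle. This is a hedged sketch for orientation only: it does not call Opsgenie or Slack, and the names Alert, route, triage, and close are hypothetical. The only facts taken from the flow are the notification targets and the rule that closing an alert through incident resolution requires post-mortem actions.

```python
from dataclasses import dataclass, field
from typing import List

# Step 2: targets Opsgenie forwards the alert to.
NOTIFY_TARGETS = ["on-call Cloud engineer", "#opsgenie", "#alerts-managed-instances"]


@dataclass
class Alert:
    instance: str
    summary: str
    notified: List[str] = field(default_factory=list)
    incident_opened: bool = False
    closed: bool = False
    post_mortem_actions: List[str] = field(default_factory=list)


def route(alert: Alert) -> None:
    """Steps 1 and 2: the alert reaches Opsgenie, then on-call and Slack."""
    alert.notified = ["Opsgenie"] + NOTIFY_TARGETS


def triage(alert: Alert, open_incident: bool) -> None:
    """Step 3: the on-call engineer decides whether to open an incident."""
    alert.incident_opened = open_incident


def close(alert: Alert, post_mortem_actions: List[str]) -> None:
    """Step 4: closing via incident resolution requires post-mortem actions."""
    if alert.incident_opened and not post_mortem_actions:
        raise ValueError("an incident was opened: assign post-mortem actions before closing")
    alert.post_mortem_actions = list(post_mortem_actions)
    alert.closed = True
```

In this model, an alert that led to an incident cannot be closed with an empty list of post-mortem actions, which mirrors step 4.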

Opsgenie alerts Sample managed instance incident - customer XXX is down.

List Trials

Please visit go/cloud-ops