Managed Instance technical documentation
Please review the Managed Instances operations guide for instructions.
Sourcegraph upgrades every test and customer instance according to the SLA.
The release process is performed in steps:
- A new version is released via the release guild.
- A GitHub issue is opened in the Sourcegraph repository based on the managed instances upgrade template.
- The GitHub issue is labeled with team/cloud, and the Cloud Team is automatically notified to perform the Managed Instances upgrade. The label is part of the template.
- The Cloud Team upgrades all instances in the given order:
- for Instances with version v1.1
| Stage | Working days since release | Action | Condition not met? |
|-------|----------------------------|--------|--------------------|
| 1 | 0-2 | Upgrade internal instances by Cloud Team (incl. demo and rctest) | |
| 2 | 3-4 | Time for verification by Sourcegraph teams | New patch created -> start from 1st stage |
| 3 | 5-6 | Upgrade: 30% trials, 10% customers | New patch created -> upgrade internal in 1 working day and start from 2nd stage |
| 4 | 7-8 | Upgrade: 100% trials, 40% customers | New patch created -> upgrade internal in 1 working day and start from 3rd stage |
| 5 | 9-10 | Upgrade: 100% customers | New patch created -> upgrade internal in 1 working day and start from 3rd stage |
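The staged schedule above can be sketched as a small lookup. This is a minimal illustration, not the Cloud Team's actual tooling: the names `Stage`, `stage_for_day`, `target_count` and `restart_stage` are hypothetical, the percentages are assumed to be cumulative shares of all instances, and rounding up to whole instances is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    number: int
    first_day: int      # working days since release, inclusive
    last_day: int       # working days since release, inclusive
    trials_pct: int     # assumed cumulative share of trial instances upgraded
    customers_pct: int  # assumed cumulative share of customer instances upgraded


# Rows of the rollout table. Stages 1 and 2 touch no trial or customer instances.
STAGES = (
    Stage(1, 0, 2, 0, 0),
    Stage(2, 3, 4, 0, 0),
    Stage(3, 5, 6, 30, 10),
    Stage(4, 7, 8, 100, 40),
    Stage(5, 9, 10, 100, 100),
)

# "Condition not met?" column: the stage to restart from when a new patch is created.
RESTART_STAGE = {2: 1, 3: 2, 4: 3, 5: 3}


def stage_for_day(working_days: int) -> Stage:
    """Return the rollout stage that covers the given working day."""
    for stage in STAGES:
        if stage.first_day <= working_days <= stage.last_day:
            return stage
    raise ValueError(f"no rollout stage covers working day {working_days}")


def target_count(total_instances: int, pct: int) -> int:
    """Number of instances to have upgraded, rounded up (an assumption)."""
    return (total_instances * pct + 99) // 100


def restart_stage(current_stage: int) -> int:
    """Stage to resume from after a new patch; stage 1 has no such condition."""
    return RESTART_STAGE.get(current_stage, current_stage)
```

For example, with 25 customer instances on working day 5, `stage_for_day(5)` is stage 3 and `target_count(25, 10)` gives 3 instances to upgrade.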
After each instance is upgraded, uptime checks are verified. This includes automated monitoring
- tracking issue - 3.40.1.
- GitHub Pull Requests for 3.40.1 upgrade
Release process for patch releases
With a bi-weekly patch release schedule, the Cloud Team uses a simplified release process to ensure Cloud customers can obtain a patch as soon as possible.
Known limitations of managed instances
Sourcegraph managed instances currently run on Kubernetes, specifically GKE.
Current Cloud architecture has been tested to support a workload of >100000 repositories (440GB Git storage) and 10000 simulated users on a
- Isolation: Each managed instance is created in an isolated GCP project with heavy gcloud access ACLs and network ACLs for security reasons.
- Admin access: Both the customer and Sourcegraph personnel will have access to an application-level admin account. Learn more about how we ensure secure access to your instance.
- VM/SSH access: Only Sourcegraph personnel will have access to the actual GCP environment; this is done securely through GCP IAP TCP proxy access only. Sourcegraph personnel can make changes or provide data from the environment upon request by the customer.
- Inbound network access: The customer may choose between having the deployment be accessible via the public internet and protected by their SSO provider, or for additional security have the deployment restricted to an allowlist of IP addresses only (such as their corporate VPN, etc.). Filtering of the IP allowlist is performed by our WAF provider, Cloudflare. Note: in addition to the customer-provided IP allowlist, traffic from well-known public code hosts (e.g. GitHub.com) is also permitted to access selected Sourcegraph endpoints to ensure functionality of certain features.
- Outbound network access: The Sourcegraph deployment will have unfettered egress TCP/ICMP access, and customers will need to allow the Sourcegraph deployment to contact their code host. This can be done by making their code host publicly accessible, or by allowing the static IP of the Sourcegraph deployment to access their code host.
- Web Application Firewall (WAF) protections: All managed instances are proxied through Cloudflare and leverage security features such as rate limiting and the Cloudflare WAF.
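To make the inbound allowlist behaviour concrete, here is a minimal sketch of an IP allowlist check using Python's standard `ipaddress` module. It only illustrates the idea: the real filtering is performed by Cloudflare, and the function name `is_allowed` and the example CIDR ranges (taken from the reserved documentation address blocks) are assumptions made for illustration.

```python
import ipaddress
from typing import Iterable


def is_allowed(client_ip: str, allowlist: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any CIDR range in the allowlist."""
    addr = ipaddress.ip_address(client_ip)
    for cidr in allowlist:
        # strict=False accepts entries such as "203.0.113.7/24" with host bits set.
        network = ipaddress.ip_network(cidr, strict=False)
        # Membership is False when the IP versions differ (IPv4 vs IPv6).
        if addr in network:
            return True
    return False


# Hypothetical customer allowlist: a corporate VPN range and a single egress IP.
EXAMPLE_ALLOWLIST = ["203.0.113.0/24", "198.51.100.7/32"]
```

A request from `203.0.113.42` would pass this check, while one from `192.0.2.1` would not. Traffic from public code hosts, which the deployment also permits for selected endpoints, is outside the scope of this sketch.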
Access can be requested in #it-tech-ops WITH manager approval.
Monitoring and alerting
Each managed instance is created in an isolated GCP project. System performance metrics are configured and collected in a scoped project. All metrics can be seen in the scoped project's dashboard.
Every customer managed instance has alerts configured:
- a cloud provider-managed uptime check, configured in the dedicated GCP managed instance project
- instance performance metrics alerts, configured in the scoped project for all managed instances; every v2.0 instance is added via code
- additional v2.0 infrastructure performance metrics, configured per instance
- application performance metrics - based on application log events
When alert is triggered, it is sent to Opsgenie channel:
From Opsgenie, alert is sent to on-call Cloud and Slack channels (
The on-call Cloud engineer has to determine the alert type, decide whether an incident should be opened, and follow the incident procedure. The on-call Cloud engineer should use the managed instances operations guide to check, assess and repair a broken managed instance.
When an alert is closed via incident resolution, post-mortem actions have to be assigned and performed.
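The alert lifecycle described above can be modelled roughly as follows. This is a sketch under stated assumptions, not a description of the Opsgenie configuration: the `Alert` class, the `notification_targets` and `post_mortem_complete` helpers, and the idea of tracking post-mortem actions as a name-to-done mapping are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Alert:
    instance: str
    summary: str
    incident_opened: bool = False
    # Hypothetical tracking: action description -> whether it has been performed.
    post_mortem_actions: Dict[str, bool] = field(default_factory=dict)


def notification_targets() -> List[str]:
    """Where Opsgenie forwards an alert, per the text: on-call Cloud and Slack."""
    return ["on-call Cloud", "Slack"]


def post_mortem_complete(alert: Alert) -> bool:
    """True once follow-up obligations are met.

    Alerts that never became incidents need no post-mortem. Alerts closed
    via an incident need at least one assigned action, and all assigned
    actions performed (the "at least one" rule is an assumption).
    """
    if not alert.incident_opened:
        return True
    actions = alert.post_mortem_actions
    return len(actions) > 0 and all(actions.values())
```

Keeping the incident decision as an explicit flag mirrors the text, where the on-call engineer, not the alert itself, decides whether an incident is opened.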
- Opsgenie alerts
- Sample managed instance incident - customer XXX is down.
Please visit go/cloud-ops