Cloud

The Cloud team is a special focus team reporting directly to the CEO, modeled on the question “if AWS were to offer ‘Managed Sourcegraph’ like they do Elasticsearch, Redis, PostgreSQL, etc., how would they do it?” The team is responsible for maintaining existing managed instances and building the next generation of them. The Cloud team has no other responsibilities.

Members

Mission statement

Build a fully managed platform for using Sourcegraph that can (by EOFY23) support 200+ customers using dedicated Sourcegraph instances, providing feature compatibility with self-hosted while being cost-efficient for customers and Sourcegraph.

Fully managed

  • Observability allowing Sourcegraph to react before user impact is noticed, while respecting user privacy
  • Frequent, invisible Sourcegraph upgrades
  • Invisible infrastructure updates
  • Zero infrastructure access for customers

Platform

  • Low customer onboarding cost
  • Zero customer maintenance cost
  • Secure (SOC 2, documented security posture)
  • Reliable (ability to offer SLA, internal SLO of 99.9%)
  • Automatable (in due time, feature releases / billing / upgrades / analytics are built-in)
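
To make the 99.9% internal SLO concrete, the sketch below converts an availability target into a downtime budget. The 30-day measurement window is an assumption for illustration only; this page does not state how the SLO window is defined, and the function name is hypothetical.

```python
from datetime import timedelta


def downtime_budget(slo: float, window: timedelta) -> timedelta:
    """Return the maximum downtime an availability SLO allows over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    # The error budget is the fraction of the window that may be unavailable.
    return window * (1.0 - slo)


if __name__ == "__main__":
    # Assumed window: 30 days (not specified on this page).
    budget = downtime_budget(0.999, timedelta(days=30))
    print(f"{budget.total_seconds() / 60:.1f} minutes")  # 43.2 minutes
```

Under that assumption, a 99.9% SLO leaves roughly 43 minutes of downtime per 30 days, which is the margin that upgrades and infrastructure updates have to fit into.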

Support 200+ customers

  • Targeting 200+ customers in to invest in supporting 1000 in
    • Support 300 production-grade instances (accommodating trials / testing)
  • Compatible with current MI use cases
    • Infrastructure / Domain / Isolation boundary per customer

Dedicated Sourcegraph instances

  • One Sourcegraph instance serves a single customer
  • (/) Dedicated, Sourcegraph-provided Cloud infrastructure
  • (/) GCP only

Feature compatibility

  • Feature set on-par with self-hosted
    • With time, getting more powerful than self-hosted
  • Features are opt-in (for a fee)
  • New features available on Cloud before self-hosted
  • Existing features have higher adoption on Cloud than self-hosted

Cost-efficiency

  • Expected to support teams from 50 to 5000 users (EOFY23) at a minimum infrastructure cost of $500/month
  • Infrastructure cost covered by Sourcegraph
  • Administration / operations provided by Sourcegraph
  • () Self service provisioning / release channels for upgrades

Not in scope (for / ):

  • supporting customer provided GCP infrastructure
  • supporting cloud providers other than GCP
  • managing Sourcegraph installations in clusters not provisioned by the Cloud team (Bring-your-own-Kubernetes)
  • supporting customers smaller than X1 ARR
  • optimizing cost below X2 $/month

Roadmap

The Cloud team roadmap is available here.

Q3FY23 goals

  • Support the initiative to make the Cloud a preferred deployment method from the platform and infrastructure perspective
  • Cloud v2 - migrate the current Cloud (managed instances) from a single-VM, Docker Compose-based architecture to a multi-node, GKE-based architecture

How to contact the team and ask for help

  • For emergencies and incidents, alert the team using the Slack command /genie alert [message] for cloud and optionally tag the @cloud-support handle.
  • For internal Sourcegraph teammates, join us in the #cloud Slack channel to ask questions or request help from our team.
  • For managed instance requests, or requests for help that require action from Cloud team engineers (e.g. coding, infrastructure changes), please create a GitHub issue and assign the team/cloud label. You can also post a follow-up message in the #cloud Slack channel.
  • You may tag the @cloud-support handle if you are looking for immediate attention, and it will notify our on-call engineers. To protect the team's focus, please avoid tagging or DMing a specific teammate or the @cloud-team handle.

Managed Instance

When to offer a Managed Instance

See below for the SLAs and Technical implementation details (including Security) related to managed instances.

Please message #cloud for any answers or information missing from this page.

When offering customers a Managed Instance, CE and Sales should communicate and gather information on the following topics:

  • Customers are comfortable with the security implications of using a managed instance.
  • Customers’ code hosts should be publicly accessible or able to allow incoming traffic from Sourcegraph-owned static IP addresses. (Note: we do not have proper support for other connectivity methods, e.g. site-to-site VPN.)

Trial Managed Instances (aka PoC)

Documentation

Managed Instance Requests

Customer Engineers (CE) or Sales may request to:

  • Add IP(s) to a Managed Instance Allowlist - [Issue Template]
    • For customers who have IP restrictions on their MI and would like to add new IPs or CIDR ranges.
  • Create a managed instance - [Issue Template]
  • Suspend a managed instance - [Issue Template]
    • For customers or prospects who currently have a managed instance and need to pause their journey, but intend to come back within a couple of months.
  • Tear down a managed instance - [Issue Template]
    • For customers or prospects who have elected to stop their managed instance journey entirely. They accept that they will no longer have access to the data from the instance as it will be permanently deleted.
  • Extend trial Managed Instance issue - [Issue Template]
    • For prospects who need to extend the trial.
  • Convert Trial Managed Instance to paid issue - [Issue Template]
    • For prospects who sign the deal after the trial expires.
  • Enable telemetry on a managed instance - [Issue Template]
    • For customers or prospects who currently have a managed instance and for whom you would like to enable collection of user-level metrics.
  • Disable telemetry on a managed instance - [Issue Template]
    • For customers or prospects who currently have a managed instance and for whom you would like to disable collection of user-level metrics.

Workflow

  1. CE seeks Managed Instance approval from their regional CE Manager
  2. The Regional CE Manager will review the following criteria:
    • Overall, is the deal qualified?
    • Is it technically qualified? We have documented POC success criteria and the customer agrees to the criteria. We have documented the basic technical requirements of the customer (languages, repo types, security, etc.)
    • If anything is non-standard, it must pass the tech review process
  3. If approved, then CE proceeds based on whether this is a standard or non-standard managed instance scenario:
    • For standard managed instance requests (i.e., new instance, no scale concerns, no additional security requirements), CE submits a request to the Cloud team using the corresponding issue template in the sourcegraph/customer repo.
    • For non-standard managed instance requests (i.e., any migrations, special scale or security requirements, or anything considered unusual), CE submits the opportunity to Tech Review before making a request to the Cloud team.
  4. Message the team in #cloud.
  5. If denied, the CE/AE can appeal through the CE/AE leadership chain of command.

Supporting Managed Instances

SLAs for managed instances

Support SLAs for Sev 1 and Sev 2 can be found here. Other engineering SLAs are listed below.

SLAs for internal requests may be extended during outages of upstream service providers. For example, the automated trial instance creation workflow relies on GitHub Actions and cannot run while GitHub is down.

| Request | Description | Response time | Resolution time |
|---|---|---|---|
| New instance creation | Spin up new instance for a new customer | 1 working day | 1 working day from agreement |
| New trial instance creation | Spin up new trial instance for a new customer | 1 working day | 1 working day |
| Existing instance suspension | Suspend an existing managed instance temporarily | Within 24 hours of becoming aware of the need | Within 15 working days from agreement |
| Existing instance deletion/teardown | Decommission/delete an existing managed instance | Within 24 hours of becoming aware of the need | Within 15 working days from agreement |
| New feature request | Feature request from new or existing customers | Within 24 hours of becoming aware of the need | Dependent on the request |
| Maintenance: monthly update to latest release | Updating an instance to the latest release | NA | Within 10 working days after latest release |
| Maintenance: patch/emergency release update | Updating an instance with a patch or emergency release | NA | Within 1 week after patch / emergency release |
| Add IP(s) to Managed Instance | Add new list of IPs to MI allowlist | 1 working day | Within 3 days |

“Agreement” here refers to the date specified in the required GitHub issue.
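
Several of the SLA targets above are expressed in working days. The sketch below shows one way to turn such a target into a calendar date. It assumes a Monday to Friday working week and ignores public holidays, neither of which is defined on this page; the function name and the example dates are hypothetical.

```python
from datetime import date, timedelta


def add_working_days(start: date, working_days: int) -> date:
    """Return the date that falls `working_days` Mon-Fri days after `start`."""
    if working_days < 0:
        raise ValueError("working_days must be non-negative")
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4; skip weekends
            remaining -= 1
    return current


if __name__ == "__main__":
    # Hypothetical release date: Friday, 2023-01-06.
    release = date(2023, 1, 6)
    # "Within 10 working days after latest release"
    print(add_working_days(release, 10))  # 2023-01-20
```

A team that observes regional holidays would need to extend the weekday check with its own holiday calendar.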

Recovery Time Objective and Recovery Point Objective (RTO & RPO)

We have a maximum Recovery Point Objective of 24 hours. Snapshots are performed at least daily on managed instances. Some components may have lower RPOs (e.g. database).

Our maximum Recovery Time Objective is defined by our support SLAs for P1 & P2 incidents.
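
As an illustration of what the 24-hour RPO implies in practice, the sketch below flags instances whose most recent snapshot is older than the objective. It is not the Cloud team's tooling: the function name, instance names and timestamps are hypothetical, and how snapshot times would be obtained is left out.

```python
from datetime import datetime, timedelta, timezone

MAX_RPO = timedelta(hours=24)  # maximum Recovery Point Objective from this page


def rpo_violations(latest_snapshots, now, max_rpo=MAX_RPO):
    """Return the names of instances whose newest snapshot is older than max_rpo.

    latest_snapshots maps an instance name to the timezone-aware datetime
    of its most recent snapshot.
    """
    return sorted(
        name
        for name, taken_at in latest_snapshots.items()
        if now - taken_at > max_rpo
    )


if __name__ == "__main__":
    now = datetime(2023, 1, 2, 12, 0, tzinfo=timezone.utc)
    # Hypothetical snapshot times for two made-up instances.
    snapshots = {
        "customer-a": now - timedelta(hours=3),
        "customer-b": now - timedelta(hours=30),
    }
    print(rpo_violations(snapshots, now))  # ['customer-b']
```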

Incident Response

Incidents which affect managed instances are handled according to our incidents process.

Accessing/Debugging Managed Instances

| Action | Who can do it | Description | How |
|---|---|---|---|
| Reload config | CE/CS | Reload MI site config (restart frontend) | restart frontend |
| View GCP project metrics | Cloud/Security/All SG employees via policy attachment | Access to all MI metrics aggregated in a single project | GCP scoped dashboard |
| View GCP project logs | Cloud/Security/All SG employees via policy attachment | Access customer GCP project logs | GCP logs - change to proper customer name |
| GCP ssh, tunnel ports | Cloud/CS/Engineering | Required for troubleshooting the customer environment and performing pre-defined playbooks | install mi cli; ssh to MI; port-forward to MI; gcloud commands |
| Access CloudSQL database | Cloud/Security/CS/Engineering | Log in to CloudSQL DB | install mi cli; access CloudSQL via mi cli; gcloud commands |
| Login to customer MI web UI | Cloud/CE/CS/Engineering | Site admin access to customer instance | Sourcegraph teammate access to Cloud instances |
| Login to customer Grafana | Cloud/CE/CS/Engineering | Don't; use centralized observability | learn more from centralized o11y |
| List Managed Instances | Cloud/CE | List Managed Instances, filtered by instance type (trial/production/internal) and (optionally) by responsible CE | list Managed Instances |

More Managed Instances can be found here

Processes

FAQ

FAQ: Can customers disable the “Builtin username-password authentication”?

Yes, you may disable the builtin authentication provider and only allow creation of accounts from configured SSO providers.

However, in order to preserve site admin access for Sourcegraph operators, we need to add Sourcegraph’s internal Okta as an authentication provider. Please reach out to our team prior to disabling the builtin provider.

FAQ: How do I restart the frontend after changing the site-config?

Are you a member of our CE & CS teams?

FAQ: What are Cloud plans for observability - can I see data from customer instances in Honeycomb / Grafana Cloud / X?

Cloud instances provisioned for customers provide the same monitoring data / tooling as all other Sourcegraph instances (Grafana/Prometheus for metrics, Jaeger for traces). GCP Logging is used to store / query logs written by Sourcegraph workloads, and GCP Monitoring is used for infrastructure-level metrics / uptime checks.

Access to data from Cloud instances is governed by Cloud Access Control Policy.

Long-term, we will collaborate with the DevX team (as owners of Sourcegraph observability) to support monitoring / observability solutions that are qualified for use with customer data.

FAQ: What are Cloud plans for continuous deployment - how often do we deploy code to Cloud instances?

Cloud instances provisioned for customers run released Sourcegraph versions and are currently updated at least once a month (for minor releases), unless we need to deploy a patch release.

Sourcegraph-owned instances are continuously deployed (with versions that weren’t officially released). The DevX team owns continuous deployment to those environments.

FAQ: What are Cloud plans for analytics - where can I see data from Cloud instances in Looker / Amplitude?

Cloud instances do not expose analytics data other than pings. Future work in this area is owned by the Analytics team and managed through the “Improve our data collection” cross-functional project.

FAQ: Does Cloud support data migrations?

Cloud instances are generally created without any customer data (repos / code-host connections / code / user accounts / code insights etc.).

The Cloud team has an experimental process for importing data from on-premises / jointly-managed Sourcegraph instances, described here for MI v1.1.

FAQ: How to use mi cli for Managed Instances operations?

Follow sourcegraph/deploy-sourcegraph-managed/README.md

For #cloud engineers, run mi reset-customer-password -email <> and it will generate a 1Password share link for you.

The password reset link expires after 24h, so it is quite common for the CE to have to generate a new link during the initial hand-off process.

If access to the instance is restricted, either via VPN or CIDR allowlist, please reach out to #cloud for assistance.

Otherwise, the CE responsible for the customer is added as site admin, so the CE can log in with the “Sourcegraph Employee” (Google Workspace) auth provider and reset the customer admin password. If that does not work, please reach out to #cloud for assistance.

IMPORTANT: Please do not share the password reset URL directly with the customer admin over email or Slack. More context.

Open 1Password, create a new Secure Note item, and paste the password reset URL. Then use the 1Password share item feature to securely share the link with the customer admin. Make sure you configure the following options while sharing the item:

  • Link expires after: 1 day
  • Available to: <insert customer admin email>

This ensures only the customer admin is able to gain access to the password reset URL.

FAQ: I have a new feature I want to deploy to Cloud, how do I do that?

Read through our Cloud Cost Policy

FAQ: How to list trial, production or internal instances?

You can either use:

  • GitHub Action (the CE email parameter is optional).
  • mi cli via command:
mi info --ce <NAME>@sourcegraph.com --instance-type [trial|production|internal] (both parameters are optional)

FAQ: What is the Cloud instance IP?

Use cases:

  • The customer would like to maintain an IP allowlist to permit traffic to their code hosts
  • The customer would like to maintain an IP allowlist to permit the use of their own SMTP service.

Outgoing traffic of Cloud instances goes through Cloud NAT with stable IPs. All IPs are reserved exclusively on a per-customer basis.

There are two groups of IPs.

  1. Primary outgoing IPs: This set of IPs is used by Sourcegraph to communicate directly with customer systems such as code hosts, authentication services, or SMTP services.
  2. (Optional) Executors outgoing IPs: This set of IPs is used by executors for all outgoing traffic. Executors are the technology that powers features like server-side batch changes (SSBC) and code navigation auto-indexing. Under normal circumstances, executors do not communicate directly with customer systems. Customers need to add the executors IPs to their IP allowlist when:
    • Customers are writing a batch change that communicates directly with the code host, e.g. running a custom script that invokes their on-prem GitLab instance API. If customers only use SSBC to modify source code and let Sourcegraph handle the rest (committing and opening PRs), they DO NOT need to allowlist the executors IPs.
    • Customers are using auto-indexing to index repos that use packages from private registries, e.g. NPM packages from self-hosted JFrog Artifactory, or Go packages from self-hosted code hosts. (Note: we do not support indexing repos that use private packages yet; this is here for future reference.)
    • Customers are using container images from a private container registry in build steps during auto-indexing or SSBC. (Note: we do not support private container registries yet; this is here for future reference.)

For #ce teammates, please review above content and reach out to #cloud with sufficient context.

For #cloud teammates, please run:

# Primary outgoing IPs
terraform output -json | jq -r '.cloud_nat_ips.value'
# Executors outgoing IPs
terraform show -json | jq -r '.. | .resources? | select(.!=null) | .[] | select((.address == "module.managed_instance.module.executors[0].module.networking.google_compute_address.nat[0]") and (.mode == "managed")) | .values.address'
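
For readers less familiar with jq, the executors filter above can be read as the following Python sketch: walk the whole JSON document, collect every entry of every "resources" list, and keep the managed resource whose address matches the executors NAT address. The sample state is made up (its IP comes from the 203.0.113.0/24 documentation range) and only mimics the general shape of terraform show -json output; the helper names are hypothetical and this is not part of the mi tooling.

```python
EXECUTOR_NAT_ADDRESS = (
    "module.managed_instance.module.executors[0]"
    ".module.networking.google_compute_address.nat[0]"
)


def iter_resources(node):
    """Yield each item of every list stored under a "resources" key, at any depth."""
    if isinstance(node, dict):
        resources = node.get("resources")
        if isinstance(resources, list):
            yield from resources
        for value in node.values():
            yield from iter_resources(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_resources(item)


def executor_nat_ips(state):
    """Return the addresses of managed resources matching the executors NAT address."""
    return [
        resource["values"]["address"]
        for resource in iter_resources(state)
        if isinstance(resource, dict)
        and resource.get("address") == EXECUTOR_NAT_ADDRESS
        and resource.get("mode") == "managed"
    ]


# Made-up, heavily trimmed stand-in for `terraform show -json` output.
SAMPLE_STATE = {
    "values": {
        "root_module": {
            "child_modules": [
                {
                    "address": "module.managed_instance",
                    "resources": [
                        {"address": "other.resource", "mode": "managed", "values": {}}
                    ],
                    "child_modules": [
                        {
                            "resources": [
                                {
                                    "address": EXECUTOR_NAT_ADDRESS,
                                    "mode": "managed",
                                    "values": {"address": "203.0.113.10"},
                                }
                            ]
                        }
                    ],
                }
            ]
        }
    }
}

if __name__ == "__main__":
    print(executor_nat_ips(SAMPLE_STATE))  # ['203.0.113.10']
```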

FAQ: What code-hosts does Cloud support?

Cloud supports all code host types (self-managed and Cloud-managed), but it currently requires the code host to have a public IP. More context here.

FAQ: What is the difference between air-gapped, private and public code hosts?

  • Air-Gapped Code Host is a code host that is physically isolated from the internet. For example, the code host is deployed on hardware (a server) within the customer's office or private data center, and the only way to connect to it is to be physically connected to the air-gapped network: a user has to be in the office and connected to the air-gapped office network via Ethernet cable or Wi-Fi. In this scenario the only option for Sourcegraph is an on-prem deployment within the same air-gapped network, with all users connecting to the Sourcegraph instance via local IP or local DNS. Please note that Cloud will never be able to support air-gapped code hosts: because they are physically isolated, it is not technically feasible for a Cloud instance to access them.
  • Private Code Host is a code host deployed in a private network (for example, an AWS EC2 instance within a VPC). To connect to this code host, a user has to have access to the private network, usually via VPC peering, VPN, or tunneling.
  • Public Code Host is a code host that is publicly accessible on the internet: a user can curl it via IP or open the URL in the browser. This also includes a code host with a public interface that restricts access to an IP allowlist. The Sourcegraph instance can access this code host without using VPC peering, VPN, or other methods. Of course, access to this code host is still protected by authentication and authorization mechanisms.

FAQ: How do I figure out the GCP Project ID for a customer?

The best way to determine the project ID for a given customer is to look up the customer in the deploy-sourcegraph-managed repo using the following query on S2:

repo:^github\.com/sourcegraph/deploy-sourcegraph-managed$ file:config\.yaml lang:yaml customer: :[_\n]

The customer field should allow you to identify the correct GCP project. If it’s still unclear, a Cloud team member can help on Slack in the #cloud channel.

Alternatively, users with gcloud access can run:

gcloud projects list --filter='labels.mi-security=true' --format="json(projectId,labels)"

and search the results for the customer name. The domain field should include the customer’s domain name allowing the project ID to be identified.
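
Searching the JSON output by hand gets tedious with many projects. The sketch below shows one way to filter it, assuming the output is a JSON list of objects with projectId and labels keys, as requested by the --format flag above. The sample projects are made up, the exact label names and values (including the domain label shown here) are assumptions, and the function name is hypothetical.

```python
import json


def find_projects(projects, needle):
    """Return project IDs whose ID or any label value contains `needle` (case-insensitive)."""
    needle = needle.lower()
    matches = []
    for project in projects:
        project_id = project.get("projectId", "")
        labels = project.get("labels") or {}
        candidates = [project_id, *labels.values()]
        if any(needle in str(value).lower() for value in candidates):
            matches.append(project_id)
    return matches


# Made-up sample in the assumed shape of the gcloud JSON output.
SAMPLE_OUTPUT = """
[
  {"projectId": "sourcegraph-managed-acme",
   "labels": {"mi-security": "true", "domain": "acme-sourcegraph-com"}},
  {"projectId": "sourcegraph-managed-globex",
   "labels": {"mi-security": "true", "domain": "globex-sourcegraph-com"}}
]
"""

if __name__ == "__main__":
    projects = json.loads(SAMPLE_OUTPUT)
    print(find_projects(projects, "acme"))  # ['sourcegraph-managed-acme']
```

In practice the list would come from the gcloud command's output (for example, saved to a file and loaded with json.load) rather than from an inline sample.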