Telemetry Gateway infrastructure operations

This document describes operational guidance for Telemetry Gateway infrastructure. This service is operated on the Managed Services Platform (MSP).

If you need assistance with MSP infrastructure, reach out to the Core Services team in #discuss-core-services.

Service overview

PROPERTY      DETAILS
Service ID    telemetry-gateway (specification)
Owners        core-services
Service kind  Cloud Run service
Environments  dev, prod
Docker image  index.docker.io/sourcegraph/telemetry-gateway
Source code   github.com/sourcegraph/sourcegraph - cmd/telemetry-gateway

The Telemetry Gateway service ingests telemetry v2 events from all Sourcegraph instances, as well as from other managed services.

  • For Sourcegraph instances prior to 5.2.0, no events are exported to Telemetry Gateway, though legacy mechanisms may exist, e.g. for Cloud instances.
  • As of 5.2.0, certain flags can be configured to export events that have been instrumented with the new APIs to Telemetry Gateway.
  • As of 5.2.1, export is enabled by default for Cody events only on existing licenses, and for all events on new licenses. Some license tags can be configured to disable telemetry export to various degrees - see the original Telemetry Export rollout plan.
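
The version gating above can be summarised in a short sketch. It is illustrative only: the function names, the mode labels, and the existing_license flag are hypothetical and are not taken from the Sourcegraph codebase. It assumes plain major.minor.patch version strings and does not model the license tags that can disable export.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse a plain 'major.minor.patch' string into a comparable tuple."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def default_export_mode(version: str, existing_license: bool) -> str:
    """Return a hypothetical label for the default telemetry export behaviour."""
    parsed = parse_version(version)
    if parsed < (5, 2, 0):
        # Prior to 5.2.0: nothing is exported to Telemetry Gateway.
        return "none"
    if parsed < (5, 2, 1):
        # 5.2.0: export only happens when the relevant flags are configured.
        return "opt-in"
    if existing_license:
        # 5.2.1+ with an existing license: Cody events only by default.
        return "cody-only"
    # 5.2.1+ with a new license: all events by default.
    return "all"
```

For example, default_export_mode("5.2.1", existing_license=True) returns "cody-only", while the same version with a new license returns "all".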

For discussion around telemetry v2 adoption, please reach out to #wg-v2-telemetry. For discussion around the Telemetry Gateway service, please reach out to #discuss-core-services.

Querying events

Please reach out to #discuss-analytics for assistance in querying the dataset - Telemetry Gateway only handles ingestion and forwarding data to pipelines operated by the Data Analytics team.

Debugging missing Sourcegraph instance events

  1. Check for a license tag on the instance’s license that disables events - see the original Telemetry Export rollout plan.
    1. Note that external_url export was not added until 5.2.6 - finding events for older instances requires searching events by instance ID.
  2. Check for pings, as that mechanism has not changed, and validate that the instance is on 5.2.1+.
  3. If the above don’t reveal anything, reach out to #discuss-core-services for further debugging at the Telemetry Gateway level.
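
As a small aid for the first step, the sketch below decides whether events from an instance can be looked up by external_url or only by instance ID. The function name is hypothetical, and it assumes plain major.minor.patch version strings; it says nothing about how the dataset itself is queried.

```python
def can_search_by_external_url(version: str) -> bool:
    """Return True if the instance version exports external_url (5.2.6+)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch) >= (5, 2, 6)


def event_lookup_hint(version: str) -> str:
    """Return a human-readable hint on how to find an instance's events."""
    if can_search_by_external_url(version):
        return "search by external_url"
    # Older instances do not export external_url.
    return "search by instance ID"
```

For an instance on 5.2.3, event_lookup_hint returns "search by instance ID".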

Custom metrics

The Telemetry Gateway exports some custom metrics for diagnostics on event ingestion volume as well as export size distributions. These metrics indicate what kinds of event volumes Sourcegraph instances are emitting.

The production Telemetry Gateway instance has a custom metrics dashboard defined in GCP Monitoring: Telemetry Gateway - Custom Metrics

Rollouts

PROPERTY           DETAILS
Delivery pipeline  telemetry-gateway-us-central1-rollout
Stages             dev -> prod

Changes to Telemetry Gateway are continuously delivered to the first stage (dev) of the delivery pipeline.

Promotion of a release to the next stage in the pipeline must be done manually using the GCP Delivery pipeline UI.

Environments

dev

PROPERTY             DETAILS
Project ID           telemetry-gateway-dev-0050
Category             test
Deployment type      rollout
Resources
Slack notifications  #alerts-telemetry-gateway-dev
Alert policies       GCP Monitoring alert policies list, Dashboard
Errors               Sentry telemetry-gateway-dev
Domain               telemetry-gateway.sgdev.org

MSP infrastructure access needs to be requested using Entitle for time-bound privileges. Test environments may have less stringent requirements.

For Terraform Cloud access, see dev Terraform Cloud.

dev Cloud Run

The Telemetry Gateway dev service implementation is deployed on Google Cloud Run.

PROPERTY        DETAILS
Console         Cloud Run service
Service logs    GCP logging
Service traces  Cloud Trace
Service errors  Sentry telemetry-gateway-dev

You can also use sg msp to quickly open a link to your service logs:

sg msp logs telemetry-gateway dev

dev Architecture Diagram

Architecture Diagram

dev Terraform Cloud

This service’s configuration is defined in sourcegraph/managed-services/services/telemetry-gateway/service.yaml, and sg msp generate telemetry-gateway dev generates the required infrastructure configuration for this environment in Terraform. Terraform Cloud (TFC) workspaces specific to each service then provision the required infrastructure from this configuration. You may want to check your service environment’s TFC workspaces if a Terraform apply fails (reported via GitHub commit status checks in the sourcegraph/managed-services repository, or in #alerts-msp-tfc).

To access this environment’s Terraform Cloud workspaces, you will need to log in to Terraform Cloud and then request Entitle access to membership in the “Managed Services Platform Operator” TFC team. The “Managed Services Platform Operator” team has access to all MSP TFC workspaces.

The Terraform Cloud workspaces for this service environment are grouped under the msp-telemetry-gateway-dev tag, or you can use:

sg msp tfc view telemetry-gateway dev

prod

PROPERTY             DETAILS
Project ID           telemetry-gateway-prod-acae
Category             external
Deployment type      rollout
Resources
Slack notifications  #alerts-telemetry-gateway-prod
Alert policies       GCP Monitoring alert policies list, Dashboard
Errors               Sentry telemetry-gateway-prod
Domain               telemetry-gateway.sourcegraph.com

MSP infrastructure access needs to be requested using Entitle for time-bound privileges.

For Terraform Cloud access, see prod Terraform Cloud.

prod Cloud Run

The Telemetry Gateway prod service implementation is deployed on Google Cloud Run.

PROPERTY        DETAILS
Console         Cloud Run service
Service logs    GCP logging
Service traces  Cloud Trace
Service errors  Sentry telemetry-gateway-prod

You can also use sg msp to quickly open a link to your service logs:

sg msp logs telemetry-gateway prod

prod Architecture Diagram

Architecture Diagram

prod Terraform Cloud

This service’s configuration is defined in sourcegraph/managed-services/services/telemetry-gateway/service.yaml, and sg msp generate telemetry-gateway prod generates the required infrastructure configuration for this environment in Terraform. Terraform Cloud (TFC) workspaces specific to each service then provision the required infrastructure from this configuration. You may want to check your service environment’s TFC workspaces if a Terraform apply fails (reported via GitHub commit status checks in the sourcegraph/managed-services repository, or in #alerts-msp-tfc).

To access this environment’s Terraform Cloud workspaces, you will need to log in to Terraform Cloud and then request Entitle access to membership in the “Managed Services Platform Operator” TFC team. The “Managed Services Platform Operator” team has access to all MSP TFC workspaces.

The Terraform Cloud workspaces for this service environment are grouped under the msp-telemetry-gateway-prod tag, or you can use:

sg msp tfc view telemetry-gateway prod
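
The per-environment commands and tag names in this document follow one pattern, which the sketch below collects in a single place. The helper msp_references is hypothetical and only formats strings that already appear in this document; it does not invoke sg or check that the commands succeed.

```python
SERVICE_ID = "telemetry-gateway"
ENVIRONMENTS = ("dev", "prod")


def msp_references(env: str) -> dict[str, str]:
    """Build the sg msp commands and TFC tag for one environment."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment {env!r}, expected one of {ENVIRONMENTS}")
    return {
        # Opens a link to the service logs.
        "logs": f"sg msp logs {SERVICE_ID} {env}",
        # Generates the Terraform configuration for the environment.
        "generate": f"sg msp generate {SERVICE_ID} {env}",
        # Opens the Terraform Cloud workspaces for the environment.
        "tfc_view": f"sg msp tfc view {SERVICE_ID} {env}",
        # Tag under which the environment's TFC workspaces are grouped.
        "tfc_tag": f"msp-{SERVICE_ID}-{env}",
    }
```

For example, msp_references("prod")["tfc_tag"] gives "msp-telemetry-gateway-prod".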

Alert Policies

The following alert policies are defined for each of this service’s environments.

High Container CPU Utilization

High CPU Usage - it may be necessary to reduce load or increase CPU allocation

Severity: WARNING

High Container Memory Utilization

High Memory Usage - it may be necessary to reduce load or increase memory allocation

Severity: WARNING

Container Startup Latency

Service containers are taking longer than configured timeouts to start up.

Severity: WARNING

Cloud Run Pending Requests

There are requests pending - we may need to increase Cloud Run instance count, request concurrency, or investigate further.

Severity: WARNING

Cloud Run Instance Precondition Failed

Cloud Run instance failed to start due to a precondition failure.
This is unlikely to cause immediate downtime, and may auto-resolve if no new instances are created and/or we return to a healthy state, but you should follow up to ensure the latest Cloud Run revision is healthy.

Severity: WARNING

External Uptime Check

Service is failing to respond on https://telemetry-gateway.sourcegraph.com - this may be expected if the service was recently provisioned or if its external domain has changed.

Severity: CRITICAL

Container Instance Count

There are a lot of Cloud Run instances running - we may need to increase per-instance requests to make sure we won't hit the configured max instance count.

Severity: WARNING
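
To see why raising per-instance request concurrency helps with the Container Instance Count alert, a rough back-of-the-envelope estimate can be sketched as follows. The function names and all numbers are hypothetical, and real Cloud Run autoscaling depends on more than request concurrency, so treat this as an approximation rather than a prediction.

```python
import math


def instances_needed(concurrent_requests: int, per_instance_concurrency: int) -> int:
    """Estimate how many instances are needed to serve the given concurrent load."""
    if per_instance_concurrency <= 0:
        raise ValueError("per_instance_concurrency must be positive")
    return math.ceil(concurrent_requests / per_instance_concurrency)


def instance_headroom(
    concurrent_requests: int, per_instance_concurrency: int, max_instances: int
) -> int:
    """Return how many instances remain before the configured maximum is reached."""
    return max_instances - instances_needed(concurrent_requests, per_instance_concurrency)
```

With 1000 concurrent requests and a per-instance concurrency of 80, the estimate is 13 instances; doubling the concurrency to 160 lowers it to 7, which leaves more headroom below the configured max instance count.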