MSP Testbed infrastructure operations

This document describes operational guidance for MSP Testbed infrastructure. This service is operated on the Managed Services Platform (MSP).

If you need assistance with MSP infrastructure, reach out to the Core Services team in #discuss-core-services.

Service overview

PROPERTY       DETAILS
Service ID     msp-testbed (specification)
Owners         core-services
Service kind   Cloud Run service
Environments   test, robert
Docker image   us.gcr.io/sourcegraph-dev/msp-example
Source code    github.com/sourcegraph/sourcegraph - cmd/msp-example

This is a test environment used by the Core Services team for experimenting with MSP infrastructure changes. Each Core Services teammate generally focuses their experiments on an individual environment of this service.

Rollouts

PROPERTY            DETAILS
Delivery pipeline   msp-testbed-us-central1-rollout
Stages              test -> robert

Changes to MSP Testbed are continuously delivered to the first stage (test) of the delivery pipeline.

Promotion of a release to the next stage in the pipeline must be done manually using the GCP Delivery pipeline UI.

Environments

test

PROPERTY              DETAILS
Project ID            msp-testbed-test-77589aae45d0
Category              internal
Deployment type       rollout
Resources             test Redis, test PostgreSQL instance, test BigQuery dataset
Slack notifications   #alerts-msp-testbed-test
Alert policies        GCP Monitoring alert policies list, Dashboard
Errors                Sentry msp-testbed-test
Domain                msp-testbed.sgdev.org
Cloudflare WAF

MSP infrastructure access needs to be requested using Entitle for time-bound privileges.

For Terraform Cloud access, see test Terraform Cloud.

test Cloud Run

The MSP Testbed test service implementation is deployed on Google Cloud Run.

PROPERTY         DETAILS
Console          Cloud Run service
Service logs     GCP logging
Service traces   Cloud Trace
Service errors   Sentry msp-testbed-test

You can also use sg msp to quickly open a link to your service logs:

sg msp logs msp-testbed test
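
If you prefer to stay in a terminal rather than follow the link that sg msp opens, a Cloud Logging filter along the following lines may help. This is only a sketch: the helper name cloud_run_log_filter is hypothetical, and it assumes the Cloud Run service is named msp-testbed, which this document does not confirm. The commented gcloud invocation further assumes that gcloud is installed and that Entitle access has already been granted.

```shell
# Hypothetical helper: build a Cloud Logging filter for a Cloud Run service.
# $1 = Cloud Run service name (an assumption, not confirmed by this document)
# $2 = minimum severity (defaults to DEFAULT, which matches all entries)
cloud_run_log_filter() {
  printf 'resource.type="cloud_run_revision" AND resource.labels.service_name="%s" AND severity>=%s' \
    "$1" "${2:-DEFAULT}"
}

# Example usage (requires gcloud and access to the project):
# gcloud logging read "$(cloud_run_log_filter msp-testbed ERROR)" \
#   --project=msp-testbed-test-77589aae45d0 --limit=20 --freshness=1h
```

Passing the other environment's project ID should work the same way for robert.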

test Redis

PROPERTY   DETAILS
Console    Memorystore Redis instances

test PostgreSQL instance

PROPERTY    DETAILS
Console     Cloud SQL instances
Databases   primary

To connect to the PostgreSQL instance in this environment, use sg msp in the sourcegraph/managed-services repository:

# For read-only access
sg msp pg connect msp-testbed test

# For write access - use with caution!
sg msp pg connect -write-access msp-testbed test

test BigQuery dataset

PROPERTY          DETAILS
Dataset Project   msp-testbed-test-77589aae45d0
Dataset ID        msp_testbed
Tables            example
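
The project, dataset, and table listed above combine into a fully qualified BigQuery table reference of the form project.dataset.table. The sketch below shows one way to assemble it; the helper name bq_table_ref is hypothetical, and the commented query assumes the bq CLI is installed and that you have read access to the dataset.

```shell
# Hypothetical helper: join project, dataset, and table into a
# fully qualified BigQuery table reference (project.dataset.table).
bq_table_ref() {
  printf '%s.%s.%s\n' "$1" "$2" "$3"
}

# Example usage (requires the bq CLI and read access to the dataset):
# bq query --use_legacy_sql=false \
#   "SELECT * FROM \`$(bq_table_ref msp-testbed-test-77589aae45d0 msp_testbed example)\` LIMIT 10"
```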

test Architecture Diagram

[Figure: architecture diagram of the test environment]

test Terraform Cloud

This service’s configuration is defined in sourcegraph/managed-services/services/msp-testbed/service.yaml, and sg msp generate msp-testbed test generates the required infrastructure configuration for this environment in Terraform. Terraform Cloud (TFC) workspaces specific to each service then provision the required infrastructure from this configuration. You may want to check your service environment’s TFC workspaces if a Terraform apply fails (reported via GitHub commit status checks in the sourcegraph/managed-services repository, or in #alerts-msp-tfc).

To access this environment’s Terraform Cloud workspaces, you will need to log in to Terraform Cloud and then request Entitle access to membership in the “Managed Services Platform Operator” TFC team. The “Managed Services Platform Operator” team has access to all MSP TFC workspaces.

The Terraform Cloud workspaces for this service environment are grouped under the msp-msp-testbed-test tag, or you can use:

sg msp tfc view msp-testbed test
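
When searching Terraform Cloud by tag, it can be handy to derive the tag name from the service ID and environment. The sketch below assumes the pattern msp-<service>-<environment>, which is inferred from the tags named in this document rather than from any documented rule; the helper name msp_tfc_tag is hypothetical.

```shell
# Hypothetical helper: derive the TFC workspace tag for a service environment.
# The msp-<service>-<environment> pattern is an inference from the tags
# named in this document, not a documented guarantee.
msp_tfc_tag() {
  printf 'msp-%s-%s\n' "$1" "$2"
}

msp_tfc_tag msp-testbed test   # prints: msp-msp-testbed-test
```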

robert

MSP infrastructure access needs to be requested using Entitle for time-bound privileges. Test environments may have less stringent requirements.

For Terraform Cloud access, see robert Terraform Cloud.

robert Cloud Run

The MSP Testbed robert service implementation is deployed on Google Cloud Run.

PROPERTY         DETAILS
Console          Cloud Run service
Service logs     GCP logging
Service traces   Cloud Trace
Service errors   Sentry msp-testbed-robert

You can also use sg msp to quickly open a link to your service logs:

sg msp logs msp-testbed robert

robert Redis

PROPERTY   DETAILS
Console    Memorystore Redis instances

robert PostgreSQL instance

PROPERTY    DETAILS
Console     Cloud SQL instances
Databases   primary

To connect to the PostgreSQL instance in this environment, use sg msp in the sourcegraph/managed-services repository:

# For read-only access
sg msp pg connect msp-testbed robert

# For write access - use with caution!
sg msp pg connect -write-access msp-testbed robert

robert BigQuery dataset

PROPERTY          DETAILS
Dataset Project   msp-testbed-robert-7be9
Dataset ID        msp_testbed
Tables            example

robert Architecture Diagram

[Figure: architecture diagram of the robert environment]

robert Terraform Cloud

This service’s configuration is defined in sourcegraph/managed-services/services/msp-testbed/service.yaml, and sg msp generate msp-testbed robert generates the required infrastructure configuration for this environment in Terraform. Terraform Cloud (TFC) workspaces specific to each service then provision the required infrastructure from this configuration. You may want to check your service environment’s TFC workspaces if a Terraform apply fails (reported via GitHub commit status checks in the sourcegraph/managed-services repository, or in #alerts-msp-tfc).

To access this environment’s Terraform Cloud workspaces, you will need to log in to Terraform Cloud and then request Entitle access to membership in the “Managed Services Platform Operator” TFC team. The “Managed Services Platform Operator” team has access to all MSP TFC workspaces.

The Terraform Cloud workspaces for this service environment are grouped under the msp-msp-testbed-robert tag, or you can use:

sg msp tfc view msp-testbed robert

Alert Policies

The following alert policies are defined for each of this service’s environments.

Cloud SQL - Connections

The number of Cloud SQL connections is approaching the configured maximum.
This can be caused by an increase in the number of active service instances.

Try increasing the 'resource.postgreSQL.maxConnections' configuration parameter.

Severity: WARNING
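
Before raising the connection limit, it may help to see how close the instance actually is to it. The sketch below is an illustration only: the SQL statements in the comments are standard PostgreSQL, but the helper name pg_conn_utilization is hypothetical, and the threshold at which this alert fires is not stated in this document.

```shell
# In a SQL session against the instance, the two inputs can be read with:
#   SELECT count(*) FROM pg_stat_activity;   -- current connections
#   SHOW max_connections;                    -- configured limit
#
# Hypothetical helper: integer percentage of the connection limit in use.
# $1 = current connections, $2 = max_connections
pg_conn_utilization() {
  echo $(( $1 * 100 / $2 ))
}

pg_conn_utilization 80 100   # prints: 80
```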

Cloud SQL - CPU Utilization

Cloud SQL instance CPU utilization is above acceptable threshold.

Severity: WARNING

Cloud SQL - Disk Utilization

Cloud SQL instance disk utilization is above acceptable threshold.

Severity: WARNING

Cloud SQL - Memory Utilization

Cloud SQL instance memory utilization is above acceptable threshold.

Severity: WARNING

Cloud SQL - Server Availability

Cloud SQL instance is down.

Severity: WARNING

Cloud SQL - Spike in Per-Query Lock Time

Cloud SQL database queries encountered lock times well above acceptable thresholds.

Severity: WARNING

Cloud SQL - Sustained Per-Query Lock Times

Cloud SQL database queries are encountering lock times above acceptable thresholds over a window.

Severity: WARNING

High Container CPU Utilization

High CPU usage - it may be necessary to reduce load or increase CPU allocation.

Severity: WARNING

High Container Memory Utilization

High memory usage - it may be necessary to reduce load or increase memory allocation.

Severity: WARNING

Container Startup Latency

Service containers are taking longer than configured timeouts to start up.

Severity: WARNING

Cloud Redis - System CPU Utilization

Redis engine CPU utilization is above the set threshold. The utilization is measured on a scale of 0 to 1.

Severity: WARNING

Cloud Redis - Standard Instance Failover

Instance failover occurred for a standard tier Redis instance.

Severity: WARNING

Cloud Redis - System Memory Utilization

Redis system memory utilization is above the set threshold. The utilization is measured on a scale of 0 to 1.

Severity: WARNING

Cloud Run Pending Requests

There are requests pending - we may need to increase Cloud Run instance count, request concurrency, or investigate further.

Severity: WARNING

Cloud Run Instance Precondition Failed

Cloud Run instance failed to start due to a precondition failure.
This is unlikely to cause immediate downtime, and may auto-resolve if no new instances are created and/or we return to a healthy state, but you should follow up to ensure the latest Cloud Run revision is healthy.

Severity: WARNING

External Uptime Check

Service is failing to respond on https://msp-testbed-robert.sgdev.org - this may be expected if the service was recently provisioned or if its external domain has changed.

Severity: CRITICAL
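
To confirm an uptime alert by hand, you can probe the endpoint yourself. The sketch below is a rough approximation under stated assumptions: the helper name is_responding is hypothetical, and treating any 2xx or 3xx status as healthy is a guess, since the real uptime check's success criteria are not described in this document.

```shell
# Hypothetical helper: treat 2xx and 3xx HTTP status codes as "responding".
# The real uptime check's success criteria are an assumption here.
is_responding() {
  case "$1" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage (requires curl and network access):
# status="$(curl -sS -o /dev/null -w '%{http_code}' https://msp-testbed-robert.sgdev.org)"
# if is_responding "$status"; then echo "responding"; else echo "not responding ($status)"; fi
```

When curl cannot connect at all, it reports a status of 000, which this helper classifies as not responding.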