Our current on-call alerting system is Opsgenie, which sends notifications via voice call, SMS, email, and Slack. Teams at Sourcegraph that run production systems and need to be alerted to potential issues should have an Opsgenie rotation.
Access to Opsgenie is granted automatically to all Engineering teammates, and you can log in with your Okta credentials at https://sourcegraph.app.opsgenie.com/.
After logging in to Opsgenie for the first time, work with your EM to join a team and participate in the rotation.
Creating an Opsgenie team
Admins can view the current Opsgenie teams and create new ones. Engineering managers should have access to this page; notify the DevOps team if they do not.
Creating an Opsgenie rotation
See the official Opsgenie docs.
Ensure that the first escalation after the on-call engineer is “Alert the entire team”, and that the final escalation is set to “Send to the DevOps team” or “Send to all teams”, based on severity.
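If you want to sanity-check an escalation policy before saving it, the two rules above can be expressed as a small check. This is a minimal sketch under stated assumptions: `EscalationStep`, `check_policy`, the "Notify on-call user" label, and the delay values are hypothetical and do not reflect Opsgenie's API or data model; only the “Alert the entire team”, “Send to the DevOps team”, and “Send to all teams” choices come from this page.

```python
from dataclasses import dataclass
from typing import List

# Escalation choices named in this handbook page.
ALERT_TEAM = "Alert the entire team"
FINAL_CHOICES = {"Send to the DevOps team", "Send to all teams"}


@dataclass
class EscalationStep:
    """Hypothetical, simplified escalation rule (not Opsgenie's schema)."""
    delay_minutes: int  # minutes after alert creation, if still unacknowledged
    action: str


def check_policy(steps: List[EscalationStep]) -> List[str]:
    """Return a list of problems; an empty list means the policy follows the guideline."""
    if len(steps) < 3:
        return ["expected on-call, whole-team, and final escalation steps"]
    problems = []
    # Step 0 is the on-call engineer; step 1 is the first escalation after them.
    if steps[1].action != ALERT_TEAM:
        problems.append(f"first escalation should be {ALERT_TEAM!r}, got {steps[1].action!r}")
    if steps[-1].action not in FINAL_CHOICES:
        problems.append(f"final escalation should be one of {sorted(FINAL_CHOICES)}, "
                        f"got {steps[-1].action!r}")
    return problems


# Usage with hypothetical delays:
policy = [
    EscalationStep(0, "Notify on-call user"),
    EscalationStep(5, ALERT_TEAM),
    EscalationStep(15, "Send to the DevOps team"),
]
print(check_policy(policy))  # []
```

Whether the final step goes to DevOps or to all teams is a severity decision the check deliberately leaves to you; it only verifies that one of the two was chosen.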
Creating split rotations
Otherwise known as “follow-the-sun” rotations, split on-call rotations are a great way to leverage distributed, worldwide teams to make sure no one has to wake up in the middle of the night! If you need a “follow-the-sun” rotation with additional flexibility, check out this approach.
For example, if half your team is in North America and the other half is in Europe, you could split the rotation into groups on either side of the Atlantic, such as:
- a UTC-8 PT (Pacific Time) group covering UTC 19:00 - 07:00 (PT 11am - 11pm)
- a UTC+1 CET (Central European Time) group covering UTC 07:00 - 19:00 (CET 8am - 8pm)
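To confirm that two groups like these cover the full day without overlapping, you can convert each local shift to UTC and compare the hours. This is a minimal sketch: it uses the fixed offsets from the example (UTC-8 and UTC+1) and ignores daylight saving time, which shifts both PT and CET by an hour for part of the year. The helper names `utc_window` and `covered_hours` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets from the example; real PT/CET observe daylight saving time.
PT = timezone(timedelta(hours=-8))
CET = timezone(timedelta(hours=1))


def utc_window(tz: timezone, start_hour: int, end_hour: int) -> tuple:
    """Convert a local [start, end) shift to UTC hours of day."""
    day = datetime(2024, 1, 15)  # arbitrary reference date
    start = day.replace(hour=start_hour, tzinfo=tz).astimezone(timezone.utc)
    end = day.replace(hour=end_hour, tzinfo=tz).astimezone(timezone.utc)
    return start.hour, end.hour


def covered_hours(start: int, end: int) -> set:
    """UTC hours covered by a window, wrapping past midnight (start == end is treated as empty)."""
    return {(start + h) % 24 for h in range((end - start) % 24)}


pt = utc_window(PT, 11, 23)   # PT 11am - 11pm
cet = utc_window(CET, 8, 20)  # CET 8am - 8pm
print(pt, cet)  # (19, 7) (7, 19)

pt_hours, cet_hours = covered_hours(*pt), covered_hours(*cet)
print(pt_hours.isdisjoint(cet_hours))         # True: no overlap
print(pt_hours | cet_hours == set(range(24)))  # True: full 24-hour coverage
```

Each group covers 12 hours, so the handoff happens at 07:00 and 19:00 UTC.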
To set this up in Opsgenie:
- Under your schedule, click “Create a new rotation”, set it to the desired interval (e.g. “weekly”), and select the “Restrict to time intervals” option. Configure the interval’s start and end times to match your first group. Make sure to update the time in “Start on” to match the interval’s start time.
- Repeat the above, except set “Restrict to time intervals” times to match the second group, again ensuring the time in “Start on” matches the interval’s start time.
- Assign the appropriate teammates to both of the above.
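The most common mistake in these steps is a “Start on” time that does not match the restricted interval’s start, or a rotation with nobody assigned. The sketch below models the rotation form as a plain data structure so both checks can be written down. `Rotation` and `setup_problems` are hypothetical and are not part of any Opsgenie API; the sketch also assumes the schedule’s time zone is UTC, so the two groups use the 19:00 - 07:00 and 07:00 - 19:00 UTC windows, and the teammate names are placeholders.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from typing import List


@dataclass
class Rotation:
    """Hypothetical, simplified model of the Opsgenie rotation form."""
    name: str
    interval: str          # e.g. "weekly"
    restrict_start: time   # "Restrict to time intervals" start
    restrict_end: time     # "Restrict to time intervals" end
    start_on: datetime     # the "Start on" field
    participants: List[str] = field(default_factory=list)


def setup_problems(rot: Rotation) -> List[str]:
    """Flag the setup mistakes called out in the steps above."""
    problems = []
    if rot.start_on.time() != rot.restrict_start:
        problems.append(f"{rot.name}: 'Start on' time {rot.start_on.time()} "
                        f"does not match interval start {rot.restrict_start}")
    if not rot.participants:
        problems.append(f"{rot.name}: no teammates assigned")
    return problems


# Usage: the second rotation's "Start on" was left at 09:00 by mistake.
west = Rotation("PT group", "weekly", time(19), time(7),
                datetime(2024, 1, 15, 19, 0), ["alice", "bob"])
east = Rotation("CET group", "weekly", time(7), time(19),
                datetime(2024, 1, 15, 9, 0), ["carol"])
for r in (west, east):
    print(setup_problems(r))
# []
# ["CET group: 'Start on' time 09:00:00 does not match interval start 07:00:00"]
```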
And that’s it! The resulting schedule should look like this, using the example groupings above:
Alerts on Cloud
Opsgenie alerts on Cloud are configured through the following:
- The site-config
- The Opsgenie team
- The ObservableOwner
(This process may change with #34861)
Integration with Slack
Alerts can be sent to specific Slack channels using the Opsgenie Slack integration.
Slackgenie is used to update a Slack user group with the members of a support rotation, e.g. updating the delivery-support Slack handle so that notifications go directly to the on-call user.
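Slackgenie’s internals are not described here, so the following is only an illustrative sketch of the core sync step, not Slackgenie’s actual code. It assumes the current on-call users’ emails have already been fetched (for example, from Opsgenie’s Who-Is-On-Call API) and mapped to Slack user IDs (for example, with Slack’s `users.lookupByEmail` method). It then builds the parameters for Slack’s `usergroups.users.update` method, which replaces a user group’s members with a comma-separated list of user IDs. The HTTP calls are omitted, and `build_usergroup_update` and all IDs shown are hypothetical.

```python
from typing import Dict, List


def build_usergroup_update(usergroup_id: str,
                           oncall_emails: List[str],
                           slack_ids_by_email: Dict[str, str]) -> Dict[str, str]:
    """Build parameters for Slack's usergroups.users.update from on-call emails."""
    missing = [e for e in oncall_emails if e not in slack_ids_by_email]
    if missing:
        raise ValueError(f"no Slack account found for: {', '.join(missing)}")
    # De-duplicate while preserving order; Slack expects comma-separated user IDs.
    ids = list(dict.fromkeys(slack_ids_by_email[e] for e in oncall_emails))
    return {"usergroup": usergroup_id, "users": ",".join(ids)}


# Usage with placeholder IDs:
payload = build_usergroup_update(
    "S0123ABCD",
    ["oncall@example.com"],
    {"oncall@example.com": "U0456EFGH"},
)
print(payload)  # {'usergroup': 'S0123ABCD', 'users': 'U0456EFGH'}
```

Failing loudly on an unmapped email is a deliberate choice in this sketch: silently dropping someone would leave the handle pointing at nobody during their shift.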