Derived from Google Docs.
To better load balance support for high priority services in go/whodoinotify, the infrastructure org is implementing a shared on-call rotation.
The on-call engineer is responsible for carrying the pager and respond to high priority issues (P0/P1 from TS, P1/P2 on Opsgenie). The on-call engineer is not expecting to handle non-critical operational work, they will still be delegated to the product team.
Services in priority support category should have higher prority than other systems in the case of simultaneous outage. Either by prioritizing them over other systems, or engage secondary or team members to handle non-priority support system failure.
The source of truth for prioritization is go/whodoinotify, as this page might be out of date.
As an on-call engineer, you are expected to respond to pages and follow playbook to handle expected failure or hints to triage unexpected production issue. You are not expected to solve the issue on your own, but you should be the first responder. If it is approaching your end of shift or you are unavaiable to handle the issue, kindly acknowledge the page and delegate to your teammates.
Each shift is 7-day long, starting and ending on Monday 9AM UTC.
At the start of their shift, the Incoming Engineer On-Call will prepare to take over support rotation duties from the Outgoing Engineer On-Call. The
This document is designed to help the Incoming Engineer On-Call get up to speed with any open issues that require their attention during their shift. Additionally, this is designed to provide a mechanism for the Outgoing Engineer On-Call to offload support duties in order to transition back into project work.
Upon reviewing the notes, the Incoming Engineer On-Call should create a new entry in go/infra-on-call from the template.
This can be done in the OpsGenie interface by supplying your cellphone number or by downloading the OpsGenie app. The method you choose to be contacted is up to you as long as your response time follows the Response Time guidelines.
We recommend setting a 5-10 min delay in your OpsGenie during your sleep hours. Usually teammates in the opposite timezone should be able to acknowledge the page soon enough to avoid escalation. We expect this to get better once we are able to implement a follow-the-sun schedule.
- Respond to pages from OpsGenie
- Keep notes about any incidents, alerts, and priority support issues during your shift in go/infra-on-call
The engineer that is starting their support rotation.
The engineer that is ending their support rotation.
The engineer that is currently support rotation.