What an On-Call SRE Actually Does at 3AM When the Pager Goes Off

What an On-Call SRE Actually Does at 3AM When the Pager Goes Off — ThirdShiftPress

The phone doesn't ring like a phone. It does that specific tone you picked six months ago because every other tone started giving you a stress response in elevators. You're already sitting up before your eyes are open. Your partner doesn't even stir anymore. The dog might. Somewhere in a datacenter you have never visited, in a region you have never set foot in, a disk is full or a certificate has expired or a deploy from eleven hours ago has finally decided to show its true personality. You are the person who answers.

This is what that hour actually looks like, minute by minute, for anyone who has ever had to explain at a barbecue what they do for a living and watched the questioner's eyes glaze over by the second sentence.

The First Ninety Seconds: Triage Before Consciousness

You do not wake up like a normal person. You wake up into a state that the rest of the population only experiences during car accidents. Laptop open. Glasses on, or not, depending on whether the alert is a P1 or a P2. The screen is at brightness zero because at 3AM full brightness feels like a personal attack.

The first thing you check is not the alert. The first thing you check is whether the alert is real. Roughly thirty percent of pages are the monitoring system itself having a moment. A probe in one region hiccupped. A metric pipeline lagged. A threshold was set by someone in 2021 who has since left the company, and the service it monitored has been renamed twice. You learn to recognize the smell.
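One way to turn that smell into something a machine can check is a quorum rule across probe locations: a single region failing is a hiccup, most of them failing is an outage. The sketch below is only an illustration. The function name, the region labels, and the 50 percent quorum are all assumptions, not a description of any particular monitoring system.

```python
def is_probably_real(probe_results, quorum=0.5):
    """Return True when more than `quorum` of probe regions report a failure.

    probe_results maps a (hypothetical) region name to True if the probe
    from that region failed. An empty result set is treated as "not real",
    which is itself an assumption: missing data may deserve its own alert.
    """
    if not probe_results:
        return False
    failing = sum(1 for failed in probe_results.values() if failed)
    return failing / len(probe_results) > quorum


# Hypothetical probe snapshots.
one_hiccup = {"us": True, "eu": False, "ap": False}
real_outage = {"us": True, "eu": True, "ap": False}

print(is_probably_real(one_hiccup))
print(is_probably_real(real_outage))
```

A rule like this trades a little detection speed for fewer false pages. It would also hide a genuine single-region outage, so it suits global health checks better than per-region ones.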

If it's real, you start a timer in your head. Not because anyone's watching — well, someone probably is, but that's not the point. You start it because the difference between a fifteen-minute incident and a ninety-minute incident is mostly about how quickly you stop staring and start typing. The dashboard loads. You scan four graphs at once: error rate, latency p99, saturation, and traffic. Two of them are wrong. You already have a guess.

The Runbook Lie

Every postmortem promises a better runbook. Every runbook, by month six, is partially a lie. The service has moved on. The flags have changed. The on-call rotation forgot to update the doc when they migrated from the old auth system. You open it anyway, because sometimes there's one good command in there that saves you twenty minutes, and because if you don't open it and something goes sideways, the postmortem is going to ask why you didn't open it.

What you actually do is read the runbook with one eye while your other eye is in the logs. You're looking for the shape of the problem, not the answer. Is this a single instance going bad, or all of them at once? Is it correlated with a deploy? Did anything change in the last hour, six hours, twenty-four hours? You check the change feed. There's always something in the change feed. The hard part is figuring out which change is the one.
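The change-feed pass can be sketched as a filter and a sort: keep everything that landed in the lookback window before the incident started, most recent first. The record shape, the date, and the specific entries below are hypothetical. The ordering is only a first guess, since a deploy from eleven hours ago can be the culprit just as easily as a flag flipped forty minutes ago.

```python
from datetime import datetime, timedelta, timezone


def suspects(changes, incident_start, lookback_hours=24):
    """Return changes made in the lookback window before the incident,
    most recent first. Changes after the incident start are excluded."""
    window_start = incident_start - timedelta(hours=lookback_hours)
    in_window = [c for c in changes if window_start <= c["at"] <= incident_start]
    return sorted(in_window, key=lambda c: c["at"], reverse=True)


# Hypothetical incident start and change feed (the date is made up).
incident_start = datetime(2024, 1, 9, 2, 47, tzinfo=timezone.utc)
changes = [
    {"what": "deploy abc123", "at": incident_start - timedelta(hours=11)},
    {"what": "flag flip: new-cache", "at": incident_start - timedelta(minutes=40)},
    {"what": "cert rotation", "at": incident_start - timedelta(days=3)},
    {"what": "config push", "at": incident_start + timedelta(minutes=5)},
]

ranked = suspects(changes, incident_start)
for change in ranked:
    print(f'{change["what"]}: {incident_start - change["at"]} before incident')
```

Widening `lookback_hours` from 1 to 6 to 24 mirrors the order in which you ask the question at 3AM.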

This is the part of the job that doesn't fit on a resume. You are doing real-time pattern matching against six years of accumulated war stories, half of which happened to other people on other teams in other companies that you only heard about because someone wrote a blog post about it in 2019. You are a walking, half-conscious correlation engine. Nobody trained you for this. You trained yourself by being woken up enough times.

What "Mitigating" Actually Means

There is a sacred distinction in this job between mitigation and fix. Civilians don't understand it. Your manager understands it. Your skip-level might understand it, depending on their background. The CEO does not understand it and should not be allowed to.

Mitigation is what you do at 3AM. You roll back the deploy. You drain the bad region. You scale the service horizontally to absorb the spike. You flip the feature flag. You bump the connection pool. You restart the pod that has been quietly leaking memory for nine hours. You are not solving the problem. You are putting the problem in a box so you can go back to bed and solve it during business hours like a person.

The fix is for Tuesday. The fix involves a PR, code review, a staging deploy, maybe a design doc if it's load-bearing enough. The fix is calm. The fix is the part you'd actually enjoy if you weren't doing it on three hours of sleep with a coffee headache.

At 3AM you are not an engineer. You are a paramedic. You stabilize the patient. You hand them off. You write down what you did so the surgeon knows where you cut.

The Incident Channel Performance

If the incident escalates — if it's bigger than you, or if customers are noticing, or if revenue is bleeding — you open the incident channel. There's an etiquette to this and nobody teaches it. You learn by watching senior people do it during outages and then doing it yourself badly the first three times.

You announce what you see. You announce what you're trying. You announce the result. You do not speculate in writing. You do not say "I think it might be the database" unless you have a graph. You say "error rate elevated on service X starting 02:47 UTC, investigating correlation with deploy abc123." You become a different person in that channel. Flatter. More precise. The dry humor stays but the volume drops.
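That flat, graph-backed style is regular enough to template. The helper below is a minimal sketch with a made-up function name and a made-up date. It just forces every update into the same three parts: what is observed, since when in UTC, and what is being tried.

```python
from datetime import datetime, timezone


def status_update(observed, since, action):
    """Format one incident-channel line: observation, UTC start time, action."""
    stamp = since.astimezone(timezone.utc).strftime("%H:%M UTC")
    return f"{observed} starting {stamp}, {action}"


msg = status_update(
    observed="error rate elevated on service X",
    since=datetime(2024, 1, 9, 2, 47, tzinfo=timezone.utc),  # hypothetical date
    action="investigating correlation with deploy abc123",
)
print(msg)
```

The template leaves no slot for "I think it might be", which is the point.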

Other people start showing up. The senior on the team. Someone from the database group. Maybe someone from networking if the symptoms look like a network thing, even though it's almost never a network thing, but it's polite to check. They ask what they can do. You give them a thing. You do not let four engineers do the same diagnostic in parallel. You direct traffic. Without anyone calling you the incident commander, you are now the incident commander, and you will hand that off as soon as somebody more rested arrives.

The Part Nobody Sees

You restore service at 4:11 AM. Error rate is back to baseline. Latency is sulky but recovering. You write a short update in the channel. You set a calendar reminder for the postmortem. You make a note of three things you want to check in the morning: the alert that should have fired earlier, the runbook that lied to you, and the dashboard panel that was actively misleading because it averaged across regions.
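The averaging problem is easy to reproduce with arithmetic. In the sketch below, with entirely hypothetical region names and request counts, one small region is failing 40 percent of its requests while the fleet-wide panel shows about 4 percent, which looks like a bad night rather than a regional outage.

```python
# Hypothetical per-region traffic during the incident.
regions = {
    "region-a": {"requests": 9000, "errors": 9},
    "region-b": {"requests": 9000, "errors": 9},
    "region-c": {"requests": 2000, "errors": 800},
}

# What a panel that aggregates across regions would show.
total_requests = sum(r["requests"] for r in regions.values())
total_errors = sum(r["errors"] for r in regions.values())
fleet_rate = total_errors / total_requests

# What a per-region breakdown would show.
per_region = {name: r["errors"] / r["requests"] for name, r in regions.items()}
worst_region = max(per_region, key=per_region.get)

print(f"fleet-wide error rate: {fleet_rate:.1%}")
for name, rate in per_region.items():
    print(f"{name}: {rate:.1%}")
print(f"worst region: {worst_region}")
```

A panel that plots the worst region alongside the aggregate would have pointed at region-c immediately.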

Then you close the laptop and lie in bed for forty minutes, fully wired, unable to sleep, because your body is still flooded with the chemicals it produced to keep you sharp. You will be tired tomorrow in a way that coffee does not fix. Your standup will be quieter than usual. You will get a Slack message from someone in another timezone saying "saw the incident, nice work." You will react with a thumbs-up because you don't have the energy to type.

This is the part nobody outside the field understands. Not the technical part. The toll. The way an hour at 3AM costs you a day and a half of productivity. The way you start to flinch at certain notification sounds. The way you learn to dread Sundays because the Monday morning deploys are coming.

You do it anyway. You do it because somebody has to, and because there's a strange pride in being the person who can. Anyone can write a feature. Not everyone can stay coherent at 03:14 while production is on fire and four people in a Slack call are asking questions at the same time.

Q&A: Things People Ask About On-Call

How long does an incident actually take?

The median is probably twenty to forty minutes if you know what you're doing and the system is well-instrumented. The long tail is brutal. The worst ones go four hours. The very worst ones span shift changes.

Do you get paid extra for being on-call?

Depends on the company. Some pay a stipend per week of rotation. Some pay per page. Some pay nothing and call it "part of the job." How a company handles this tells you more about its engineering culture than any blog post the recruiting team has ever published.

What's the worst page you've ever gotten?

Everyone has their answer ready for this one. It's usually a story involving a cascading failure, a misleading dashboard, and a coworker who was technically asleep but you needed them anyway. Ask an SRE this question at a bar and budget twenty minutes.

Does it get easier?

Yes and no. The technical pattern-matching gets faster. The body never quite adjusts. You just get better at managing the cost.

The Quiet Hour After

The sun isn't up yet. The kitchen is dark. You make tea, because coffee at 4:30 is a mistake you have made before. The house is quiet in a way it never is during the day. The incident is closed. The dashboards are green. Somewhere, a few million users are loading a page and have no idea that an hour ago it almost didn't load at all, and that a stranger in soft pants made it work again before they ever noticed.

That's the job. Not the title on the business card, not the line in the org chart. The actual job. Someone has to be awake while the world sleeps, and tonight that someone was you.
