What I tell people new to on-call (ntietz.com)
88 points by zdw 3 months ago | hide | past | favorite | 86 comments



What I tell people new to on-call: ask for overtime, time off, or some form of compensation (and don't go along with the "it's part of your standard compensation" bullshit).

On-call is you agreeing to give your time, your weekends, your freedom to your work. It's beyond the standard 8-5. Don't trade that for free.


I always tell my guys to take a whole day if they are paged at night.


do you work with any women or 100% men?


Not OP, but I'm from the Midwest, and around here "guys" has always meant (and seemingly always will mean) a non-gender-specific group of people. It's a shame that folks see it as a microaggression these days, as we really aren't being offensive with it. It's just really, really hard to change the behavior when most of the Midwest (women included) are still perfectly fine referring to groups of people as guys.


If it makes you feel better, I have worked with lots of diverse groups and everyone understood guys as a gender neutral term. I think people are a lot meaner and nitpicky online than in real life


Sometimes girls get bothered about being called dude or bro.

For the former, I’m certainly not going to call you dudette. And it seems widely accepted that dude is gender neutral.

As for bro … I can understand why it bothers girls, but I always remember my close friend who says bro so much he frequently uses it even with his wife!!


also lived in the midwest my whole life, mix of rural and cities; this is false


I work with many women, but I don't manage any


I’ll go as far as saying we need legislation to cover this.

I do not like on call. I don’t actually mind the work but I don’t like the entitlement companies have towards it.


On the one hand, sure, I like nice things. On the other, there is legislation to cover this, but you're exempted if you're making typical software engineer salaries (this varies state to state I think, it's not federal, but is fairly common).

I feel like unionization would be a better route to this sort of benefit than trying to get a legislature to pass a bill for a group of people that are already very privileged.


This legislation already exists in other countries. For example Swiss engineers have very specific on call requirements, and their salaries aren’t really less than Americans.


My last company offered me a “promotion” that included more accountability, mentoring young staff and on-call with zero clarity of what that looked like. Zero compensation change.

Needless to say I declined and left soon after.


"This could destroy your marriage and your family." seems like a good intro.

I still remember getting paged when I was explicitly NOT on the pagerduty rotation at my tenth anniversary dinner with my wife. Ruined the whole day. And I wasn't making particularly good money.

If you aren't free to do what you want, you should be getting paid like you're at work.


Did you have to ACK that? I don’t mean to be dubious, but sometimes I worry that the nature of this discussion forum amplifies the issue because commenters are disproportionately likely to act on the work instead of “I didn’t have my pager device.”


Funny that most companies have no problem outsourcing roles, but for some reason many expect on-callers to be available outside business hours. They could easily hire someone in the opposite time zone.


> If you aren't free to do what you want, you should be getting paid like you're at work.

You should be paid 2x - 3x when you're working overtime.


If you work a job where having one of these incidents once a year or more is "normal" then the dev team needs to devote most of its time to fixing that, or you need to change employers.


Another way to say it: for most software jobs, customer-facing downtime is downstream of a development skill issue that can and should be fixed.

Some (many?) employers make this difficult, and you should try to leave them.


I think you mean organizational issue, not development skill issue. If shit is constantly hitting the fan, that is the org's fault, not the engineers'.


Yeah, probably.

What I mean to imply is that it is an issue that is naturally fixed by improved development, and that fixing does require development skill, but the organization can hamstring their developers to prevent them fixing the issue even if they could.


an incident only once a year is an absurd bar. I'm no fan of on call but ensuring that level of incident avoidance would force the company to move at glacial speeds, which is even worse over the long term than getting paged.

I think my sweet spot is somewhere between once a week and once a month, spread across the whole team.


an incident that requires immediate developer intervention, rather than waiting until tomorrow? It seems like you would have to go out of your way to create a system so fragile that this happened once a month


I worked at a telco that served a few tens of thousands of customers in a huge remote region.

There are so many systems held together with baling wire it was rare to go a day without a significant outage, usually multiple. Everyone who was remotely knowledgeable about tech was basically a firefighter.


I don't think this takes into account the reality of huge megacorps with tons of development teams situated globally who are constantly changing the codebase.

Incidents happen as code changes. Even once you fix it, the changing nature of the code can introduce more issues


I've never worked at a megacorp, but if megacorp employees believe that it is more acceptable for them to cause issues for customers than a 3-dev company, that really seems like a skill issue for the megacorp.

If it is unacceptable to cause that downtime, you write code that makes the downtime much less likely


I expect the scale here is not apples to apples. A three person team is often on a small product and downtime is often a catastrophe like truly broken for customers. Meanwhile a megacorp is often many many large products and downtime usually means a piece of one of them is degraded.

My random guess is that the "downtime" is fairly proportional to the scale difference with megas probably taking the edge.


Often it's from slippage between two teams' systems where a contract never existed. Often even the relationship causing the incident is unclear.


Good advice in this article, especially the bits about communicating every 15-30 minutes depending on severity. Comms are invaluable for timeline/postmortems.

Also, for secondary/shadow on-calls, you will need to remind the primary to loop you in, as they will be busy.

Try not to be on-call too often, but also try not to be on-call too little. You need exposure to the latest types of events happening and don't want to get rusty. Once every 1.5 months is a good balance for me.


Just a question for other people who have been in industry longer. I'm somewhat new, wondering if my company's oncall is "normal" or abusive

My current company has a rotating week of oncall. Happens every 2 months or so. Oncall gets paged first and is expected to be available 24/7. But if they escalate further, whichever dev or manager it gets escalated to is expected to be available 24/7

By 24/7, I mean, they don't tell you that you're allowed to sleep. They just fired a manager for not being willing to wake up in the middle of the night for pages

Edit: also a bunch of people on our team think it's normal to ping and ask for help on non oncall stuff outside of business hours (like 7 or 8)

Edit 2: I forgot to add, we are not paid anything extra for oncall (or any additional work time outside of business hours). It's salaried


Well, my data is over 10 years old, but I know this policy because I had to pay for it out of my budget at a former employer: 1) Off-hours pager duty rotated among the team. 2) If you were on the pager, you needed to be able to get to the server room, clean and sober, within 30 minutes, although remoting in to solve the problem was perfectly fine if you didn't need hands on the hardware. 3) If you were on the pager, you got paid 25% of your normal hourly the entire time you were on the pager because your personal life was limited -- fresh snow in Tahoe? It sucks that it is more than 30 minutes from the server room -- swell party? Make mine cranberry juice. 4) From the time the pager went off, until you cleared the problem, you got your regular hourly + overtime + applicable shift premium + applicable holiday premium.

My philosophy is that if an employer is infringing on your personal life, they need to compensate you for that.
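
The arithmetic of that policy can be sketched roughly as follows. This is only an illustration: the 25% standby rate comes from the comment above, while the overtime, shift, and holiday premium values, the `OnCallPolicy` and `on_call_pay` names, and the choice to pay the standby rate only for hours not spent on an incident are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OnCallPolicy:
    # From the comment: 25% of normal hourly while carrying the pager.
    standby_fraction: float = 0.25
    # Assumptions: the comment names these premiums but gives no numbers.
    overtime_premium: float = 0.5
    shift_premium: float = 0.0
    holiday_premium: float = 0.0


def on_call_pay(hourly_rate, pager_hours, incident_hours, policy=OnCallPolicy()):
    """Return extra pay for one on-call stint under the sketched policy."""
    if incident_hours > pager_hours:
        raise ValueError("incident hours cannot exceed hours on the pager")

    # Assumption: standby pay covers only the hours not spent on an incident.
    standby_hours = pager_hours - incident_hours
    standby_pay = hourly_rate * policy.standby_fraction * standby_hours

    # From page to resolution: regular hourly plus each applicable premium.
    incident_multiplier = (
        1.0
        + policy.overtime_premium
        + policy.shift_premium
        + policy.holiday_premium
    )
    incident_pay = hourly_rate * incident_multiplier * incident_hours

    return standby_pay + incident_pay


# Hypothetical numbers: $40/hour, 100 hours on the pager, 4 of them on an incident.
total = on_call_pay(40.0, pager_hours=100, incident_hours=4)
```

With these made-up numbers, 96 standby hours pay $10 each and 4 incident hours pay $60 each.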


Love this. As another humane anecdote I always insisted on call includes a manager so they feel the pain. This is so they remember to treat people with respect AND so they listen when the folks getting paged say something technical needs to change in order to increase stability.


It's expected that you will wake up for pages and figure out if the issue needs an escalation or if it can wait until the morning.

If the oncall is too heavy, you should have split oncalls for day and night.

> But if they escalate further, whichever dev or manager it gets escalated to is expected to be available 24/7

This doesn't sound right, since that effectively means everyone is always oncall. You should clarify what the policy actually entails.


It seems like it's vague on purpose and whenever I ask my manager about it he just sort of non answers (pretty sure he's afraid to ask the skip manager)


Abusive. Are you giving away your time for free too during these rotations? Bonus, overtime, equal time off?


I forgot to add they don't pay us anything extra for time outside of business hours. It's a salaried position


Free labor that you're handing away. Saw your edits, sounds like you've got to start carving boundaries in stone and letting colleagues know it's their loss and fault for not eyeing timezones / after hours.


thanks for checking the edits, personally it's probably going to be easier to pack my shit and go somewhere less bad


If there is something that you would like to do in your own time that you can’t do because of a work requirement, your work should be compensating you back in time or money. Can’t go drinking / see a movie / go travelling for the weekend? They should pay you for the inconvenience. Actively working on a production issue due to catching a call? They should be paying you a multiple of your hourly rate. If they’re not a charity, you shouldn’t be donating your time and energy.


Of course. Did anyone say otherwise? On-call is typically compensated for (with paid time off and/or money) on top of regular pay, with extras for any time spent on responses to incidents.


It's salaried unfortunately


Yes, this is abusive, as your compensation doesn't grow with your time working. Your salary is based on an x-hour work week. This system incentivizes poor software quality at the cost of your and your colleagues' time. My role and those of my direct colleagues are also salaried, with a similar expectation of some overtime without additional comp, but that should be a rare hour or two. I do know of and work with other companies that have on-call terms similar to yours. My employer does only 1 day at a time per engineer on a repeating 4-week schedule, but we also have a support call center taking care of the majority of calls and only escalating to on-call when necessary.


I’m salaried and I still get extra payments for being on-call, plus call out payments when I catch a call. Don’t work for free.


Yes. That is abusive.

Also generally if you’re ever asking if something is abusive, the answer is usually yes.


Anyone on-call or off that is woken in the middle of the night to attend to a page can take a paid 1/2 to 1 full day off (without recording it as PTO) the week after they're on-call. (amount depends on how long they were disrupted)

This is not official policy, but it's been the in-practice unwritten stance of every manager in every company I've worked at.


It's not far off my company on-call policies.

As a senior+ IC I am in a 6 week rotation as the escalation path for 6 teams. Any page goes to the team's on-call + whoever is on-call for escalation that week.

Every few months I will be paged when I am not on-call as a SME, and those responses are optional, but frowned on if you miss and don't have a good excuse.


I’ve often heard the advice for on-call to focus on triage and call in support for big problems.

But… doesn’t that mean that everybody is technically on call? There's the main person answering the pager, but if the expectation is that they can pull in reinforcements as needed, that means everyone should be ready to get pulled into action at all times.


If the expectation is that the on-call person should fix all the issues that arise during their shift, you either need a very well defined runbook, or can only have people on-call who have deep understanding of the whole system.

I guess that's a model. But every runbook I've seen has a clear call to escalate if the conditions don't seem to match.

Sometimes the runbook will have procedures to disable things until the business day, in which case you don't need to page anybody, but the service will be degraded until the responsible party can manage it. If the procedure doesn't work, someone will get paged.

IMHO, the most important part of a runbook is the escalation process. And probably the most important meta task of an on-call rotation is tracking escalations and ensuring they're dealt with.

Norms depend on your business, but if you get a lot of escalations outside of business hours, you either need to fix your stuff so it doesn't need escalation, or you need to staff your stuff so escalation is to people who are in their business hours.

Edit: I'll also add that reducing incident frequency is good, but when it drops from once a quarter to once a year, new hires won't get osmotic training anymore. When it drops from once a year to once every other year, team muscle memory will have atrophied. It's worth doing some periodic training/refreshing when things are running well.


Fully agreed with all this.

Also, if there's a bottleneck where an oncall needs to rely on a teammate with more experience with the subject matter, then make sure that's noted down in a retro. Hopefully an action item can be made up and completed where said person does some knowledge transfer, at least into a run book or other documentation.


Is it worth having pseudo-production services that get chaos-monkeyed into another dimension at a random time, with pager alerts on them?

Effectively: a drill!


I don't think chaos monkey works for incident drills. Anything the monkey can do is going to be easy to detect (probably).

You can do some amount of drills with periodic disaster recovery tests --- twice a year do a manual failover of a colo, etc.


Expectations should be lower as far as responsiveness or even availability, for someone who is not actually on call. The load (and expectation) is also not evenly distributed: IME senior and staff/principal-level engineers (and managers) tend to get paged in when off-call much more frequently, for obvious reasons. It's more likely to be "I need someone who knows XYZ", not "I need absolutely EVERYONE" https://www.youtube.com/watch?v=74BzSTQCl_c or "I need a random additional pair of competent hands".

Also, IME it's been relatively rare for issues outside of business hours to require calling in people who aren't really on call. I think the article is pointing out that it can be the right thing, not that it's necessarily a common scenario. And during business hours, being pulled away from your other work to help handle an incident is obviously a much easier pill to swallow.


There’s a difference between on-call being in your job description and occasionally responding to slack messages to help out during an incident off hours.


On call may page teammates for help, but they might be on airplanes or go camping or do other things that take them off the grid (primary and secondary must not). I would really hesitate before paging someone who's on vacation, but he would probably have his phone (not laptop).


Having to page someone on vacation is a very very broken organization.

Additionally, paging someone when they should be sleeping is also abusive.

If you need 24/7 coverage, pay for follow-the-sun.

Most of what we do isn't actually that important.


> Having to page someone on vacation is a very very broken organization.

I agree, I'd like to see enough written down that no outage ever has a bus number of one. But I haven't been seeing that anywhere. I've resorted to this one time ever, and the super senior founding teammate was very engaged and assured me that it was the right call.

> pay for follow-the-sun

This seems likely to create a huge team of devs who are seen as interchangeable, no longer paid amazingly well, and don't have enough to do every day.


> Additionally, paging someone when they should be sleeping is also abusive.

My current job does this, they expect you to respond to pages at 4am

You're telling me this isn't a thing other places?


Seems pretty normal from two Bay Area startups and two FAANG-sized orgs. Primary should respond, secondary shouldn't be disturbed unless primary seems incapacitated (no pager ack) or is at wits' end.
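
As a rough sketch of that rule: the secondary is paged only when the primary has not acknowledged within some window or has explicitly asked for help. The 15-minute timeout and the `next_page_target` name are assumptions made for illustration, not something the comment specifies.

```python
from typing import Optional


def next_page_target(
    primary_acked: bool,
    minutes_since_page: float,
    primary_asked_for_help: bool = False,
    ack_timeout_minutes: float = 15.0,  # assumption: no timeout is given above
) -> Optional[str]:
    """Decide who, if anyone, should be paged next for an open incident."""
    # Primary is at wits' end: bring in the secondary.
    if primary_asked_for_help:
        return "secondary"

    if not primary_acked:
        # No ack after the timeout: treat the primary as incapacitated.
        if minutes_since_page >= ack_timeout_minutes:
            return "secondary"
        # Still inside the ack window: keep paging the primary only.
        return "primary"

    # Primary acked and has not asked for help: nobody else is disturbed.
    return None
```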

Edit: I should add that the secondary gets paged more often while the primary is new to the team and doesn't know how to fix everything. In return, you go on call 1/n less often in the future.

If I need to sleep in after a bad night, it's always been fine.


Depends on the size of the team. Startup or small team? Yes. Everyone is on call all the time. Large number of developers? Someone on every team is on call all the time, and leads need to be almost always available for large outages.

On call pretty much just comes with the job, and always has.


I suppose for the vast majority of software engineers working on online / SaaS type products or ones that silo a lot of customer data, this is true.

Always has is a bold assertion. I've worked for companies which produced consumer level software on an annual cycle that was pressed to physical CDs, and there was not even a concept of on-call. Bugs that got reported went from customer support, to QC to corroborate, and finally triaged out to the R&D department where they would be fixed within normal work hours.

This idea of 100% 24/7 on-call to fight fires in an industry where the vast majority of engineers are working for insurance companies, social media, e-commerce, etc. This ain't life and death people, let's get some perspective.


> produced consumer level software on an annual cycle

This can also be generalized "produced software on a release schedule".

I would assume that the vast majority of software engineers are not working on supporting the operation of online/SaaS services, but rather develop products.


> On call pretty much just comes with the job, and always has.

Maybe for you but not for everyone and I bet outside Silicon Valley startup land and certain industries it is probably less common than you think. I work in government which is basically 8-5 local business hours. Production issues can take days, weeks, months to fix and deploy depending on priorities. Most of my dev friends have never had on call roles either. Plenty of companies have enough staff to have around the clock coverage. Just trying to add an additional perspective.


> On call pretty much just comes with the job, and always has.

If you don't remember the invention of "devops" that's especially true . . .


Those who do not know history are doomed to repeat it, or words suchlike (from Santayana, I guess?).

I don't know if I'll ever see things like devops and agile die the horrible deaths that they deserve - but I do wish engineers would at least learn to think for themselves and not drink so freely of the kool-aid that CEOs peddle.


Sad part is that devops was never meant to be a title, just a way to work together effectively as a team that included developers, qa, ops, pm, etc. Devops was much like agile, they were great ideas and ways to work, but then got cargo culted to death and today managers have taken them as buzzwords and thrown away all the stuff you actually needed to do to get good results.

Management always takes good ideas and extracts the absolute worst stuff from them, if they don’t just make up shit on the fly that wasn’t even a part of the original good ideas.


Yes indeed. Management almost always bastardizes good ideas and makes them terrible; and then they take it a notch further by finding and nurturing kool-aid connoisseurs in the levels below.

(Edit: grammar)


IME triage should mean they can stabilize things long enough no one else needs to be woken up. Ideally they could address further during normal business hours.

Reinforcements may get pulled off planned work, but only as a last resort, and only during business hours. Unless the situation would kill the business and the triage isn't enough.

Strategies like automated disaster recovery processes (yet with manual initiation), coupled with rotating who walks the DR plan during the periodic practice, can mitigate the absolute worst case scenario.


There used to be a separate job for this and they dumped it on startup engineers (otherwise failing businesses) and now they dump it on all engineers.


On call is kind of like open office layouts. It was created to reduce costs for companies, at the expense of employees.

Whatever happened to hiring people on schedules that cover the 24/7 shift?

Manufacturing companies don't have "on calls" they have 12/12 shifts, or 3x4 shifts.


I did on-call in the 90's. I wouldn't do it today on a regular basis, I need my sleep more than I need any job.

I was a DBA back then. The author of the post talks about calling for "back-up"; back then they would page me (still just pagers then) and say the database was down. Most of the time the database wasn't down, and there was no real evidence the database was down, they were just calling for "back-up".

I've worked at many jobs since then, including at startups, never did an on-call rotation at any of them since the DBA job. If it's important, you should probably have fully awake people scheduled to deal with it.

To me that's different than supporting a release of code that you wrote. I mean if we really have to do this release at 9 PM and I'm the guy who wrote it, I'll probably show up, after I slept all day. But I'm no longer your database buddy that you commiserate with during the night when everything goes to shit since no one else will answer the pager.


Pager Duty gave me insomnia, it's not worth it. Tell your organization they can hire someone to work that shift.


Relatively new to on-call (10 months) and it has ruined my sleep quite a few times, plus a bit of fear during my weeks (every other).


I really disagree with "involve other people."

This implies that everyone is effectively on call 24/7.

You have a primary and secondary. No one else should be paged unless they’re on the rotation at that time.


This is probably dependent on the size of your org... yes, we have a primary and secondary, but that is for each team... my current midsize company has 10-15 teams that each have their own on call schedule. So the escalation is like "ok, we need the network team for this one, please page the on call for net"

It isn't about calling in everyone on the same team.


Yeah that is fair. And I’d put that under paging someone who is already on another on call shift.

What I meant was like, paging a subject matter expert on your own team.


On call means different things for different companies. We used to page for non-emergencies. But we eventually changed it to page for actual service outages or core metrics shitting the bed. If one of those two aren't happening, it waits until morning or Monday.
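
That paging rule is simple enough to write down. A minimal sketch, assuming a hypothetical `Alert` record with two flags for the conditions named above; the names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    service_outage: bool = False        # an actual service outage
    core_metric_breached: bool = False  # a core metric falling over


def route(alert: Alert) -> str:
    """Page only for outages or core-metric failures; everything else waits."""
    if alert.service_outage or alert.core_metric_breached:
        return "page"
    # Non-emergencies wait in a queue until morning or Monday.
    return "queue"
```

For example, `route(Alert("checkout down", service_outage=True))` pages someone, while `route(Alert("disk at 70%"))` waits for business hours.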

Or, maybe, if you're large enough... hire night shift people. I have friends who cut their teeth on night shift ops.

I have friends who work more on the sysadmin side of things, and on call for them just seems like extra work. They're glued to their laptops answering requests.


This post is great timing for me. Lots of good, if somewhat obvious, advice.

I'm considering picking up on-call duties in my new role. In my last company they expected us to do on-call as part of the job but only during what they defined as "working hours" which didn't fit with my schedule. That was one of the reasons I left that particular role. But here they give an 18% salary boost for the on-call duties, and I love debugging hard production problems which is a huge plus.


It was a dreaded feature of one job I had way back when, supporting a server for static trading data (counterparty info, and other stuff) for sites in London, Hong Kong and New York. We all hated it, until one day one of the guys "lost" the support laptop on the Tube. We then did a bit of scripting and fiddling with permissions so that the guys in HK and NY could fix all common problems by "turn it off and turn it on again" magic.

Bye-bye support trauma.


If a company wants systems up 24/7 they should hire three shifts of people to support it.

Not willing to pay for three shifts? Shut the system off.


This is a pretty good article.

Similarly to the "Heroism isn't…" section, I'd say: Breathe. I've been asked "how do you stay so calm when something is going wrong?" and the honest truth is I'm scared! Or at least, I have that pit, in my stomach, going "oh no it's not working, will we figure this one out?" It's just not a useful thing. Tell that fear to take a backseat, and attempt to let the more logical side of you problem solve. And like TFA says, call for help if you need it; two minds are better than one.

At the management level, you can also do a sort of corollary to basically everything in TFA too.

"Call for help": your engineers need to be able to call for help. That means retaining experience, so that the younger engineers can learn from the older ones, and hopefully not trial-by-fire their entire career, and have someone they can fall back on for help. Same goes for the experienced devs, too: it means you need two experienced devs. I've been the only experienced person on the team, and it sucks, because I don't have the answer to everything.

"It is your job to see that issues get addressed": at the management layer, you need to make sure the incentives are focused on that, not something inane, like "mean time to resolution". Time to "the incident in PagerDuty is closed" is meaningless, and will be gamed to something like "we closed the incident because the immediate instance of the problem / symptom has been dealt with". You want the actual, underlying root cause debugged and fixed, and ideally, that eng should never see that entire class of problem again. But this means understanding the root cause, and understanding the system well enough to see the problem through to conclusion, which often means things like "ok, this needs to be fixed, and I need to prioritize someone familiar with that portion of the system to fix it". All too often, that follow-through just doesn't happen. And when it doesn't, your eng pays for it, in the form of getting woken up.

"Don't sacrifice your health": are your eng sacrificing their health? Is your on-call rotation too frequent? (At the lowest point, I've been oncall 100% of the time. That was too often!)


What I tell people new to on-call: "Quit. Find another job."

I won't accept jobs with on-call rotations anymore.


Worked for an eCom store for 3 years, was on-call 24/7 for most of it due to understaffing. In this context, every second of downtime was actual money being lost. CEO drilled in the gravity of each outage.

Took me a good few months after changing jobs to not get crazy anxiety every time my phone rang.

After working in a much healthier on-call setup later in my career supporting a large SaaS, I actually really like it. High stakes produce quick learnings.

Not for everyone, but everyone should try it (and be compensated FFS).


That sucks: CEO knew the gravity of money being lost but couldn't hire 3 people to cover a 12/12 shift?


Were you at least compensated for those on-call shifts?


The handwritten text is not legible


I can read it fine?


Agreed, the "XKCD imitation" style doesn't really work when it's one step above writing a paper with MS Paint.

They'd have been better off using a hand-written comic style font like Blambot.



