Side note -- I know a lot of early YC startups like to play things fast and loose, but you really should compensate your engineers for after-hours emergencies if they are already working 40-hour weeks. Morally and employee-retention-wise it's the obvious thing to do, but beyond that, in certain states and jurisdictions you can easily run afoul of local labor laws if you try to require employees to do things outside of their regular hours without compensation. Another very important angle to consider is that you want to incentivize your employees to answer the call to deal with these issues. If there is no incentive structure in place, they might not take it seriously, which in itself can be a security and stability risk that could threaten things like your SOC 2 compliance, if you have it.
When I worked as a full-time DoD scientist, there was a well-established system for dealing with these situations, and it was normal to pay double overtime for any hours an employee spent on after-hours emergencies. This is the right way to do it. Pre-Series B startups largely don't do this, but once you get to Series B and C it suddenly becomes a thing, because companies realize they either have to legally, or have to in order to prevent their employees from churning and to protect themselves from people not showing up to put out the fire.
Just do it, and do it early. Do it before Series A. That's the advice I give my consulting clients, and the approach I take with my own companies. By compensating your employees for this time you also take what would be a red flag for many would-be employees and turn it into an exciting perk.
Yes, and even more importantly: compensate your engineers who minimize the need for these kinds of heroics. Ones who:
* fix the alerts so they reliably page when there's an SLO-worthy problem, and only then.
* test the restore system so it works smoothly when needed at the necessary scale.
* add safety checks to prevent the need to use those backups in the first place.
* get to the root cause of yesterday's outage and prioritize the 9–5 engineering work to ensure it won't happen again.
It's awful to work for a place where heroics are often necessary and unrewarded. I still don't like working in a place where heroics are often necessary, even if they're celebrated. They're often avoidable.
The trick is to make sure the heroic efforts are rewarded quietly, but the unheroic ones create lots of noise.
This is inherently difficult because the heroic efforts tend to come as a result of very noisy incidents - everyone already knows the database cluster was literally on fire and wants to hear how Jane put it out while simultaneously remagnetizing the backup drives. The lead's job is to make the noise ensuring everyone knows Morgan's the one who did the boring rewiring of the backup-restore process to automatically sync and fail over to another AZ when thermal metrics start trending bad.
One thing I've noticed as we've been trying to integrate some on-call processes across formerly separate teams is that "our" incident reports were about 25% timeline/responses/impact analysis and 75% RCA and plans for future mitigations. "Theirs" were the opposite. Our incident rate has trended down sharply over the past two years even as our system grew; theirs has scaled up roughly at the same rate as their service count.
Avoidance isn't the goal of management, though. Cost minimization is, and quantifying the compensation portion helps you figure out the cost of a bug. Ultimately the cost of a bug is a key variable in the decision to prevent or solve.
The cost can easily vary from $0 (e.g. at a startup with no customers) to millions (maybe billions?) when you consider the risk of brand damage, lost ARR (multiply that ARR by the sales multiple!), decremented velocity (lost opportunity value), and eventually even recruiting cost (to replace frustrated employees).
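To make that concrete, here's a rough back-of-the-envelope sketch (in Python, with entirely made-up numbers) of how you might add those components up:

    # A rough "cost of a bug" estimate; every number is hypothetical.
    def bug_cost(lost_arr=0.0, sales_multiple=8.0, overtime_comp=0.0,
                 lost_velocity=0.0, recruiting_cost=0.0, brand_damage=0.0):
        # Multiplying churned ARR by a sales multiple approximates the hit to
        # enterprise value rather than just one year of lost revenue.
        return (lost_arr * sales_multiple   # valuation impact of churned ARR
                + overtime_comp             # direct payout for the 2 AM heroics
                + lost_velocity             # opportunity cost of diverted eng time
                + recruiting_cost           # replacing engineers burned out by pages
                + brand_damage)             # hand-wavy, but rarely zero

    # A bug that churns $50k ARR and costs $2k in on-call comp plus ~$10k of
    # diverted engineering time is already a ~$412k decision.
    print(bug_cost(lost_arr=50_000, overtime_comp=2_000, lost_velocity=10_000))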
I've always wanted to work at a JoelTest[1] company which fixes bugs before writing new code... But the closest I've found is a commitment that the on-call engineer works on bugs for their rotation to reduce the bug backlog (and that work is not counted as part of sprint velocity).
Management's goals are easier to achieve when they can employ excellent engineers. As I explained, I don't want to work somewhere heroics are often needed. I'm not alone. That's a factor they need to consider, in addition to the customer impact of the outages themselves.
What I find interesting is that if you give employees a choice of 100 units of currency per year as salary plus 1 unit of currency for each week on call with an expectation of 1 week on call every four weeks versus 115 units of currency for the year with the same on-call expectations, some employees will feel better about the first arrangement and some will feel better about the second arrangement.
Some people want to see everything broken out and take a very transactional approach to work; others prefer the simpler approach but with less clear linkage to a specific piece of work.
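For concreteness, the arithmetic works out something like this (the units and the 1-in-4 expectation are just the illustrative figures from above, not a real pay scale):

    salary_itemized = 100        # base units per year
    per_oncall_week = 1          # extra unit for each week on call
    expected_weeks = 52 / 4      # "one week in four" -> 13 weeks per year

    itemized_total = salary_itemized + per_oncall_week * expected_weeks
    flat_total = 115             # the all-inclusive alternative

    print(f"itemized: {itemized_total} units")  # 113.0 if expectations hold
    print(f"flat:     {flat_total} units")      # 115 regardless of actual weeks

    # The itemized deal only overtakes the flat one past
    # (115 - 100) / 1 = 15 on-call weeks in a year.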
The problem is that once you consider it part of your standard comp, you take it for granted. If emergencies almost never happen and then one does, you will inevitably feel cheated when you have to wake up at 2 AM on a Sunday. Much better familial optics if you can tell your family "hey, we got an extra $500 for that" versus "oh, dealing with these is included in my take-home salary".
Even worse, if emergencies happen all the time, your family is going to hate you if you don't have some reward you can show for each one, especially if there are other jobs at your salary range that don't have an on-call clause.
I'd say the issue for people who are inclined to pick the first arrangement is one of trust.
If you don't trust your employer not to abuse the second arrangement then you'd be a fool to accept it. With the first arrangement the employer is incentivised not to exceed the agreed upon expected number of on-call weeks because they are financially penalised for additional on-call time. With the second arrangement the employer can demand as much on-call time as they think they can get away with.
It's really a case of whether you expect your relationship with your employer to be in some way adversarial - if you trust your employer not to screw you, then go ahead, but you can't necessarily reliably make that judgement going into a new job.
I am firmly in the camp that would pick the first arrangement, but not because of (lack of) trust. It is partially due to psychological effect on me (get out of bed, you are compensated for this) and partially because the incentives of the employer are better aligned with mine. They should view 24/7 operation as something that costs money, so that they are more willing to invest in robust solutions.
What do you mean by "expectation"? One day on call, one day paid overtime; anything else is abusive.
There should be no incentive to require or avoid "free" work, only advance planning of who'll be on call (with more available engineers doing more turns).
I do. It means that time is not mine to do with as I like. I cannot spontaneously go out of town, nor can I decide to have a few drinks. I cannot start a project that needs to see completion - this includes even involved dinners, hobby projects, and sometimes home repairs and car maintenance. If it is at night, I have to sacrifice sleep quality to be able to hear the phone.
In other words, they are working hours with some degree of personal freedom during the day. The assignment is to be ready in case things go wrong.
But e.g. for defense contract stuff, I've accepted 1/4 of my normal pay rate to be available for a span of hours (and then 150%, with a 2-hour minimum, to actually do anything).
Yes, I can't drink, or become unreachable, and need to be able to get to a computer within a few minutes. But I can read a book, watch TV, work on hobby projects, etc.
Said differently, if the employer is going to pay you overtime rates for the period, why wouldn't they just have you actually work overtime, prioritizing a production outage over regular development, but regular development over watching TV?
If you're having to actively shape your real life to accommodate your work life then you should be compensated (Not attending events / bringing your computer with you to dinner, etc). Do firefighters only get paid for the amount of time actively fighting fires?
Again, I didn't say that you shouldn't be compensated-- just that you shouldn't be compensated at the rate you'd get if they had you actively doing work the entire time (else, why not just have you actively do work the whole time?).
> Do firefighters only get paid for the amount of time actively fighting fires?
I believe firefighters tend to do 24 hour shifts, where they actively work 8 hours and are paid full rate for that (during which time they have responsibilities in maintenance, etc)...
...and are paid a reduced rate for the other 16 hours they are on standby. (Here, in my jurisdiction, it's 10 hours + 14 hours). During this time they hang out at the firehouse, but may be eating, watching TV, sleeping, etc.
Many orgs do comp their employees for the entire time they are on call. DoD does this for on-call employees due to past labor law issues, so it probably is legally required but rarely enforced.
I would advocate for 100% but 10% is way better than 0%. I didn't have any on-call peers during my time in DoD so I'm unfamiliar with the structure for that arrangement, but very familiar with regular overtime stuff with DoD.
10% of time-and-a-half (15% of normal) seems reasonable for time that you are expected to be able to do your own thing but stay available. I negotiated 25% of normal pay for similar things.
24 hours of call at 15%, where they never call you, becomes a half day of pay. If you get a call, whether brief or using the whole 2 hour chunk, you're basically paid for a full normal day. This all feels just and reasonable to both sides.
100% of overtime rate doesn't make sense to me because then they're paying just as much as they'd pay to have you actively work the entire time.
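A quick sketch of how that arithmetic shakes out, assuming an 8-hour day and the 15% standby / 150% callout terms mentioned in this thread (illustrative, not any particular contract):

    hourly = 1.0                  # normalize: 1.0 = the normal hourly rate
    standby_rate = 0.15 * hourly  # 10% of time-and-a-half = 15% of normal
    callout_rate = 1.5 * hourly   # overtime rate when actually called
    callout_min_hours = 2

    quiet_24h = 24 * standby_rate                             # 3.6 hours of pay, ~half a day
    one_call = quiet_24h + callout_min_hours * callout_rate   # 3.6 + 3.0 = 6.6 hours

    print(f"24h on call, no pages: {quiet_24h:.1f} hours of normal pay")
    print(f"24h on call, one page: {one_call:.1f} hours of normal pay (close to a full day)")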
Another side note: a 4-day work week for engineering is a great, productivity-boosting practice, but it shouldn't be seen as carte blanche to not compensate them for after-hours emergencies. If you have a 4-day work week, then 32 hours a week is their normal full-time number of hours that you are compensating them for. When you ask them to go beyond this, you need to compensate them for their time.
Unless you're in a really dysfunctional startup, the employees who voluntarily jump in after-hours to fix OC issues are going to be very quickly compensated with raises and refresher option grants. Far better payout generally than a few bucks of overtime.
It becomes really obvious who is holding the company together and who is coasting, relying on the "senior" (ie, anyone who puts in the effort) engineers to keep the ship afloat, and in a highly competitive engineering market, that will definitely be reflected in equity comp.
I respectfully disagree. I did exactly this for a number of years. A colleague (now a friend, since we've both left said company) made it clear he would not work overtime without compensation (and never did), but his day-to-day work was excellent. He knew he had a skill they required, and was good at it (and our skill sets were the same).
Our careers both grew at about the same pace for four years. He became known as the go-to guy for green-field projects, while I was the guy you could put onto over-budget tight-deadline projects to rescue them. I worked a lot of weekends for my troubles, and I always envied how he ended up on the "fun" projects. I learnt a lot from him. As a professional, some self respect is required, or you will be abused.
The problem with informal "people will notice" reward schemes, even when administered well (and don't take that for granted, it's easy for them to become popularity contests!), is that they encourage bad work-life balance among more junior employees. If you regularly have people popping up at 10 PM to fix things, and you don't have any formal recognition of the fact that they've gone above and beyond, new hires who want to get ahead will learn that working late into the night is the way to do that.
Going above and beyond what your fellow humans are willing to do is a pretty tried-and-true way to get ahead, whether you’re working for yourself or as an employee.
Visible to who? Certainly the request to work outside of their normal schedule, and the report that an issue was resolved are seen by more than just the person responding to the request?
2. There are few pages, preferably the median should be 0 per week.
3. Spurious / non-actionable alerts get fixed right away (with very high priority)
4. You're not up more than 1 week per 1-1.5 month.
5. You subtract middle of the night pages from your next working day, with bad nights resulting in a day off. Being on-call doesn't mean working overtime.
As with most things, the core idea is not bad, it's the execution that matters.
Being on-call while you do not get called upon is 24/7 work because you have to live your entire life around being available.
Like the blog post mentions, grocery shopping has implications because if you happen to have an on-call event while shopping it means leaving your cart to run to your car to address the situation because you can't leave your house without your work laptop.
It means never going to the beach alone on a Saturday because while you could bring your laptop with you, if you go swimming and there's an on-call event you can't address it because you have no way to get notified while you're swimming in the ocean.
It also means going to the movies with an expectation that if you get called 15 minutes into the movie you're leaving. Likewise, if you're mid-date and get called you're out of luck.
It means never enjoying being able to walk around while being disconnected from the world. It means if you're at your mom's funeral giving a eulogy you leave mid-speech to address PagerDuty.
Then there's knowing at any given second your phone can notify you of an event and you have to put the volume at maximum and place it right next to your face every night with an expectation that you could be woken up at any second.
> Then there's knowing at any given second your phone can notify you of an event and you have to put the volume at maximum and place it right next to your face every night with an expectation that you could be woken up at any second.
And just to pile on that even more: also dreading that you might not wake up if the page/call comes during the heaviest hours of your sleep.
It's happened to me. I felt awfully guilty for not waking up after a gruelling work day and some pages early in the evening. I was tired and had been asleep for around 2 hours, didn't wake up, and the escalation policy took it up to my manager... I didn't get reprimanded or have any bad consequence from it, still the guilt made me feel like a failure and increased my anxiety when I'm on-call.
As someone who is second/third-tier on-call for most of the year: You know what's worse than getting paged when first-tier doesn't handle it? Not getting paged.
The only person I ever had to formally reprimand for on-call policy wasn't the one who failed to answer on time, it was the person who, three times, acked and didn't say or do anything else only because they didn't know how to fix it. In all cases I ended up getting pinged by someone else (head of our support team, or a colleague working an odd schedule), after the situation significantly worsened.
Higher tiers are there for a reason; don't feel bad if you've used them, only if you're expecting to use them.
(On the other hand, feeling guilty for making a mistake is also human, and to some degree something that makes me want such a person on my on-call team. People who feel better when they do a better job will do a better job!)
I would humbly suggest that if this person didn’t do anything, either they needed to be better informed of expectations or that your culture needs to change so that they don’t feel the need to ack without actually doing anything.
>still the guilt made me feel like a failure and increased my anxiety when I'm on-call.
I don't know if this will help with your mindset at all, but as someone who's been listed as the secondary if the first responder doesn't answer: I really wouldn't beat yourself up over it. Self-correction is fine, but try not to let it drag you down too far. We have the same worries too. Our worries are usually "Crud. If I don't pick up, then it's either going to (big boss) or the customer is going to be firing off nasty emails in the morning and dragging me into some stupid meeting." The cause of why we're being paged doesn't really come into play. At most, I might just send a text or whatever to the first responder, to make sure that they're okay. "What if they aren't responding because they were in a car accident?"
If you have a good manager - and it sounds like you do, since you weren't reprimanded for performing a fundamental human function - then they are probably doing their best to look out for you and your interests. When people talk about "not being the boss" or "working as a team" or whatever, generally what they're trying to get at is that "They have your back". They've been in your position and have probably felt that same crush of emotion and worry that you might deal with. They are meant to be the final filter, in a hierarchy of filters, to protect you from those outside elements which ruin your ability to work effectively.
If you put in a sincere effort at your job, you have nothing to feel guilty over. That might be easier said than done, but it really is true. It just might take some time and practice to be able to forgive yourself when you make a mistake.
And if you feel like you could be doing a better job at this, or at that, then you just need to permit yourself the time to improve on that thing, and to remind yourself that you are allowed to mess up from time to time. It's about how you learn from those things.
+1. I am a deep sleeper and I've missed pages at times that got escalated to the secondary. Nothing major, just something that autoresolved. But the guilt I felt was… incredibly stressful. Because I felt I was letting down people I was familiar with.
You're using "never" everywhere here. That is in my opinion the main red flag here.
On-call should be at most one week in four to six. Moreover, with a healthy on-call culture (where stuff is fixed, and alerts happen rarely in practice), usually you can pass/swap on-call to others for an evening, or for an afternoon, or for a weekend, as almost always there is somebody whose plan is "sitting at home" and nobody minds having the pager in such circumstances if it almost never pages outside of working hours.
> almost always there is somebody whose plan is "sitting at home" and nobody minds having the pager in such circumstances
I wonder if this is an American attitude that exists primarily because we (collectively) have allowed our employers to demand this of us?
Personally, I hate being on call. Most weekends, I spend at least 6 hours cycling. Sometimes significantly more. I go camping regularly. I go kayaking regularly. If I don't spend time outdoors, my mental health declines (really, ask my wife, she kicks me out if I sit around the house too long because I turn into a cranky butthead). Most evenings after work, I run/cycle/hike. I have to run errands. Walk the dog (usually 2 miles). This is all part of my normal "not doing anything special" time at home. None of which is easily doable if I'm on call.
Fortunately, I've managed to build a career in a place where on-call rarely exists.
I believe the point was more that if you have a team of, say, ten people, there's always someone who is not busy on some given night, and there's some reasonable trade that can be worked out, the more so if on-call risk is considered low by the team. Obviously as you scale down that becomes less true.
IMHO the whole notion of "week of on-call" is ridiculous. Being on-call ready is a shift of (hopefully low-intensity) work time. You can do 8 hour shifts, 12 hour shifts, 24 hour shifts, but you can't do 168 hour shifts for monitoring something, anything - that's impractical and should be illegal.
There are (or should be) some minimum standards of rest time that every human being must get, and being on-call is not it, every employee must get an opportunity to fully disconnect for non-trivial time during every single week. You do your shift, and then get at least 12 hours off (with your phone off) before being available for work or work calls again.
On-call is meant to be for rare emergencies. If you're getting paged outside of work hours on each 1 week shift then something is wrong. It should in no way be equivalent to 168 hours of work.
If there is an obligation to respond within a certain time, and an obligation to avoid certain activities (whatever you imagine that makes you unable to respond), then those are work hours even if nobody calls. It's exactly like a firefighter shift or call center shift where there happen to be no incoming calls, or a night shift at a remote gas station where no customers come during that night.
And if you're getting paged outside of work hours, then there's zero obligation to answer your phone.
There's no middle ground. If the employer says that these aren't working hours, they have zero right to ask what you're doing at these hours, much less put any conditions on it; it's your right to spend that time fishing at a remote lake with no cell phone service, or on a date, or whatever, and not even explain anything, without any reprimand when you arrive at your scheduled start of work time. And if they want you to commit to a shift, well, "a shift" where the employer tells you what to do (e.g. do not go fishing at a remote lake, and do not sleep for 12 hours with your phone off) is by definition work hours.
It is quite plausible that most of the times most employees will pick up the phone and solve reasonable issues, but that's an extra courtesy from the employee, going beyond what you can demand or expect; but the moment you start to require that, or ask an employee to precommit that they will definitely be monitoring their phone for rapid response, that means you're effectively assigning those hours as work hours.
It's understandable that being on-call can be very light work in many cases (not all - quite a few counterexamples in this discussion), so you can agree on different compensation for them, but those definitely are work hours (they definitely aren't non-work hours, and there is no middle ground) and thus any rules on length of shifts and rest between shifts can and should apply also for on-call hours.
If someone works an on-call rotation and was told about it during the interview process then I see no issue with it. It's baked into the agreed-upon compensation already.
I don't see how you can declare what an employer/employee are allowed to agree to. Working an on-call is completely reasonable for a salaried employee. If the terms changed after you were hired and you're bound by a contract (and cannot leave without penalty for a certain timeframe) then things would be very different. But the majority of the HN audience is in the U.S. where employment is generally at will.
> I don't see how you can declare what an employer/employee are allowed to agree to.
Okay, I am coming from a non-US perspective where it's obvious that you can declare what employer/employee are allowed to agree to - "employee rights" means those things which are unalienable and nonnegotiable.
I'll simply quote the Universal Declaration of Human Rights: "24. Everyone has the right to rest and leisure, including reasonable limitation of working hours and periodic holidays with pay." Not only do countries have the right to intervene in employer/employee contracts, they have a duty to do so. If local employment law permits companies to require salaried employees to work 168 hours a week, then that law is literally enabling violation of human rights and should be changed.
And that's great in theory, yet incredibly vague. I doubt it's targeting HCI that are predominantly on-call: medical professionals, infrastructure engineers, programmers, etc. I picture it as critical of 996 working hours and similar.
In any case, I suppose we have different expectations of what's reasonable. If you think a handful of pages per year from my consensual employer is infringing on my fundamental human rights, then there's not much I can say to convince you otherwise.
I'm looking at this exactly from the perspective of jobs like medical professionals and infrastructure engineers - I have worked in power infrastructure and have relatives in medicine - where locally none of them are on-call (because the law prescribes this clear boundary); they do 12-18-24 hour shifts of work time, and once the shift ends, they can be (and often are) unreachable. It's not vague, it's extremely clear and simple. Even in a low-volume alert-monitoring position for infrastructure where you might get a few incidents per week (so zero incidents on a median shift), every shift is work time, and scheduled appropriately.
It's not about the number of pages you get, but about the differing expectations on boundaries between your employer and your private time. I do have a strong expectation that employers should not be able to set any restrictions on what people will do in their non-work time, or even ask whether the employee will be out-of-service this weekend; in my opinion that is crossing a line.
As I said, it's entirely reasonable if you get a handful of pages per year and service them - I also have been in roles where I got a handful of night calls per year and happily did what was useful, and I'd expect that in most cases that's okay for most people. However, if it becomes a requirement where I or you must be "on alert" for a whole week, then IMHO that's not reasonable anymore; I answered those calls but I had no duty to be ready for them and abstain from activities that make me unreachable, and if I had a contractual obligation to do so that would be unreasonable (and also a void, unenforceable clause violating employment law). And if you did not volunteer for it but an employer "pushed" this unreasonable requirement on you (e.g. as a condition for employment) then I would actually say that this violated your human rights even if you got zero pages per year; employees should have the practical right to freely decide how they spend their non-work time during the week (and they should have appropriate non-work time) no matter what contracts they sign.
> where locally none of them are on-call (because the law prescribes this clear boundary); they do 12-18-24 hour shifts of work time
24 hour shifts? Sounds like a real worker's paradise. I'll keep my 7-8 hour days with bi-monthly on-call over that, thanks. I'm glad you aren't legislating where I live; I am much happier with the choices and trade-offs in my life than what you or any euro-crat could dream up to "protect" me.
> if you did not volunteer for it but an employer "pushed" this unreasonable requirement on you (e.g. as a condition for employment)
You have a very different concept of what voluntary and self-agency mean than I do.
The company having that much influence over your day-to-day life during non-business hours every fourth week is an enormous burden. I would want at least 50% higher overall total annual comp to even consider that schedule, and even then it's still only a maybe.
Having reasonable response times makes a huge difference in being on-call.
In my experience, most employers seem to think responding within 30 minutes or so is doable, and I tend to agree with them.
It usually makes it possible for you to just do things around the house and in your life without having to rush back home when you run into an on-call issue.
In our case it's one hour, essentially based on "what if my home internet doesn't work and I have to get to the actual office to start responding, how long will it take me to get there from anywhere in the city." You don't go on vacation or plan your wedding, but otherwise your day is pretty normal.
I've definitely seen on-call compensation implemented differently than stated in the collective agreement. And not in the positive direction. Thankfully it wasn't mandatory, so it was a very definite possibility to not opt in.
Also, only be on-call for things you're actually responsible for. I hate having on-calls span multiple teams. They can then be lazy, as their issues hurt someone else. And there's the added stress of having to debug and fix shit you're not comfortable touching.
Additionally, I'll never accept on-call with 15 minutes from alert to being on a computer again. It's just too limiting and disruptive of my life. Have to bring the computer everywhere. Any dinner or social event can be instantly ruined. A workout becomes meaningless. I remember doing a swim and having to check my phone every 5 minutes. It's mentally exhausting and frankly not worth the pay.
If it's in the contract you sign, then it's already priced in. I don't see much value in specifically outlining which part is base and which is for oncall, if oncall is mandatory.
Because if you are supposed to be prepared for sudden work outside your normal work hours, you should be compensated outside of your standard pay. Period.
And sure, it is priced in because the companies can get away with it. This doesn't mean it is right or fair as the price is always going to be in the company's favor and rarely fair to the employee. For example: My brother worked for a US railroad. He didn't have a set schedule. Instead, he got 10 hours rest after a shift and then he was on call. They only closed down on Christmas and New Year's. You were expected to do this on-call work perpetually. The money was good for the area as were the benefits. They advertised in depressed areas without much opportunity, so it made it easier to prey on folks that will accept the poor treatment. I'm pretty sure having to pay folks for each of those hours would change the behavior of the company in the employee's favor.
So yeah, even one hour of 'on-call' should be paid extra, outside of your base pay. Even at companies and in industries that aren't being actively evil, so that it doesn't happen in the future.
Surcharge for on-call activity is an incentive for the organisation to minimise on-call. If I ask my boss for resources to automate operations, he's going to measure that against the cost caused by on-call activity. If there's a flat-rate cost for on-call, the organisation has one incentive less to improve operations and reduce incident count.
My contract says that if on-calls are needed, I might have to be in the rotation. This clause increases my pay rate even if I'm not on-call.
If however I am actually on-call, I am paid more. And if the on-call rings, I'm again paid more on top of the on-call period. And as the French law mandates 11 consecutive hours of rest, if the on-call rings in the middle of the night, I'll usually come to work later the day after.
If you don't have advantages for being on-call, you're the one being taken advantage of.
In a US state with at-will employment, on call can be added to an existing contract without additional pay - continued employment is considered to be sufficient "consideration" on the company's part.
At one place I worked, they wanted 12-hour on-call shifts with you never more than 15 minutes away from being logged on (it was a multi-week Big Event), with the promise of, maybe, time off in lieu as recompense.
They were most displeased when I declined this opportunity.
> 4. You're not up more than 1 week per 1-1.5 month.
That seems excessive if you're expected to be able to log into your work system within X minutes.
Having to be essentially home, near a computer, 25% of the time (1 week out of 4) is a pretty heavy burden, especially for people who prefer to be out, rather than home.
It's a heavy burden, and a lot of teams might want to consider a longer interval, but there's a lot of legitimate scenarios where it's just not feasible to distribute an oncall rotation among 8+ people. I would definitely point more towards 1 in 6 as the ideal minimum.
I would add a "there is time reserved to write automation to reduce toil".
I have a fair bit of experience on teams with developers hating on-call, and the critical issues generally fall into two camps:
- In some cases the org was absolutely open to giving them time to resolve issues and automate stuff away (PMs were proposing months for fixes only, cleaning up boards, etc); there just was no interest until enough escalation happened (creating a massive conflict between ops and devs in the meantime). Even offering to do the work was met with a "stay in your lane" kind of response and no collaboration at all.
- In others the company just did not care, deprioritized tickets until things blew up completely, then finger pointing started and all that nice toxic bullshit.
On-call means you have to plan your free time around being available to work. That's never going to be "just fine" for some of us, no matter how it's structured.
My one weird trick is to have a zero tolerance policy for flaky monitors/tests. If it’s not accurate, we either have to drop everything else and fix it, or disable that alarm entirely.
Like they say, normalization of deviance is real, and the only way to fight against it is to have every form of deviance be a problem.
For extra credit: if you have a weekly or monthly team meeting, include an agenda item for the people that were on call so they can debrief the team on what alerts fired and what the resolutions were. As a team, you can then decide which alerts need to be deleted or need adjustments, and if there are additions or edits that need to be made to the runbook.
A big thing to avoid in this whole process is "naming and shaming." The Google SRE book calls this a "blameless postmortem culture," and it helps you avoid perverse incentives for people to hide or obscure latent production issues.
Yeah. Also, unless you are a genuinely essential application—like air traffic control, a hospital, or a nuclear power plant—you can live with a few hours of downtime.
AWS goes on the fritz for a few days out of every year and breaks half the internet. Your business will be okay.
99% of the time your shit just isn't that essential. That's the One Weird Trick: don't get suckered into thinking your corporate vision is so important that it can't have an issue wait until morning.
I tend to agree with you, but found an exception a while ago.
A certain file has to appear before a specific time, or else some people don't get money they deserve and rightfully get very angry.
Except 1 or 2 times per year there is nobody in that situation. No payments have to be made. No file appears, as other alerts would signal an empty file.
As the relevant time was in business hours and the thing was important, I decided to swallow my pride and accept that invalid alert.
Resolution procedure is documented as: call team X, and ask if this is correct. If yes, black out that alert for 24 hours.
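For what it's worth, the "black out for 24 hours" step can be scripted; here's a hedged sketch assuming an Alertmanager-style silences API (the alert name, URL, and field values are made up for illustration, and may not match whatever alerting stack is actually in use):

    # Sketch only: assumes an Alertmanager-style /api/v2/silences endpoint.
    import datetime
    import requests

    def blackout(alertname, hours=24, am_url="http://alertmanager.internal:9093"):
        now = datetime.datetime.now(datetime.timezone.utc)
        silence = {
            "matchers": [{"name": "alertname", "value": alertname, "isRegex": False}],
            "startsAt": now.isoformat(),
            "endsAt": (now + datetime.timedelta(hours=hours)).isoformat(),
            "createdBy": "oncall",
            "comment": "Confirmed with team X: no payments due, missing file is expected.",
        }
        resp = requests.post(f"{am_url}/api/v2/silences", json=silence, timeout=10)
        resp.raise_for_status()

    # blackout("payment-file-missing")   # hypothetical alert name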
No, but, the point is there are never alerts that aren’t alerts.
This is arguably worse, since it’s otherwise a very important alert so anyone that sees it will freak out. Since it happens only 2 times a year, anyone seeing the alert for the first time (depending on churn, this may happen quite often) is going to think it’s really important.
Hopefully you don’t regularly get these alerts because something actually went wrong (say once a year), but that means 66% of all the alerts you get are false alarms.
Unacceptable. Make sure that the file is there but empty if nobody needs to get money. If empty could be a failure case, have a 'this page intentionally left blank' type arrangement for the file contents. Done, no monitor exceptions.
Sorry, been there done that. Layout of the file is dictated by an external party. Empty or dummy file is major bad news that blocks other payments. If the file exists, it should have at least 1 valid payment. Payment of 1 cent to a dummy account is illegal. File is produced and consumed by 2 different software packages from 2 different vendors. If the business knowingly creates an invalid file, they commit fraud.
To be honest, calling a human and asking if they are really sure in this case is probably a good idea.
I never said 1 cent payment. I didn't know diddly squat about what your domain was or any specifics. You just said that that was a good case of 'bad monitoring is OK'. It's not. Maybe your hands are tied, I'll give you that. It won't make it good though.
Also I said 'this space intentionally left blank'. Whatever actual form that would take in your example. Just because multiple entities use the same interface does not mean a bad interface is suddenly a good interface.
To take an analogy and stay in payments (not sure what your specifics are). If your live system for CC processing OKs a payment from 4111 1111 1111 1111 you're toast. Your system better only 'process' that in a test environment. Perfect for a dummy row that gets ignored but ensures your monitoring does not trigger.
And yes, been there, done that too with these file based interfaces.
(it's funny how some systems that don't actually process your payment right away actually let you book with a known dummy CC - in Prod. Technically OK because you'll pay for real later - think hotel. Still funny)
To be clear: Not a good case of bad monitoring OK. More an unresolvable case given the constraints. I've had to tame plenty of cases of bad interfaces, and that was the main one that evaded any decent resolution.
After years of iteration, here’s what our team does.
The team is remote and distributed across multiple time zones ranging from West Coast US to Western Europe.
This gets us as close to round the world coverage as we can have.
There are two people on call for each shift, each shift lasts a week.
It will typically (but not always) be one person from the US and one person from the UK/EU. This helps reduce the cost to any single person and spreads it out, so what might be night for one person is morning for the other, and vice versa.
All of our alerts are prioritized/categorized to help prevent alert overload.
For example, an alert for a test/QA environment will not fire outside of business hours, and it has a much longer time before it’s required to be ack’ed or resolved.
There are two on-call rotas: critical and non-critical.
Critical, production-impacting, and/or client-facing alerts are dispatched to the critical rotation.
The non-critical rotation only escalates alerts during business hours, again, with a more lax timeline for acknowledgment or resolution.
People are not part of both rotas at the same time.
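To illustrate the routing described above, here's a toy sketch in Python; the field names and business-hours window are hypothetical, not our actual tooling:

    from datetime import datetime, time

    BUSINESS_HOURS = (time(9, 0), time(17, 0))

    def route(alert, now):
        """Return which rota (if any) an alert should page right now."""
        in_hours = (BUSINESS_HOURS[0] <= now.time() <= BUSINESS_HOURS[1]
                    and now.weekday() < 5)

        if alert.get("environment") in ("test", "qa") and not in_hours:
            return None                    # hold test/QA alerts until business hours
        if alert.get("severity") == "critical":
            return "critical-rota"         # pages 24/7, tight ack deadline
        if in_hours:
            return "non-critical-rota"     # relaxed ack/resolve timeline
        return None                        # queue for the next business day

    # e.g. route({"severity": "critical", "environment": "prod"}, datetime.now())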
If there’s a big enough incident, the folks on call get to take off the next working day or the one after.
I (the manager) am on call 24/7 for escalation.
Anything that is an annoyance during on-call is a candidate for review and change.
That can be anything from thresholds to code to upgrading some IaaS/SaaS subscription. Or even straight up disabling the alert if it provides no value.
People can swap on-call days as they want.
Typically, this happens if there’s a birthday, personal event, or PTO, and it’s worked out among team members. If no one else is available, then I’ll take their shift and act as primary.
How does it work for you to be on-call 24/7 for escalation? I get that that ends up happening for many committed founders/operators/managers, but I struggle how that can be a real strategy.
Are you never off-grid for a bit, or drunk in a bar, or just on a real no-work vacation? There seem to be situations where being on call just isn’t feasible.
I was effectively oncall 24/7 at my job at times in 2020.
I barely noticed the pandemic. I never strayed far from my computer. Also, yes, I tried not to drink much.
I certainly learned what my limits are. People think I am a pretty good engineer (not amazing) but what I am known for is being able to keep that level of performance up for a long time.
For my part, despite my reputation, I tried to quit a few times. Not the job, the company entirely. I have never cried at work, but came close once or twice after being up for days and unwinding from a big escalation.
I think it’s a little disrespectful to resurrect someone’s comment they tried to delete. It’s their comment, and we don’t know their reasons for deleting it.
I'm from PeopleSoft / ERP world. The same PeopleSoft HCM base application would generate one off-hours pager alert a month at one client; and half a dozen a night at another client; due to different customization/implementation/complexity of business logic and data.
Any on-call/on-shift rotation system must be viewed through the lens of actual demand and need.
At first client, we had 3-4 developers total who shared pagers on weekly basis, as per the OP, with no undue stress or impact on their day job.
At second client, we now have multi-tier support starting with on-shift (junior but specialized ops team members who stare at computer overnight and provide immediate response), Tier 1 and Tier 2 on-call support, and multi-level escalation rotation.
And yes, there are still people who get woken up all the time, always, because the buck eventually stops there :-/ . Being on call sucks, as per the title. I've been in 24x7 escalation roles; I don't drink to begin with so that's not an issue, but it absolutely had a significant negative impact on my social & family life, sleep, and stress levels. I've spent significant effort to a) make the system better, both in terms of a more reliable application and a deeper, more self-sufficient support team tree, and b) move myself out of the role, though that relies on success in a).
I do find it fascinating to occasionally meet very senior people, with family and social life, who are positively EAGER to be on-call and engaged for every little thing all the time always - and then, unfortunately, have same expectations of literally everybody else ("Let's all come on bridge, always, for everything, anytime")
Another data point from a startup: we have a few people, also globally distributed, on on-call, and we use each other as backups in case, as the commenter suggested, you're drunk at a bar or camping. Our site doesn't break much (we haven't had an incident for over a month now; maybe something happens once every 1-3 months and it's usually not severe) and we have redundancy in the schedule, so we've come to not feel so neurotic about it.
Even in all the situations you’ve listed, I still have my phone with me.
If I’m going to be out of cell coverage (e.g. a plane ride, or in the countryside with spotty Internet) or simply want to be left alone, I usually plan for that in advance and do a combination of: 1) scaling back our risk exposure by rescheduling work (which requires you to have a good understanding of the business, its needs, and its timelines) and 2) shoring up the bits I feel most wary about through code, documentation, tooling, and/or contractors.
The same goes for the team SMEs: reschedule where I can, cross-train where I can’t, get headcount where none of the prior options work.
I’ve been on call since 1999. I had to figure out a pattern that worked for me (and my family) but wouldn’t result in a life that was boring or worse, one that I resented.
I’m paged if the two people on call both fail to ack, which is extraordinarily rare.
But what it’s there for is if the team is experiencing something that is new/novel (where I can provide some targeted guidance) or the situation is spiraling and will get worse before it gets better (where I can provide air cover).
> Anything that is an annoyance during on-call is a candidate for review and change.
What does that mean in practice?
Hopefully: "This woke me up last night. It's now the top priority until it's fixed so it never wakes anyone up again. Sorry product manager, your new feature will have to wait."
"This woke me up last night, but it could have been something that didn't need escalation outside of business hours."
"This woke me up last night and had I snoozed for 5 more minutes it could have been catastrophic, let's get some more proactive monitoring in place."
"This woke me up last night and it was triggered by bad user input, we shouldn't get alerted on this but more importantly, we shouldn't allow users to submit this crap."
Very rarely do I encounter alerts that are traced back to some deep architectural flaw that requires me to tango with a product manager and their roadmap.
Oftentimes, our team escalates to the engineering lead in question and a small bug fix is slipped into the next release.
At my last gig, on-call worked well, I thought, for a few reasons: it was our services that we wrote, it was 1 week out of 6 that you were on call, we heavily prioritized fixing unactionable alerts and automating fixes -- every alert had a runbook entry that described the non-automated fixes, and while on call your sprint commitments were not counted.
That last point was very nice as it meant you could work on whatever you felt was most important for quality of life improvements all week long while not fielding on call issues. This meant that I looked forward to on call.
Honestly, that last point seems like something that would make on-call extremely palatable. Essentially acknowledging "on-call sucks; in return, here's the latitude to work on whatever you happen to think is important/interesting"
It is OK being on call for your own servers and your own software.
I think it is also why fewer and fewer people are keeping stuff on premises and are going for SaaS/cloud solutions.
If someone wants to have my software on their servers without me having any access, they had better have a dedicated person for running it, and I don't care even if they pay $1000 per hour - I am still not dealing with a server I don't know while talking on the phone with an admin who has no clue how my software should be configured.
My company (UK) recently tried to force on-call on all engineers.
The initial wording was very restrictive, like a 5-minute acknowledgement time and 15 minutes to be at your laptop, 24/7 for 7 days. They tried to have this implemented without any extra remuneration or perks for the on-call engineer.
On top of it possibly being very illegal, it seems very immoral to spring something like that on people who did not agree to it when they took the job.
I fought for it and I got them to change their policy in 2 mostly meaningful ways:
- It's an opt-in method
- On-call engineers get paid extra for just being on-call and get extra time off whenever they need to actually do something.
This makes sure that you only get people actually willing to do it and there is an incentive. I think it's been quite a successful program!
Luckily I didn't need to get them involved, but in the UK there are unions starting to form for tech workers, I suggest you join one like https://prospect.org.uk/tech-workers
A company I used to work for asked me to do on-call, it wasn't in my contract, I declined, that was that.
I don't understand what "force" means in this context - the conversation went something like "I have commitments outside of work" and that was that. I mean, there was a back and forth, but yeah, at the end of the day I took the job knowing I'd be available for the hours they wanted.
Indeed, which is why I think they ended up backing out. But even if it could, there are definitely better ways of handling it. The deal we ended up getting is one that benefits all sides and I wish more companies would adopt.
>In a call I was explicitly told "every company does it like this, if that's not ok you might not be a right fit for this company".
In situations like this it's helpful to have a no-management backchannel team chat group set up so you can use it to synchronize a series of "nope, not doing that".
I joined Prospect because my company tried to implement an unspoken on-call arrangement, whereby they would try to call me on my mobile 24/7 expecting an immediate response. I asked what the additional remuneration is for that, and they said there isn't any.
Now I'm a Prospect member, and my mobile is always on mute.
I used to work for an MSP. They billed 2-3x the normal rate for on-call to clients. We, however, were simply paid our hourly rate plus overtime. It created a perverse incentive to have as many on-call events as possible as it was very profitable for the company. They billed minimum time to clients, but we were told we could only bill for the exact minutes spent working.
Yeah, on-call is horrible. If things need to be up 24/7, then some team should be staffed 24/7 around the world.
The worst part of on-call is the control it has over your life. For one week I can't do anything I would normally do. (If your company actually compensates for this, let me know where I can apply, or better, if it doesn't have on-call at all!) Of course managers are never on call 24/7. The worst is when they give the excuse "well, I'm on call all the time by default since I'm the one manager." But they're not reorganizing their life and putting their off-work hobbies on hold because of it, are they?
> a monitoring change that fixes some flaky alert that might page somebody about once every six weeks.
These kinds of things suck. I was on a team where we had tons of these; 10 alerts like this mean you're getting pages all the time. No single alert is worth the time investment. Worse, a manager insisted there would always be a baseline of alerts that go off and we would just live with it.
Teams never seem to understand how to alert on stuff. I've been paged for things going off that might indicate a problem, then you get stuck sticking around because someone else wants to just wait and see what happens. "We should just be cautious." It's impossible to push back on these things; you're just going against someone's gut feeling, like maybe one day we will want to know, and everyone needs to protect themselves.
The problem with on-call teams that are different to the usual SRE/DevOps teams is the lack of understanding of the system. This can obviously be fixed with good documentation, but in reality, no one has good enough docs. The second problem is actually building that team with the skill needed. Someone with the skill to fix complex systems is not going to want to stick around as 1st-line on-call support.
Most places at some scale have some meaningfully defined escalation path. That way you staff people with varying understanding of the system(s).
> This can obviously be fixed with good documentation, but in reality, no one has good enough docs.
One problem I've witnessed related to documentation about on-call issues is the over-reliance on the SOP concept. They only commit to one level or one pass of analyzing the issue. They do not further drill down, either by linking to other notes or reviewing the issues deliberately. It's like they read about the 5 Whys and decided, why not just 1 why.
Yeah, that's true. There must be some kind of middle ground between an ops team like that and the can't-go-to-the-bathroom-without-your-phone on-call we have, though.
> Teams never seem to understand how to alert on stuff. I've been paged for things going off that might indicate a problem, then you get stuck sticking around because someone else wants to just wait and see what happens. "We should just be cautious." It's impossible to push back on these things; you're just going against someone's gut feeling, like maybe one day we will want to know, and everyone needs to protect themselves.
From an ops person: if an alert does not have:
- clear, provable impact on customers (internal/external)
- clear documentation (e.g. runbooks) on how to solve it
It should not be an alert. I took this path (successfully) when trying to remove spurious alerts that existed only for the ego of someone (most absurd example: something that started complaining when p99 for some endpoints went >500ms, which happened every day when we downscaled the ASGs because business hours were over. No clear path to resolution, and the impact was a couple of pages opening a bit slower, sure - but the number of customers using those pages after hours was <1%!).
It sucks, definitely, but the best way to deal with those alerts is to prove they're pointless or a waste of time, or that they can be automated around and should be automated around (and I've seen so many servlets leaking memory triggering OS alerts for OS teams, or spawning infinite threads and never cleaning up after themselves...).
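A minimal sketch of that "no provable impact or no runbook, no page" gate, with hypothetical field names, might look like:

    def may_page(alert):
        """Only let alerts page if they document customer impact and a runbook."""
        has_impact = bool(alert.get("customer_impact"))  # provable impact, internal or external
        has_runbook = bool(alert.get("runbook_url"))     # how the responder actually fixes it
        return has_impact and has_runbook

    def dispatch(alert):
        # Anything failing the gate goes to a review queue instead of
        # paging the on-call engineer at 3 AM.
        return "page-oncall" if may_page(alert) else "review-queue"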
If the company does not want to do it, and pushes back, I would recommend starting to look for another company. It's sad, but it is what it is. 99.99% of software does not need a follow the sun rotation (or people damned to night shifts), just a bit of thought about what happens when things fail.
On-call is even worse for people with disabilities. I quite literally can't do it unless I stop taking my antipsychotic.
Under ADA, I can not be placed on call, regardless of policy, nor can I be discriminated against for that. On-call is not an essential function of being a software developer, with very few exceptions—all of which have nothing to do with "policy" or "fairness".
Needless to say, companies (and some coworkers) really don't like this.
Not to detract from your comment, but you don't have to have a disability. Some people simply cannot do on-call, disability or not.
I had a co-worker at a previous job who did two or three on-call rotations and told our boss that he couldn't do it. Mentally it was simply too much for him, especially outside business hours, where he felt alone with too much responsibility. In terms of abilities and qualifications he was absolutely able to do the job. Nobody complained or got angry with him over it, because everyone could relate.
At the other end of the scale I had another co-worker, in a more complicated scenario, who absolutely didn't care. The payment for the on-call shifts was very good, so he just grabbed as many as possible. He would just take his laptop golfing, no problem. His reasoning: either he'd know how to fix the problem, or if not he'd just call someone else and hand off the incident.
I didn’t even think about that, but absolutely. I’m not a single parent, but my wife works in retail, and just those late hours make daycare pick-ups, late-afternoon meetings, on-call, and incidents a major hassle.
Unless you've tested this theory in court it might not be true. It's almost certainly not as cut and dry as you make it seem. Many companies put the same people oncall who write the code, meaning it literally is an essential function of a software developer to provide oncall support. You'd have to argue in court that it's not really essential but it would be situationally dependent.
That said, I'd hope most places would be willing to accommodate you. Places I've worked have always treated oncall as a kinda optional "right thing to do". I've never seen anyone punished for missing an alert. You'd have a good argument if that were the case at your company but that approach to oncall is not universal.
That would be one of the exceptions. For example, high-frequency trading firms always need developers on call while they are actively operating. Keeping those systems running correctly is essential to their role in the company. Same with small companies who have no other staff, assuming they even have enough employees (15) to be under ADA. :)
For the more common scenario of on-call rotation, it would be very difficult to make that argument because other people can take up the disabled person's shifts.
I'm one of the few people I've ever heard of who actually enjoys being on-call. I believe it goes with my puzzle-solving mentality to an extent. Being randomly challenged with a problem to look at where you might not know the solution simply excites me.
Combining on-call duty with an approach of weeding out repeating issues, building better systems, and ensuring that unnecessary calls don't happen is key, of course; being woken up 25 times for silly predictable errors is pointless and draining.
And finally having an employer that doesn't expect you to be in at 8am if you've been up all night is also very important, catching up on sleep is necessary to manage your balance and health. But given this freedom, I totally dig it. :)
> And finally having an employer that doesn't expect you to be in at 8am if you've been up all night is also very important, catching up on sleep is necessary to manage your balance and health. But given this freedom, I totally dig it. :)
Check the labour laws in your country, I'm pretty sure expecting people to be in at 8am after working on an incident during the night as part of your on-call is illegal.
In any normal country there are laws to ensure employees get enough rest every day.
A major issue with on-call, and certainly one I've encountered multiple times, is the high likelihood of moral hazard - the people who are responsible for addressing incidents are not the same people who designed and maintained the system at fault. This results in the former team feeling powerless to put out fires which could have been prevented by more robust design, and the latter team having no incentive to improve reliability.
SRE gets this right, at least in theory, by requiring that all production systems be reviewed and approved, including observability and incident management procedures, prior to entering service. This ensures that there is some shared responsibility across teams for maintaining uptime.
>> A major issue with on-call, and certainly one I've encountered multiple times, is the high likelihood of moral hazard - the people who are responsible for addressing incidents are not the same people who designed and maintained the system at fault. This results in the former team feeling powerless to put out fires which could have been prevented by more robust design, and the latter team having no incentive to improve reliability.
The Amazon approach was (still is?) to have the team that develops and deploys the software be responsible for the on-call rotation for that system.
If you are developing software for such a team, it gives you a direct reason to make sure everything is designed and tested well before it is deployed to production--you (or a teammate) will be answering the early morning alert to fix it if it is not.
A direct feedback loop like that is remarkably effective at preventing the moral hazard, and it ensures direct accountability when buggy software gets deployed.
> SRE gets this right, at least in theory, by requiring that all production systems be reviewed and approved, including observability and incident management procedures, prior to entering service.
Both doing reviews and being subjected to them suck balls. I’d much rather be on call 24/7 to fix the problems I caused myself.
The most attractive thing to me in that SRE handbook is the error budget. I don't mind poking at your thingy but I'm only going to do it a few times before it becomes your problem instead.
Being on call should just be paid at the normal hourly rate. It's an outrageous setup: we need you to be available but don't want to pay you for it.
I did it for a few years and it made me pretty depressed. By the end of it I stopped giving a shit and just resumed my normal Friday night of having a few beers (or a lot of beers). At first they offered me a rate that was less than min wage. I said no until it was an acceptable rate. Every other company seems to pay terrible on call rates in the UK.
It makes me angry just thinking about this period of my life.
Seriously, if a company wants to make money doing SaaS then they can't expect to steal employee time. My advice, if asked to do it, refuse until they offer very good money for it. Good luck to them recruiting a system expert to replace you.
I've never been in a position that had on-call responsibilities that was hourly. Every one of them was a salaried position. Is on-call for hourly employees a thing? Sorry, I'm not well versed in how to screw over employees beyond my direct experiences.
Pretty normal for government work. My dad was a plumber for LA County, on call pretty frequently to respond to broken pipe emergencies and what not, and got paid an hourly rate on top of normal salary for that, whether he got called in or not. Even as a software developer contractor, when I've worked government contracts, if you're authorized to charge overtime, you do have to actually charge it, and get paid more than your salary, if you work overtime. Otherwise, you're donating labor to an executive agency, which violates the Constitution's granting of exclusive power of the purse to Congress. This is the same basic reason that, if you donate money to the government, you can only donate to the Treasury general fund and not to a specific activity, also why Iran-Contra was illegal (one of many reasons, I suppose).
Right, and in my situations, if the crisis that demanded I be brought in was solved, it was assumed/expected I'd be taking those hours out of my normal work hours. So maybe leaving early one day, or an entire day depending on the severity. On call hours were treated as a multiple vs one to one.
Seemed pretty sane/rational to me that I never minded being on-call. There were occasions where I was not immediately available, but I notified everyone when I would be and got to work when I said I would.
I can see where this is ripe for abuse by either side, so I'm not trying to brush off those that are getting abused by it.
This article should be called "being on a sucky team sucks".
It sounds like OP just has experience with one bad on call. I've also been there. I've even heard of teams with 100+ high-severity issues per week. But it doesn't need to be that way.
My current team has the best on call I've ever experienced. It's a mix of a lucky product and some discipline. It isn't rocket science really.
If a team is drowning in ops, here are two easy techniques I've seen work well.
1. Whoever is on call, their job isn't only to answer pages, but also to improve the system. This solves the dichotomy of feature work vs ops fix. The ops fix is your job for that week.
2. Have team-wide (even org-wide!) fix-it days, where everyone works exclusively on operational issues.
Again, we didn't invent this. Look at Google's SRE books on reducing toil. You can adjust the ratios of feature vs ops work as needed.
And if you are in leadership, please acknowledge, celebrate and reward operational work. People tend to work on what they perceive as being valued.
>1. Whoever is on call, their job isn't only to answer pages, but also to improve the system
Way to make being on-call more punishing while detracting from both goals -- your clients receive worse support, and the amount of work done on the fix is inversely proportional to the time you're spending with clients who are calling because there's a problem, not because they just want to talk.
2 actually works; tech debt hackathons are a great thing, and sometimes ops and dev hands don't communicate as effectively as they need to throughout the SDLC.
The point of 1 is that whoever is oncall shouldn't be working on features/sprint tasks, but actually fixing the underlying problems of the system. This is second to addressing the pages.
Of course if your system is so deficient that engineers spend 100% of their time putting out fires, then you need to address that first. That's what reducing toil from the SRE books is about.
I manage a team which operates services (among other things) for clients. Our aversion to being on-call drove us to build robust systems, automate the heck out of everything and monitor as much as possible. That allows us to spot issues during the day shift before they become problems for the night shift, so on-call duty became over time a relatively relaxed affair for the team.
Years ago, I was a second-shift operator in a computer center for an insurance company. We ran production jobs on an IBM mainframe. When jobs would crash we would write up the error on an ABEND form (IBM called crashes ABENDs for ABnormal END), collect the printout and call the programmer responsible. One night a production job crashed late in my shift about 10:30-11 pm and I woke up the programmer responsible. He seemed really groggy and it took a few minutes for me to describe the crash to him. I would always try to be helpful and suggest options for recovery (you got to know the programmers and what their recommendations would be based on the type of production job). Usually they would hold the job, restart it or say they were coming in to fix, this was back in the day where if they could log in remotely, it was with a clunky CRT terminal.
The programmer told me to just restart the job, I noted that on the form. I came in the next day and my boss called me into his office, his boss was there too. They wanted to know why I restarted the job, which caused all kinds of corruption to the database. They had spent the better part of the day recovering the database, then running the batch job, which meant that that system was unavailable for use by the agents.
The programmer swore up and down he did not tell me to restart the job, said I never called him! He was that deep into sleep. But on the form I noted the time I called him and his response to restart the job, so they believed me.
This highlights a problem with people responsible for multi-million dollar systems being woken up in the middle of the night and having to make quick critical decisions.
Getting woken up in the middle of the night by a blasting noise is traumatic. I still have PTSD from my on call time. I swear that to subject a prisoner to that type of condition, where they are woken up at arbitrary times and forced to solve complex problems, would be considered cruel and unusual and inhumane treatment.
I got reprimanded and almost fired recently for being pissed at someone who woke me up for some bs page. Like of course I'll be angry, I got woken up for work for no real reason. Now I'm extra pissed about the managers coming down on me for this instead of correcting the oncall...
My team has what we call "the strike team" which is not just on-call but even during the day, your job is basically to make everything more robust (as opposed to what we do normally, which is work on new systems and features). So just last week or so there was an alert on Sunday that I then spent the week to fix permanently. These are also services that my team are the sole developers on so I know when I fix something it will generally stay fixed.
On top of this, we have a rotation so each of us is only on-call one week out of maybe every four or five. And although I agreed to it mostly because it was a condition of the job and I wanted the job more for learning how the team worked than I cared about the money, the compensation for being on-call is actually pretty good even if nothing happens. And if on-call lands on a holiday, we get the holiday time as vacation days to spend later.
So overall, while I would prefer not to be on-call, I feel like our team implements it about as well as can reasonably be expected. I expected it to drive me crazy, but it actually hasn't yet.
I used to work for a small systems integrator. We used on-call mobile phones, the "hot potato" was carried by the on-call engineer. Any actual time worked outside of core-working hours was paid back in the form of time-off-in-lieu. The salary generously reflected the being on-call requirement.
The other factor was that a "call-out" was not completed until the root cause was fixed.
I believe the real reason for the compassionate arrangements was that the owners of the business were former engineers and were even available to escalate calls to them if you got stuck. Our personal phones had everybody else's personal numbers in the address book, but we were never permitted to give them out to clients. Clients only had access to the "hot potato" phone numbers, which also received the various paged alerts, etc.
Favorite part of quitting my last nightmare of a job was that I continued to get text notifications that I was the on call person for the night/weekend for ~6 months after I left. I slept well each night knowing I had the "something is broken" notification number blocked.
Yup - those fleshpots where the code base is too convoluted, or everyone is "too busy" to tend to such matters as who gets notified for what (and to, you know, update these settings once people leave the org) probably have lots of other red flags. So you can definitely sleep better knowing you managed to escape that one.
Despite working on a financial product where a production fire could actually cost normal people money (worst case scenario), I don't mind being on call at all.
Why? Because I never get called. Because we wrote tests. Because we were effing careful. Because we have safeguards in place. Because IT has put hardware redundancies in place and because we have circuit breakers.
Being on call only sucks if you're being made responsible for crappy software that breaks regularly.
That's a problem if you aren't empowered to make it better. If your team isn't encouraged to care. Better... find a place that does care. Find a place where people hate the idea of being awoken at 4 AM and do everything in their power to make sure it can't happen to anyone.
The last time I got called it was because AWS went down, and we couldn't do crap about that.
Absolutely this. I've worked on teams that do on-call rotations and teams that don't. What I typically find is that when you're on-call you work hard for your time off-work to be peaceful. Everything becomes infinitely better - code reviews, testing, deployments. When you do have an on-call issue that is beyond something simple and transient, it gets brought up in incident reviews for all engineering teams. This keeps everyone accountable and focuses team's efforts on patching anything that causes on-call pain. You really don't want to have to say anything in this meeting. And if you do on-call right and focus on quality, you're basically never doing anything during your rotation. On the flip side, when a team doesn't do on-call rotations the code quality suffers greatly.
I have an opportunity to move into an all-remote role that would require me to be on call 24x7 for one week every 3 or 3.5 months. I asked about the frequency of incidents that required the on call person to be paged, and it looked like on a bad week it was about 7 total - a good week was 0. I’m personally torn on whether or not I’ll be okay with the on-call lifestyle so I appreciated this piece for giving me some food for thought.
That level of rotation is not a deal breaker in itself.
You can basically look at it as 3~4 rotations per year, likely 1 or maybe 2 bad weeks per year.
I've never been on a big enough team to have a rotation that wide, more typical is every 4/6/8 weeks. Beyond that teams usually build out globally and your rotation remains as often but is fewer hours per day (ie - US covers til 8pm when APAC comes in). The bigger concern arguably is how noisy weekends are?
It would be for an internal environment, so weekends would be hopefully quiet. The incident log going back a month or two showed incidents mainly in the EU and US workdays, around 5 AM Eastern to 7 PM Eastern.
We are even in the process of spinning up some product-only oncall for issues that don't need dev "escalation" (e.g. most commonly some customer messed up their credentials or their API token expired end-of-day, resulting in some higher error rate).
Which is why I won't do it. Either hire people specifically to do it, or provide sufficient incentives so that people volunteer. Being a good "first responder" is a skill and not everyone has it or wants it.
Yeah I think most of engineering just has never heard of a night crew... And it works out to the benefit of companies, unsurprisingly.
If you want people to work around the clock, hire enough people to work around the clock. Maybe this means engineer salaries get repriced, maybe it means more and more companies figure out they don't necessarily need on-call.
All that said, the status quo probably won't change, there are more than enough engineers who either don't know better or do consider it priced in to their salaries.
The last place I worked at decided to cut the ops staff entirely and put all devs on support duty. Hilarity ensued because no one wanted to do it and the suits were baffled. They established a hard cutoff date for the transition, and at the last minute had to keep some ops on for a few months. Before that cutoff expired, they laid off almost our entire division, leaving only a fraction of untrained and unwilling staff to keep the lights on.
Even more humor: security hadn't been consulted on this and imposed some requirements when they went to implement it. Any dev, during their 2-week support and on-call period, would have their access to the dev systems removed. Anyone not in their 2-week window would have no access to prod.
This might work at a small company with a limited number of SWEs. Every company over 1k that I’ve worked at that followed this pattern was a shitshow when it came to incidents. Not to throw it out the window, just suggesting that it isn’t as simple as it seems here. The accountability problems surface quickly, even in very talented software engineering orgs. And of course the only thing harder than convincing a software engineer to go on call is convincing a software engineer who was hired onto a team that doesn’t do oncall to do so.
I didn't get many, but of the few "events" that pulled me in to help, one took aim at a great romantic date on a Friday night and almost screwed it up, and the other had even better aim (details NSFW).
Being on-call is a deal breaker / non-starter for me.
I see myself helping the guys in operations to stay happy and well equiped to deal with keeping the service up but I am not doing operations myself.
Ever.
A challenge here is that when the devs keep being hammered with new features and cannot allocate time to improve the system's defensiveness against these "events", things get tricky.
I've been at my first software engineering gig for 4 years now, we don't have on call, and I've sworn to myself that I will absolutely never take a position with on-call.
I'm a bit worried that it will hamper my career prospects, especially as I've moved to doing more backend work. But I just can't imagine being tied to a work phone on my personal time -- I have a hard enough time enforcing work/life balance as it is.
On call is a symptom of a poorly run company. It's a great signal that you should run far away from any place that requires it.
Most software isn't as critical as we think, and the software that is, is expensive enough to have a properly sized staff.
On call exists for the same reason game devs are paid shit and open source exists... Software engineers don't value themselves properly and love giving away free labor.
This sounds like a very software-engineery opinion. Not saying that you're wrong, I understand that on-call is not that important in software development. But as a System Administrator I think that on-call is useful, else we wouldn't notice any outages happening during the night and would then only start working on it in the morning.
Plus, we have customers who actually work during the night, be it timezones or specific industries, so we can't just ignore outages outside of our engineers' working hours. And if you're selling SLAs to your customers with promised uptimes etc., you had better be able to detect and fix something ASAP.
Overall I think "On call is a symptom of poorly run company. It's a great signal that you should run far away from any place that requires it." is a bit too harsh of a statement.
IMHO if you want 24/7 monitoring, then you need 24/7 staffing. If you try to offer 24/7 service and sell SLAs without actually having support people working 24/7 shifts and faking it through on-call, that's the sign of a poorly run company who is effectively lying to their customers (by saying you offer 24/7 service when you really don't) and lying to their employees (by saying that they are working a regular-hours job when really they're doing 24+hour work shifts).
> IMHO if you want 24/7 monitoring, then you need 24/7 staffing
I've never worked at a company which had 24/7 staff, I guess that would require international teams.
> If you try to offer 24/7 service and sell SLAs without actually having support people working 24/7 shifts [...] that's the sign of a poorly run company
The reason I almost can't agree with this is, that almost everyone I know who's working in IT Engineering/Administration works in on-call. This includes people working at the federal post, railways, biggest banks, largest grocery stores, national TV/Radio etc. And that's all in Switzerland, so I doubt that these companies are all poorly run.
It's also not really lying to customers, because customers know about on-call. They know that people generally fix stuff during business hours, with on-call people watching out for the systems during non-business hours.
I wonder if it's a cultural difference or something. I rarely hear people complain about on-call. Not many especially like it, but it's considered to be a thing you just have to do. Also, everyone I know gets paid extra for on-call, and it being Switzerland, on-call has lots of rules.
Moving 2/3 of your existing engineering staff to a different shift would raise enormous communication issues for a company otherwise designed for an aligned workday.
Hiring, say, a minimum of 4 people to do nothing but wait for a page would be incredibly wasteful and probably ineffective, as they'd have no idea what to do if something went wrong because they didn't work on the system.
It's a weird thing about the software industry only. I've worked for actual engineering companies (meaning: we build things in factories), where having your factory or expensive integration lab be idle for 2/3 of the day plus weekends is not justifiable. People were scheduled in shifts, and they worked those shifts. And although there was a concentration of people working regular day shift hours, there was always a full team on-site (not on-call!) 24/7. If you weren't on-site, you were never expected to be on-call.
Is that really weird? In a factory, time is produced goods, and goods are money: 3x shifts is 3x goods, which is probably close to 3x, at worst 2x, the money. In software development we've long learned that with 3x as many developers you're lucky if you can keep the same pace on the project, let alone improve. (It could be 2-3x as many projects, but then you have 2-3x as many operational problems again.)
And then you have companies like HP that outsourced to India, East Asia, and Eastern Europe and then found that they could have 24-hour employment shifts through timezones. Shared integration and testing infrastructure got reused between teams working on different aspects of the same problem. Support requests got handled by whatever team was online when the request came in, and written up and handed off to the next if the clock struck 5pm.
> I'm a bit worried that it will hamper my career prospects,
So honestly, it probably won't, depending on how far you want to go.
But lots of the broken teams people are describing here are easily fixed by mandating that the tech lead / architect / whatever participate equally (or more-than-equally) in the on-call process. It's the simplest way to get people with leverage to push back on POs / executives who see it as just a cost to balance.
So if you want to have one of those positions one day, and you want to do it well - you should want to do some on-call, because you need to know if the systems and processes you're designing / managing are hell for those on-call.
The number of situations where this applies is pretty small in employment % though, and probably shrinking.
I'm curious what you work on. Most things boutique enough to eventually present a "fixed product" today are also usually high-end enough for someone to be doing 24/7 support.
I agree about HN's skew towards startup, but I think most full-time software engineers support businesses that are essentially available 24/7 or at least 12/7. Anything involving selling online, handling money, any kind of SaaS, or any company with worldwide customer support needs something, even if the main product engineers aren't aware of it (usually to the detriment of the people who are on-call). That's lots of startups, but it's also most blue chips or any kind of infrastructure.
> I’m a research engineer at a innovation lab.
I mean, assuming this is private, this is either an incredibly biased/exclusive or a very startup-economy-ish job itself. If your company is otherwise big enough to have a lab it's big enough there's on-call teams somewhere, and if you want to be a CTO/VP/DirEng/Architect you'd better know how they work.
There are vast amounts of software engineering roles you are ignoring. Pretty much all of embedded firmware or hardware driver development. The entire video game industry excluding online play. The many software developers that provide solutions for enterprise software with 9-5 support contracts. Mobile app developers. Whole fields of consulting and integration experts that set their own hours. Pretty much every data scientist or machine learning role ever.
It is online services that are niche. It's just what this website is mostly focused on so it seems more prominent.
> Pretty much all of embedded firmware or hardware driver development.
Any hardware used in something that has an on-call process, itself has an on-call process (or is from somewhere large enough to offer follow-the-sun support). As I said it may not involve the same engineers, but it's there.
> The entire video game industry excluding online play.
That's also a vanishingly small portion of employment in the games industry.
> Mobile app developers.
Again, most mobile apps are... for some kind of web service, or have some kind of backend, and so have some kind of on-call team.
> The many software developers that provide solutions for enterprise software with 9-5 support contracts
Examples? Just because not everyone is paying for the 5-9 contract, doesn't mean it's not there.
> Whole fields of consulting and integration experts that set their own hours.
Yes. What percentage of the industry is independent consultants? Why would you include them when I already qualified "full-time software engineers"?
Being on-call is working time. Therefore you should be paid for it.
In the team that I work in now, it's voluntary to join the On-Call rotation. People get paid for it: both for being on-call and for any incident they have to actively work on. We also have a partially implemented "follow-the-sun" monitoring team, but that's by accident (our team has members from the West coast of the US, Europe and Australia, but weekends are covered by On-Call).
Compared to the article, many on-call situations could suck even more.
In a past assignment, I was on-call for a service managing thousands of customer instances, each with a different version of a highly customizable application (basically an SDK) developed by another team.
The result was on-call being basically a working day, getting paged on average every hour and at worst needing to track 2 or 3 war rooms in parallel.
Also, you felt powerless: while you were accountable for the availability of the service, a lot of the issues actually came from customer implementations overloading their instance, or from product bugs which would get ignored for years; even when a fix was made, it would take even more years to deploy due to the difficulties in updating the application.
The only saving graces were:
* It was in a large international org, so it was "follow the Sun": no midnight calls, only 8-hour shifts (11:00 to 19:00)
* I live in a country where it's a legal requirement to compensate employees for being on-call
Now, I've switched to a team where we handle on-call on services we control end to end (code and deployment), and it has been far less stressful.
I've always been on the ops side so I view being on call as an intrinsic part of my job, regardless of where I work. My current company does it well I think: each team has its own on call rotation covering their own systems (with the ops team I'm on being a backstop) which means that you only get paged for your own bugs. In addition, whoever's on call that week is the designated "person to be disrupted" if someone needs something from your team. If someone needs ops help, they'll go to the ops on call guy first. If I need to ask a question about one of our backend processes, I'll ping the backend on call guy. We have dedicated on call channels where all the alerts and pages go and where people post for help. Most importantly, leadership is just as invested in the product and happy to jump on as we are. The only 4AM page I've had yet, my boss and his boss were on the call before I was. Overall I'm happy with it
I lead an oncall rotation. It's important to stay on top of the pages/alerts, or else the rotation will rot and enter a death spiral where you don't get any sleep.
Every page (particularly the nighttime ones) is root-caused by the team every week. Each page gets one of four things:
1. Fix the code to handle the situation.
2. Tune the alert to increase the signal - resulting in a more actionable page.
3. Re-route to a more appropriate team.
4. Remove the alert if it doesn't help us keep the systems running.
So many pages were "informational" that we couldn't action and didn't indicate a problem that needed to be dealt with. Many others were bugs that people knew about but hadn't worked on because they didn't know it was waking us up! :)
Now, we get our sleep and people are asking to join the rotation!
Paying people to take the pager does not help when the rot sets in, but it does help encourage people to pick up extra shifts.
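To make the weekly review concrete, here's a minimal sketch of tallying a week's pages by those four outcomes. The alert names, field layout, and the idea of recording the decision in a script at all are my own illustration, not anything the comment above prescribes:

```python
from collections import Counter
from enum import Enum

class Disposition(Enum):
    FIX_CODE = "fix the code"
    TUNE_ALERT = "tune the alert"
    REROUTE = "re-route to another team"
    REMOVE_ALERT = "remove the alert"

def weekly_review(pages):
    # Tally last week's pages by the action the team agreed on during the review.
    tally = Counter(p["disposition"] for p in pages)
    for disposition, count in tally.most_common():
        print(f"{count:3d}  {disposition.value}")

# Hypothetical example: three pages from last week and what the team decided.
weekly_review([
    {"alert": "disk_usage_warning",  "disposition": Disposition.REMOVE_ALERT},
    {"alert": "checkout_error_rate", "disposition": Disposition.FIX_CODE},
    {"alert": "latency_p99",         "disposition": Disposition.TUNE_ALERT},
])
```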
A personal horror story: I have an on-call shift that is 24/7 for one week, only 3 other people are on the shift. Alerts are frequent, noisy, and happen almost every night so you are practically guaranteed to never sleep fully that week, and the sheer breadth of services + teams we’ve accumulated and lack of any clear specificity in alerts means that I’m almost always at least somewhat confused as to if something is actually broken and, if so, how I actually fix it, even after 3 years (two of which when I lived alone during the pandemic). This was my first job out of college too, I was fully convinced that if I fucked up even once or called the wrong person I would be fired and all of the effort I put into getting this career would be meaningless.
I didn’t even get any prep or mentorship, they just suddenly put me on-call during a major product launch. No extra pay or time off btw, just gotta continue work if you were up all night trying to fix some thing with vague priority & vague symptoms (too much latency on random offline service I’ve never heard of that turns out to be a dev experiment, latency being caused by a laggy database that you find out by finding some random message in splunk and regexing it out and into a graph).
I’m definitely a changed person after it, I don’t really… react as much anymore and flinch every time I hear a default iPhone text notification or ringtone. I don’t know how to fix it either—I don’t know if our team has enough people to spread the load out and I can’t think of any better way to keep track of failures in this labyrinth of services, and onboarding people to the point where they can actually take an additional shift is usually 8-12 months. Even experienced people still get ambushed by new services with zero documentation.
Pros though: I don’t really experience much stress or uncertainty anymore in hard situations, and I seem to be much better at problem solving! I’ve also managed to keep my prestigious job with life-changing pay, which feels much more personally fulfilling than coasting at Google (even if it’s for the wrong reasons).
I'm consistently surprised by the number of startups that have distributed teams with people dispersed around the globe and then have an on-call rotation of one person having on-call 24/7 for a week straight. This to me is a management red flag.
One of the first questions I used to ask when interviewing for roles that had an on-call component was "what's the on-call expectancy?" I would recommend asking questions about the "on-call" experience to everyone on your interview loop as well. This is often very informative.
Expecting someone to have no life outside of work 24/7 for one week every 5 or 6 weeks is a real quality of life issue. And to do so without offering extra compensation just seems exploitive.
I used to do incident response against hacking attempts, DDoS attacks, threats like extortion... Like all of these on a regular basis. Something new and extremely stressful every month or so. Here's a fun fact: when your adversaries are on the other side of the planet they get to wake up early and start their day with the full knowledge that you may be just passing out at like 9pm in your locale. I used to answer my phone with "WHO THE FUCK ARE YOU AND WHY THE FUCK ARE YOU CALLING ME" at hours past that at first. I don't really know how I got used to it (Though living somewhere where stuff is happening 24/7 anyways helped), but the compensation was decent.
Some people hate on call, some people love it since they are usually at home playing games anyways or don't mind so much being disturbed during sleep and love to make some extra money on the side.
So just auction the on-call times and have employees bid for them. Naturally, bids for on-call at christmas will probably be lower and for some other times higher. Employer can set a max. compensation they are willing to pay - if there isn't a low enough bid, well, on-call isn't happening. :)
In some companies something like that is already achieved by allowing people to switch times and also exchange compensation. A fully fledged auction is just the next step.
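As a rough sketch of how such an auction could work -- assuming sealed bids, a lowest-ask-wins rule, and the employer's cap acting as a reserve price (none of this reflects any particular company's process):

```python
def assign_shift(bids, max_payout):
    """Give the shift to the lowest bidder, subject to the employer's cap.

    bids maps a name to the payment that person would accept for the shift.
    Returns (name, payment), or None if nobody bid at or under max_payout,
    in which case the shift simply isn't covered.
    """
    eligible = {name: ask for name, ask in bids.items() if ask <= max_payout}
    if not eligible:
        return None
    winner = min(eligible, key=eligible.get)
    return winner, eligible[winner]

# A quiet week draws low asks; Christmas week draws high ones and may go uncovered.
print(assign_shift({"alice": 300, "bob": 250, "carol": 400}, max_payout=500))  # ('bob', 250)
print(assign_shift({"alice": 900, "carol": 1200}, max_payout=500))             # None
```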
You have the ability to make changes necessary at the infrastructure engineering stage, not at the ops ad-hoc response stage, to prevent or end problems that cause call-outs.
There are those that view "DevOps" not as an engineering culture but as "people who fix problems in production" and even "do releases Friday at 10pm", and developers (or "engineers") as "people who make changes, and go to the bar Friday at 6pm." Companies that do this sometimes call themselves places that "move quickly, break often." Places like this, you hope never to work for.
My point is that on-call is OK if the people that are on-call are empowered to either fully prevent, or permanently fix, conditions that lead to on-call events. I'm talking about software development and infrastructure design, which I thought is the topic here.
I've done this for many years. The worst case scenario is that the people that are getting woken up at night don't have any control over the root causes of the wake-ups.
If you own and created the environment, and it wakes you up at night, that's one thing. If you're just on-call and responsible for other people's messes, that sucks.
Or it's incentivized in the wrong direction, in the case of something like an MSP that bills more for out-of-hours calls so it's actually profitable to have events happen.
For larger companies: if you have active customers around the clock, have employees around the world, so engineers are never on-call outside their normal working hours. I guess that's much easier said than done, since having employees in different countries might be complex and expensive (maybe HR outsourcing companies like Trinet solve this for you?) and you have to manage employees in time zones that are offset from leadership.
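A minimal sketch of the follow-the-sun idea, assuming three hypothetical regions whose working hours happen to tile the clock (real rotations would account for DST, handoffs, and uneven staffing):

```python
from datetime import datetime, timezone

# Hypothetical regions and the UTC hours they cover (start inclusive, end exclusive).
# The last window wraps past midnight, so it runs to 25 (i.e. 01:00 the next day).
COVERAGE = [
    ("APAC", 1, 9),
    ("EMEA", 9, 17),
    ("AMER", 17, 25),
]

def region_on_duty(now=None):
    # Return the region whose normal working hours cover this moment.
    hour = (now or datetime.now(timezone.utc)).hour
    for region, start, end in COVERAGE:
        if start <= hour < end or start <= hour + 24 < end:
            return region
    raise RuntimeError("coverage windows do not tile the day")

print(region_on_duty(datetime(2024, 3, 4, 14, tzinfo=timezone.utc)))  # EMEA
print(region_on_duty(datetime(2024, 3, 4, 23, tzinfo=timezone.utc)))  # AMER
```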
Your org needs to be at either end of a spectrum. Either on-call is mostly quiet and non-disruptive, truly only there for huge issues that seldom happen, or you staff up a dedicated 24/7 team. If it's in between, you need to plan on getting to one end before you wear out your team.
I think on-call and the quality of life component are highly dependent on the company culture, the types of alerts, etc.
My org on-call was laid out like this:
* 3 days at a time and then a break of X days (depending on team size; this option was chosen by the team).
* Comp time for any incidents (plus manager flexibility: if you were up late fixing something, no one expects you in early, or at all, depending on how it went).
* We leveraged a provider to handle alert escalation, rotation, phone calls, etc. If someone didn't answer, it rotated through to the next person and on up to management.
* A regular look back at the types of calls coming in, and a re-balance of alerting priorities to make sure that if someone is going to get a call out of office hours, it had better be necessary. We always asked "Could this have waited?"
* A general culture of helping out: if you couldn't fix something, you could ask anyone else near a machine to handle it.
* A general culture of asking whether we could have automated a fix for this alert before getting a human involved.
* Almost all tools were available via mobile, and you would be amazed how often you could fix something from a phone. In fact I fixed some service issue in about 10 seconds at a movie, and never missed a beat.
* Trading on-call windows was typical and easy.
If your org can't do the above and is truly wearing people out, then you need to go the other way and just staff up 24/7 and let people have their lives.
There are many issues with being on-call, particularly in environments where false alerts routinely happen, and where management aren't in the rota so don't directly feel the pain. One of my biggest though is the concept of a weekly rota, with people being on-call for a full week at a time.
Sometimes that works fine, and you'll get no alerts all week, but incidents tend to cluster. If something has changed that caused an incident odds are it's going to have knock on effects, and you'll see more alarms over the course of a week. With a weekly rota you end up with one person handling that, who by the end of the week is completely destroyed.
Anywhere I've been responsible for setting up an on-call rota I've instead gone for daily rotations. That means if you were up in the night last night, someone else is going to be in the night tonight. It also means if nothing happens you don't have to spend an entire week either cancelling plans or lugging a laptop around with you just in case.
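Generating that kind of daily rota is trivial; here's a sketch (the names are placeholders, and it's a plain round-robin rather than anything smarter like skipping whoever was up last night):

```python
from datetime import date, timedelta
from itertools import cycle

def daily_rota(engineers, start, days):
    # One engineer per day, round-robin, so a bad stretch never lands on one person all week.
    people = cycle(engineers)
    return {start + timedelta(days=i): next(people) for i in range(days)}

for day, name in daily_rota(["ana", "ben", "chi", "dev"], date(2024, 3, 4), 7).items():
    print(day, name)
```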
Yeah I have a vivid memory of hysterically crying after getting less than 12 hours of sleep over 3 or 4 days during a particularly bad shift. My pager went off 30 minutes before the end of my shift and I lost it. I should’ve passed the pager but I was too “up” to recognize it and my manager at the time was completely clueless about oncall work. Actually at that same company getting paged 8-10 times off-hours during an oncall shift was not unheard of, the burnout was real… I’m surprised there aren’t more lawsuits regarding this because the toll of sleep deprivation on health is well known.
Thanks for sharing that. I had a similar experience at my last job. The emotional/physical toll is so real. On the one hand, guess that's why we get paid? On the other hand, screw that.
How often does someone get a daily rotation though? One week out of four is not uncommon. One out of every four days on-call would really destroy your life though.
Daily rotation seems too frequent to me. Sometimes you need 2-3 days to really deep dive and fix the core issue after the alarm has gone silent. Also, with daily rotation it is tempting to wait it out and punt the problem to the next person.
Fixing the core issue should be done during regular day hours on a normal work schedule, so it does not justify a longer on-call shift. If something needs 2-3 days of deep dive, you should get proper time for rest (i.e. not on-call) between those days.
The most important criterion for me is that I only be on call for the things I built. Night and day difference in terms of tolerability.
If it wakes me, I should have built it better. So I work really hard to build things that don’t page. So I actually want to be on call for the quiet and reliable things I built, in order to experience the reward for all that effort.
I used to be on call too and it sucked. At some point I even bought a personal pocket laptop specifically so I could still go on bicycle rides when on-call. I also did paddle surfing with my phone connected to a bluetooth speaker so I could go back home in case of an alert.
Now that I am working for a company that has team members in all timezones, this on-call thing doesn't make sense anymore. I understand some companies are not in a global market in terms of customers, but I don't see why their IT/dev teams couldn't employ people from all over the world. You are much more efficient responding to alerts during your own local office hours than when you've just been woken up 30 seconds earlier in the middle of the night and can barely open your eyes with the laptop backlight on.
Obviously doesn't apply to the one taking care of datacenter duties.
I don't mind being on call at my current place. I'm an app owner so I absolutely am the right person to call if things really get out of control. (So I don't even mind being in the escalation list if I'm NOT on call).
What I cannot stand is when folks insist on not tagging errors that they have no intention (or no timeline) of fixing. If you have no plan to fix something, stop waking me up at 3am for an alert. At a former job, I used to put this out to my superiors constantly: "Please tag this alert, get a ticket into our backlog so we can prioritize a fix." Alerts should be for exceptional situations. We've allowed services like New Relic to convince us that Apdex scores always translate to losing money.
I used to be an SRE manager for an on-prem cloud at a previous job. What they don't tell you is that even though your team of SREs that you manage has an on-call rotation, whenever there's an incident, they call the on-call and then call their manager and get everyone on a bridge call until the incident is resolved. Which meant I was on-call 24x7x365. I barely lasted a year before I bailed. I am a single parent 50% of the time (shared custody of kids) and it's hard to be on an incident bridge until 3am and then have to get up at 6am to get your kids ready for school.
I'll never accept another job in my lifetime that requires on-call rotation (if I can help it).
I think more teams should recognize that some features/products/functionality (especially at large companies) is not worth paging for.
Obviously if a customer can't access their bank account or sell a stock, then someone needs to be woken up. But if an endpoint has high latency, or a business can't fire off a marketing campaign ASAP, entire teams of engineers can avoid the worst parts of oncall - the dread of waiting for the "ba-dum" in the middle of the night, the second-guessing about making plans outside of work - just by shifting oncall to noisily page during business hours only.
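A minimal sketch of that routing decision, with the urgent-alert list and the business-hours window as assumed placeholders:

```python
from datetime import datetime

# Alerts on this (hypothetical) list page immediately; everything else waits for office hours.
WAKE_SOMEONE_UP = {"customer_cannot_access_account", "trade_execution_failing"}

def should_page_now(alert_name, now=None):
    # Page right away only for genuinely urgent alerts; defer the rest to 9-5 on weekdays.
    now = now or datetime.now()
    if alert_name in WAKE_SOMEONE_UP:
        return True
    return now.weekday() < 5 and 9 <= now.hour < 17

print(should_page_now("marketing_campaign_delayed", datetime(2024, 3, 2, 3)))      # False (Saturday 3am)
print(should_page_now("customer_cannot_access_account", datetime(2024, 3, 2, 3)))  # True
```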
I enjoy being on call. I have heavy ADHD, so I am not a fan of routine, well-defined time activities and deadlines, but if I get this call at 2AM, I feel intense adrenaline rush, and I feel like a hero of the night saving the prod.
I'd like to learn more about that, it seems relevant to my situation. Some context (I'm not implying that this applies to you): we have a (well above average) capable engineer with a hero complex. In the most difficult of situations, he'll dive in and save the day (or night) where everyone else fails. He's also the one in the team who resisted standardisation and automation the most, which, in a way, led in the past to situations where his unique skills were required.
Yes, this looks like my case. This is why I had to quit the corporate workforce and start working as a consultant — it provides me enough novelty and suits well to my strengths. I realize that in a regular corporate environment, I can be a liability instead.
Thank you. Do you have advice for me how to create a more welcoming and productive work environment for my colleague?
Edit: especially with respect to the automation/standardisation approach we've taken?
The trick to fixing OnCall is to have strong government regulations around requiring employees to be available outside work hours. If that happens, employers will either drop or relax OnCall requirements, or build processes to manage it (e.g. larger teams, dedicated support teams, etc.).
It’s really that simple. Security professionals tried for years to get organizations to implement password requirements and MFA; nothing happened. Then SOC-2 and other certifications came along and, boom, 2FA everywhere.
So I've been oncall at two major companies (Google and Facebook) and, at least in my experience, this covered both ends of the spectrum. Basically, Google gets it mostly right and Facebook gets it mostly wrong.
At Google, a new service has to be supported by the team that developed it. There's an extensive launch checklist that includes monitoring, having a runbook, etc. Here's the most important part: you're paid when you're oncall. The amount varies depending on how important the service is and the expected response time, but can easily be 5 figures a year. Oncall periods vary, but a week at a time is typical, hopefully with 8-12 people in the rotation.
Too few people and people get burnt out. Even if nothing happens on an oncall shift, it's an annoyance and a restriction on what you can do. Too many and people tend to forget what to do. So with a sufficiently large team you may end up with some people in the rotation and some people not. That's why the compensation is important.
Particularly large, important and mature services may enjoy SRE support. You can't throw a service over the fence and have SRE deal with it. It doesn't work that way. It typically needs to have been running for at least 6 months, and SRE needs to be satisfied it's sufficiently reliable, stable and monitored with a good runbook. SRE support is globally distributed and typically means 8-hour shifts during normal hours.
The owning team will often still be secondary support.
Also code has to be owned by somebody. This may be a team but when I was there (some years ago now so it may have changed) this also meant 2 actual people (not just team aliases) had to be owners. This is to avoid abandonware. This very much is a support and oncall issue.
Facebook OTOH is a dumpster fire when it comes to oncall.
Not getting paid to be oncall is (IMHO) one of the biggest mistakes. The mantra is "it's part of the job" but that responsibility is not shared equally. That's the point of compensation.
My experience at Google was that issues were relatively infrequent. What I saw at FB however was that oncall could often be the only thing you did for the week. Noisy alerts, alerts caused by issues in downstream systems that you could do nothing about or would get ignored by their oncall, a bunch of issues raised that some would just ignore until they expired (or closed just prior to going out of SLA as "could not reproduce"), etc. You may also be dealing with code that nobody owns (or, rather, nobody takes responsibility for) for features that are live.
Plus the incentive structure, at least on the product side, was to ship new features. Oncall was often treated as just extra work you have to do on top of whatever else you're doing.
Obviously I didn't see how every team did it so none of this is absolute but I did see a reasonably high number of samples.
It's also worth noting that not everything at FB is like this (eg the Web Foundation people were and I believe still are outstanding). Also, in high-visibility outage situations you have highly knowledgeable individuals who can and do get involved and know the right people to push.
The FB equivalent of SREs is Production Engineers ("PEs"). There are fewer of these, and more services at FB are supported by the SWEs than at Google (IME).
I got the impression that FB processes and culture were forged when the company had less than 500 employees and they never really adjusted to the greater scale. There are a lot of things that work very well. Oncall just isn't one of them. Nor is code ownership.
Note that the on-call bonus is not entirely known. I had managers try to put my team "on-call" for a product and after I explained to them how Google actually did it, they suddenly said "oh, it's not really on-call. You just have to be ready to answer the pager at any time and respond".
I was also on what was one of the most dysfunctional on-calls at the company: keeping several distributed clusters of unique business-critical MySQL instances running, which failed frequently and had an unreliable failover method. For some reason it was a joint on-call with a neighboring team, so I was responsible for systems I didn't know about or understand but which were business-critical. At times it was fun, at times it was educational, but at times it was the worst thing in the world for me.
I've experienced this as well and it's a sign of a really bad manager (IME).
No manager should give you any resistance to giving away free money from the company to their team. If they ever do you know where their loyalties lie: with their management chain and producing the appearance of efficacy.
A manager should be fighting to give the team any on-call pay they're due.
Related side story: annual bonuses were (and maybe still are) calculated based on salary, level (ie target percentage) and ratings. It quickly becomes known what the base rate is so you can calculate everything. After this, your manager has a pool of extra money and a bunch of sliders for their reports. From that pool of extra money, they can distribute it evenly, weight it towards particular people, etc.
They can even take money away in this process from some people to give it to other people.
But because the formulae are all straightforward, this should be obvious. I have seen:
- Managers take away money from some people to give it to their favorites;
- Give all the extra money to one person; and
- (This is the crazy one) Not give all the money away. That is to say they'd rather not give away this free money and return it to the company. I've literally seen this happen.
That's why I mention it: any sign of a manager not giving their team everything they can should be a massive red flag.
I'd rather not go on call than get more money. That said, I kept such a good eye on prod that my team often identified problems with roll-outs in unrelated products that we didn't have control over and managed to stop them before they broke anything related to revenue. This has a neutral effect on perf.
I had no idea you essentially get “overtime” pay for on call at google. That’s how it should be imo. I’ve avoided on call jobs for the lack of extra pay for doing more work.
The deeper you go into the stack, the more reliable things tend to get, the more mature those systems tend to be and the more likely they are supported by SRE who take things very seriously.
So if you're having an issue with Spanner, first, it's likely not a bug in Spanner. If it's an outage, somebody has probably already been paged. But if not, paging someone responsible will be answered quickly and treated seriously.
You could've unexpectedly gone over quota on something. More often than not you can alleviate that with temporary quota while you resolve your issue (by reducing your usage, getting more permanent quota or both).
Generally 1) Mitigate if possible from your service's end, while simultaneously 2) paging the dependent service's team to mitigate/resolve; and if this happens too frequently or if it was a particularly bad incident you can push the other team to 3) create a postmortem with follow-up action items if they didn't already do so.
One way to make on-call less bad, ask your employer to let you work normal 8hr days on the weekend that you're on call, in return for having 2 weekdays off the next week. That way you don't have any weekend days that are lost due to having to be available. I was allowed to do that, it was a win-win. Useful for them to have really thorough extra cover over weekend, then 2 uninterrupted days in the hills for me. ;)
Being on call is the bane of my existence. It’s also entangled with the issue that companies would never consider people working a night shift even tho tons of devs prefer to code at night (and they could actually get things done without constant context switching). How is it that we are expected to be on slack, checking emails, and on call after hours but employers rarely pay for cell phones or after hours labor?
"I don't have a phone number, it's data only and my house doesn't have good reception" is what a number of ex-colleagues used when asked to be on-call at companies that asked for it, but gave you nothing for it.
For companies that paid you overtime, it often starts with guilt, "well the rest of the team did it last week/month/year, it's your turn". If they don't incentivize you enough to do it, don't do it.
One of the main reasons I quit being a professional software developer was the expectation of being on call. My sleep and my free time with my family are more important than any job. No matter how much I like the normal work, and how much they pay me, I'm done working jobs that insert themselves into my home life and ESPECIALLY that wake me up in the middle of the night. EVER.
I worked at a place in the 2000's where our main application leaked like a sieve due to not releasing memory from a C++ framework, and two people had to take turns every other night restarting the app (actually it was chopped up into 20 different apps that had to be started in order by hand) every two hours or so. I can't imagine they got much sleep on their night.
> It took a few months to shake that Pavlovian association
It took much, much longer to shake off "J.S. Bach Badinerie BWV 1067" :-D
https://youtu.be/JvxeiTq9bqw?t=27
I can say that for years I would wake up from the deepest sleep in seconds when hearing this melody.
I wish all PMs would go on call for a week at least. The OP's stint with on-call will be quite useful in his career since he can now view infrastructure more holistically and intuitively. Then if the group is doing sprints, the "firefighting" will more easily get prioritized.
I worked for one company that enforced on-call for the entire team. For a reason that's out of scope for this comment, it didn't apply to me, though it should have by the normal rules.
I loved working for the company, but at the same time I would have totally despised the on-call system as implemented.
The main problem was that I was working on client apps. My focus was an Android app, to be specific. There was a completely separate web team that developed the backend; we not only didn't work on the backend, we were prohibited from working on the backend. I never even saw the backend code. We even asked to develop part of the backend we needed at one point, and they refused to let us, telling us that we didn't understand their security requirements and therefore couldn't contribute.
And from what I heard from people who were in the on-call rotation, every single 2AM call was from some badly designed alarm. Designed by their team.
Yes, every one of those alarms was fixed. Some may have been only "fixed" the first time, but I got the idea that each did eventually get adjusted to not be completely spurious.
But what really offended my sensibilities was that the backend team was pushing for the client team to be on-call to cover for what seemed to me to be profoundly poor alarm definition. They were eager to get others on board to cover the on-call rotation because being on-call was such a nightmare.
Another comment [1] points out that the median number of alerts in a week should have been zero. It wasn't close, from what I could tell. And the whole "make them eat their own dog food" approach of putting the people on-call who are actually responsible for writing the code was broken by the practice of including people only loosely associated with the code (as in, we used it) in suffering the consequences of writing the code (or designing the alarms) badly.
In general, if I've broken something that's affected a site or product, I'm happy to fix it, even if it's after hours, though I prefer it to be on a "best efforts" basis rather than a "drop everything and work on it now" basis. What I don't want to sign up for is being roused out of bed at 2AM to fix problems caused by someone who isn't even on my team, where I wouldn't have even seen the PR that caused the problem or any of the related code, and there's absolutely no way I could have prevented it.
This one time I was hired as a tier-3 Unix & Linux support engineer at a managed-hosting provider who shall not be named. This was back in my prime, I'd already had 10 ~ 15 years of industry experience, so I was hired to deal with the really nasty problems that percolated up the chain, and to solve those problems via an ongoing process-improvement loop. Over time my job got easier and easier, because lower-tier engineers gained training and better standards, procedures, etc... I was on a two person team, we traded the on-call duty every other week. One day the other person left the company, and that began a comedy of errors...
We tried to fill the position both internally and externally, but it stayed mostly vacant. Meanwhile I was covering on-call 24/7 on a tentative basis, until we hired a permanent replacement. Welp, long story short... I fell victim to my own success. Management observed that there was no apparent disruption to the tier-3 area of operations, and that was mostly true, so they decided to eliminate the redundancy. BIG MISTAKE!
I politely told management that I'd no longer be covering on-call 24/7 and would go back to every other week; that I was burnt out waiting for a replacement, unable to plan vacations or even enjoy off-hours serenity for fear of the on-call phone ringing. That didn't go very well: I was told I'd be subject to disciplinary action, and that while they would make accommodations for vacations, I had to remain on-call 24/7.
So I decided to become a party animal, every other week. As soon as I left work for the day, I'd start getting drunk. I had parties every night, or went to parties, bars, or whatever... On the weekends I'd go camping at state or national parks with one bar of signal on the on-call device, or I'd be on a boat in the middle of a lake, an hour away from my laptop back at the cabin. Stuff like that. Pretty much I'd make myself available 24/7, but there was zero assurance I'd be sober or whatever... every other week.
This one time I went into work and somebody was joking with me about something, as if I knew what they were talking about, but I had no idea. Apparently I had been engaged the night before while blackout drunk, and was semi-belligerent with the tier-2 engineer, but got whatever emergency escalated issue resolved in a most anticlimactic manner. The story I heard was something about slurring out Linux commands from a noisy bar, and threatening to drive home (drunk) to ssh into a customer's system. But whatever instruction I gave over mobile did the trick, and the company barely met the contractual SLA for that incident!
So my plan kinda backfired. Management figured out what was going on, and while I was applauded for doing my job, I was formally written up for being drunk on-call. I responded by pointing out that this was only every other week, and that my on-call rotation was documented on the company calendar. I asked whether I was never permitted to drink off-hours while employed at the company, and that got HR involved. Apparently it was not entirely legit to have me on-call 24/7, nor to intrude into my personal life to such an extent. So I was asked to sign something, an amended employment contract. I refused, and suddenly my performance reviews tanked over the next few quarters, and I was eventually let go via a round of "layoffs". The entire tier-3 was eliminated, and they went with a tier-0 ~ tier-2 hierarchy (whatever that entails). No hard feelings, I'm still in touch with many peeps.
It was at this point that I transitioned from supporting Linux to making Linux. I pursued my passion, started working full time as an open source developer, and lived happily ever after, never being on-call again. Or so I thought. Turns out software development occasionally has grindy rushes to meet deadlines, the so-called "crunch" culture, and instead of a company-provided on-call device... work peeps were calling my private number, in the middle of the night.
>I was on a two person team, we traded the on-call duty every other week. One day the other person left the company, and that began a comedy of errors...
I knew when I read "I was on a two person team" that you were about to get royally fucked. Been there myself.
I'll never forget so long as I live that "two is one and one is none". I now have a minimum team size of at least 4 developers. Simply not comfortable with anything less.
After a decade as a SWE in Europe, I have never heard of anyone in my network who needed to be on-call. Is this a US thing, or only for devops?
Why would a software engineer need to be on call ever? That just means the CICD/testing/validation pipeline sucks.
Well, in finance/trading you need 24/5 or even 24/7 human monitoring. If you don't respond to that alert, you can wake up to find that the system has lost tens of millions overnight. Core engineers usually take on-call duties, since fast diagnosis and response are critical.
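That said, the human pager is usually paired with automated limits, so the on-call engineer is diagnosing a halted system rather than one that's still bleeding. A rough sketch of the idea in Python, with made-up names and limits, not any real risk system:

    # Illustrative only: halt trading and page on-call when the day's
    # realized loss crosses a hard limit. Real risk systems are far more
    # involved; this just shows why response time still matters.
    MAX_DAILY_LOSS = 250_000  # made-up limit, in dollars

    class KillSwitch:
        def __init__(self, halt_trading, page_oncall):
            self.halt_trading = halt_trading  # e.g. cancel orders, flatten positions
            self.page_oncall = page_oncall    # e.g. hook into the paging system
            self.realized_pnl = 0.0
            self.halted = False

        def record_fill(self, pnl_change):
            self.realized_pnl += pnl_change
            if not self.halted and self.realized_pnl <= -MAX_DAILY_LOSS:
                self.halted = True
                self.halt_trading()
                self.page_oncall("Daily loss limit breached: %.0f" % self.realized_pnl)

The automation stops the bleeding; the human still has to figure out why it tripped and whether it's safe to resume, which is why core engineers end up on the rotation.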
What I have seen in finance/trading is that the required 24/7 monitoring is handled by staffing pre-arranged 24/7 shifts, not by people being on-call while sleeping between their regular 9-5 workdays.
I'm sure there are companies in Europe that have on-call shifts, but I've had several jobs now, and the only company that ever asked my team to do on-call outside of regular working hours was a US company, and we had to explain to them how German labour laws work (specifically, that you have to pay people for being on-call and that there are legally required rest periods).