Eliminating Toil

capableweb · on June 7, 2022

> If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is toil. If human judgment is essential for the task, there’s a good chance it’s not toil.21

It's fun the the engineering at Google is so great at recognizing things, while the product/"human" teams (like whoever came up with the account reviews and other parts) seems to suck so much.

If YouTube applied the same view of what should/shouldn't be automated, they could solve the problem of peoples YouTube channels being locked in front of them, even if they don't break any ToS's.

nonrandomstring · on June 7, 2022

The part I really liked is "Is Toil Always Bad?" because quite astute psychology and management ideas appear here. My takeaway is that people who like toil (and there are many zealous oversystemetisers around) are some of the most dangerous to long-term productivity.

However, I really feel for the author, because the distance between this philosophically ambitious position and the reality of using Big Tech products - which in my life are the primary source of pointless makework activity - is sad and frustrating.

Now, one cannot blame the tools for stupid policies that lead their misuse in constructing makework processes, but (to take the Heideggerian stance) they carry certain exacerbating values along with them. Technologies can be 'seductive'.

The point of departure for me, was the definition of "overheads" as justifiable makework according to some value set. Whose values, exactly? And if, in Weber's sense, bureaucracy is an unavoidable side effect of process, then the possibility to design "toil-free" systems is really about complexity management, not post-facto eliminating toil through automation, because that will only break things and introduce more toil (Which is of course the primary theme of Gall's "Systemantics").

treeman79 · on June 7, 2022

At one of our dinners, Milton recalled traveling to an Asian country in the 1960s and visiting a worksite where a new canal was being built. He was shocked to see that, instead of modern tractors and earth movers, the workers had shovels. He asked why there were so few machines. The government bureaucrat explained: “You don’t understand. This is a jobs program.” To which Milton replied: “Oh, I thought you were trying to build a canal. If it’s jobs you want, then you should give these workers spoons, not shovels.”

bombcar · on June 7, 2022

Something can be both - and effectively work as both. Almost anyone can move a shovel even if not very well, but operating heavy machinery has many more dangers.

If your question is “how do we keep a relatively uneducated workforce employed gainfully” digging a canal with shovels Amy be the way to go. Especially if you only have a fixed number of useful canals.

Robotbeat · on June 7, 2022

And that, ladies and gentlemen, goes pretty far in explaining why American infrastructure costs are so high (although I suspect the paperwork burden, NIMBYism, and “citizen voice” are just as important).

I strongly support the Biden Administration infrastructure program. BUT selling it as a “Jobs” program (at a time of near record low unemployment) makes me very wary.

nightpool · on June 7, 2022

"the definition of "overheads" as justifiable makework according to some value set". What does this even mean? How is this "a value" that's "carried" in "Big Tech" software? What part of "the reality of using Big Tech products" produces makework? Making such a broad, sweeping generalization about an entire class of make zeros sense to me. Does Facebook "justify overheads" in the same way that Netflix "justifies overheads" in the same way that Apple "justifies overheads"? Instead of just lazily blaming toil on "Big Tech", I think it's better to recognize that toil happens everywhere that people care more about short-term solutions than optimizing long-term processes—a fundamental human problem, certainly not one that's unique to the latest fad of technology companies.

nonrandomstring · on June 7, 2022

These are good questions. Here are some reading sources to help with your exploration of the subject:

https://en.m.wikipedia.org/wiki/The_Myth_of_the_Machine

https://en.m.wikipedia.org/wiki/Technics_and_Civilization

https://en.wikipedia.org/wiki/Max_Weber

https://en.wikipedia.org/wiki/Bureaucracy

https://en.wikipedia.org/wiki/Systemantics

https://en.wikipedia.org/wiki/The_Question_Concerning_Techno...

https://geezmagazine.org/blogs/entry/jacques-elluls-76-reaso...

prionassembly · on June 7, 2022

Wait. Are you me? Did I write this in my sleep?

rmbyrro · on June 7, 2022

This is by design.

You can't expect a free service to have highly trained human judgement whenever you want.

They need it to run fully autonomous. And does run flawlessly for 99% of the users, which is impressive.

I wish they offered a paid option in the 1% cases. Like an arbitration.

But that would be a cost center for them, and they don't want it.

causi · on June 7, 2022

It's not a free service with shitty customer support. The free portion, viewing Youtube, has no customer support whatsoever. The non-free portion, which provides you video hosting and revenue sharing in exchange for your labor producing content, gives you shitty customer service.

hobofan · on June 7, 2022

> I wish they offered a paid option in the 1% cases. Like an arbitration.

They don't even manage to provide adequate content reviews for the .1% (probably an even smaller percentile) of YouTube creators that bring in the big advertisment revenues for the platform. Providing good support for them would easily pay for itself, and is the single biggest issues creators have been raising for years but nothing has substantially changed.

capableweb · on June 7, 2022

Maybe it should be the other way around, rather than putting the error on peoples expectation of what a free service can do, put the error on the operators expectation what kind of free services they can run?

If you can't run a service without arbitrarily banning people (and inadvertently affecting their livelihood [ignoring the question if people should put their livelihood up to a free service in the first place]), maybe you need to adjust the service you're offering people?

There is no requirement for them to run it fully autonomously, because obviously, they are unable to do so without punishing some of their users without any recurse. That's probably a sign it's really hard to run it autonomously, and they need to do something else.

If you don't want your free service to be a cost center, then maybe change it to not be?

qiskit · on June 7, 2022

> This is by design.

Partly due to a dominant market share. If there was more competition, they'd more about user experience.

vkou · on June 7, 2022

> It's fun the the engineering at Google is so great at recognizing things, while the product/"human" teams (like whoever came up with the account reviews and other parts) seems to suck so much

There's ~4 orders of magnitude difference in the scope of work between the two.

sokoloff · on June 7, 2022

(In Google’s view), perhaps the need for careful review of these actions has been designed away.

stevenjgarner · on June 7, 2022

An easy but poor design choice is that locking the user channel will have no immediate impact on revenue, yet it will lessen the requirement for human intervention (due to "Report", copyright infringement claims, etc).

planetsprite · on June 7, 2022

Same can be said for most organizations. Google's privileged hyper-enlightened culture of organizational "niceness" rests on the bedrock of millions of man-hours of near perfect, rigorously vetted engineering.

I think the main cause of this is that humans tend to become very wrong and misguided without accurate feedback. An engineer who is wrong is proven wrong immediately; their code fails, doesn't pass tests, breaks, etc. A people-person design-minded product-guru takes years to be proven wrong, and even then their tactical obfuscation of reality can morph who ends up being blamed.

throwaway892238 · on June 7, 2022

You should probably not automate all toil. You should only automate toil when the toil cost/effect is more burdensome than the cost of automating it. All automation has a cost, and may or may not create value. Automation should have a positive and timely return on investment. If the ROI is 10 years down the road, you probably shouldn't automate it (yet). If there is a cheaper way to deal with the toil, explore that avenue first.

Several times in my career I worked on projects to reduce toil. Sometimes the project would fail because the time it took to work on them went well past the cost saving estimation. Sometimes they would be completed, but the value created was far less than their cost. And sometimes automation wasn't even the solution, and we just needed to change our process or system, or do some other manual thing that reduced the toil cost. Sometimes we chose to automate toil because we were afraid to take on a larger project we knew would make the toil unnecessary, so we paid for the automation and then later for rebuilding everything. Or toil was used as an excuse to justify a project that didn't really have to do with toil.

One of my biggest mistakes as an engineer was assumptions I made about my work that ended up creating more waste than value. Talk to an outsider about your plans and why you're doing it, take their advice seriously. And if your automation is optional, make sure you have buy-in before you start working on it; i've sunk months on things that nobody ended up using.

A great way to automate toil is incrementally. Typically you have a runbook with step-by-step instructions, and over time you automate one step, then another, etc. The investment is minimal and gradual, it can change over time, and you can target the costliest parts of the toil, optimizing value.

ensignavenger · on June 7, 2022

Some times, the process of automating something provides enough positive returns in and of itself. For example, you might learn how to do new things along the way. Or you might be able to give the task to some one new so they can learn.

Or maybe in the process of automating, you discover new things about the process itself and can improve it.

I agree that one should be careful to consider work priorities and return on investment, but there are often hidden returns to something like this that leaders don't understand and take into full account.

snovv_crash · on June 7, 2022

This misses the induced demand effect of dramatically reducing the cost of the task. There are many things that only happen occasionally because they are annoying and slow. If you reduce the friction suddenly everyone does it 10x per day and the whole company benefits from faster feedback loops.

majormajor · on June 7, 2022

> This misses the induced demand effect of dramatically reducing the cost of the task. There are many things that only happen occasionally because they are annoying and slow. If you reduce the friction suddenly everyone does it 10x per day and the whole company benefits from faster feedback loops.

But that assumes the task is still valuable at doing it 10 times more a day.

One key distinction to think about might be if your task is letting you reduce cost vs increase revenue.

A task that is a "cost" - e.g. if a user wants X, we need to do Y - likely won't need to be done 10x more frequently with 10x increased value if the demand for X hasn't changed. So you make it cheaper for us to do Y when X is desired, so the margin for X is increased, which still might make a ton of business sense, but the top-line boost to profitability is limited to the original manual cost of Y.

A task that is revenue-driving - e.g. "we have to do X any time we're putting together a sales deck for a new prospect" - can have a much higher flywheel effect. Can our existing sales team potentially now bring in 4 times more clients? That could be huge, and so you've both increased margin and top line.

snovv_crash · on June 7, 2022

It doesn't need to be as valuable to still be net positive though, because now it doesn't take human time, just computer time.

Imagine for an ML product, making an accuracy report. If it's slow and required lots of human time, you might do it once a quarter for releases to important customers. If it's cheap and quick then you can run it on every CI run to check for regressions before merging code. Sure, you run it maybe 1000x more and don't get 1000x the value.

But, critically, the value is not the cost savings of not having to run it manually per quarter, the value is the more stable product and avoiding spending time bisecting a quarter of engineering work to figure out where bugs were introduced. And this was enabled by automating.

majormajor · on June 7, 2022

Sure, if the cost of automating here is < the current cost of rework and investigation, you win. And dev time is expensive, so that sort of thing is usually an easy call.

Yeah, it doesn't need to continue to increase in value linearly with repeated runs, it's the summed value that matters.

The "cost" is fuzzy too, often - e.g. time and budget spent on reliability-focused engineers or active troubleshooting rarely drops to 0 if you don't automate anything. It might just make it more expensive to react to incidents!

Maybe turn it away from a "gate" question - "should this thing be automated" - and into a prioritization one - "we could automate so many things, which ones should we do first?"

kubanczyk · on June 8, 2022

I don't know whether it's visible from where you are sitting, but what you wrote is exactly contrary to the TFA. TFA implicitly starts from "what if we forget for a minute that ROI exists, and see where that leads us". (If it seems incredibly wasteful, stay with me to the end.) And where it led them is that the people like you and me are crucial parts of Investment (the I of ROI), and the people group together. The ones that gladly do 90% toil don't like to team up with those that prefer 10% toil.

Because the problem with ROI calculation is that for some areas the ideal amount of toil would be 99% and for other 1%. For both extremes, you'll bleed valuable people, so sometimes your Investment becomes "hire new team just for that" and the rest is peanuts.

To put the thing back on its feet first create a team that takes 50% toil and give them these areas that ROI-wise require approximately 50% toil. Call the team "SRE". Create a team that takes 10% toil. Give different areas. Create a team that takes 90% toil, etc.

bushbaba · on June 7, 2022

An organization generally has fixed resources for automation investment. You should look beyond if the ROI justifies investment, to instead prioritizing the highest ROI items that are most likely to be successfully automated.

blowski · on June 7, 2022

Exactly. It's not unusual to end up with more toil on the automation than you had in the manual process.

Terretta · on June 7, 2022

That's a different problem

baobabKoodaa · on June 7, 2022

Why spend 5 minutes manually toiling on a task, when you can spend 6 hours failing to automate it?

YZF · on June 7, 2022

https://xkcd.com/1205/

carlsborg · on June 7, 2022

> Among the many reasons why too much toil is bad

They missed the big one : human error is a common point of failure. Some of the big outages on GCP were due to ops configuration changes. Gitlab wiped their prod DB one time. KnightCapital suffered death by config error..etc.

pooper · on June 7, 2022

I wonder if writing (bad?) software can also be toil.

Like if I need to change the spelling or add a new configuration setting and I need to make sure to use the same spelling in three places because they are all "stringly(sic) typed", is that toil?

juancb · on June 7, 2022

If we ignore the value judgement and instead look at maintaining a sufficiently large codebase then, yes.

In the below paper an example given is migrating from one API to another. The paper describes a semantically-aware large-scale tool for refactoring a Google sized codebase using map-reduce.

Given the externally visible churn in Google products it isn't much of a stretch to imagine they have similar or worse internal churn. In fact I have heard from xooglers that it was common place to internally have competing systems in different states of development and adoption.

https://research.google.com/pubs/archive/41342.pdf

AnimalMuppet · on June 7, 2022

> human error is a common point of failure.

True. But it is also common to find that software automating the process didn't cover some corner case and you need human intervention. And it's worse if the process assumed that human intervention would never be necessary...

dekhn · on June 7, 2022

Most of my work for SRE was the opposite; I did things manually because the automated systems were guaranteed to mess up some fraction of things. At some point my managers wanted me to automate a hardware management process- I checked and it would take 6 months to deploy the code to prod. Instead, I identified all the broken machines and filed tickets manually- getting things fixed far more quickly without a high rate of false positives and churn (google's hardware repair system churns a lot).

Many of the automated systems at Google were developed by geniuses. Others, not so much, and it ended up making a lot of work for other people.

jeffbee · on June 7, 2022

I was also on the side of hands-on operations in SRE and it earned me no friends in that org to be sure. But I like to think that point of view is still basically correct. The "annealing" people have been working on their wacky automaton for more than a decade now and a critical reading of their publication reveals that it still fundamentally doesn't work.

https://www.usenix.org/publications/loginonline/prodspec-and...

dekhn · on June 7, 2022

Those are exactly the kinds of systems I had to deal with constantly breaking things. That said, I always made friends with SRE (was one for a while) because they own the memes of production.

arccy · on June 7, 2022

and sometimes they control the memes on slack...

rougka · on June 7, 2022

The memes on memegen if you want to be accurate

dusted · on June 7, 2022

While I'm not against eliminating toil, this article does not seem to consider the negative aspects of automation, such as the deskilling that happens naturally.

"This plant basically runs itself, but we do have a human present for if something goes wrong".. 50 years down the line, something goes wrong and nobody has the kind of insight and familiarity with the system that they'd have had it had been manually operated.

jamesmishra · on June 7, 2022

On the other hand, it is very good that the plant operated flawlessly for 50 years.

dusted · on June 7, 2022

Most certainly. Depending on the type of world we're in, it might also have freed people to pursue more fulfilling calls. On the flipside, it might have removed multiple jobs for people who may have had entirely satisfactory lives.

It might have been a important plant and the damages done when it breaks down and the surrounding society discovers they both cannot do without it nor rebuild/recover from it may be far worse than the "cost" of running it at a higher degree of manual operation.

Taywee · on June 7, 2022

> Depending on the type of world we're in, it might also have freed people to pursue more fulfilling calls. On the flipside, it might have removed multiple jobs for people who may have had entirely satisfactory lives.

I feel like the fetishization of the "job" is detrimental to society and individuals. I've been in a situation where I could really quickly develop automation that would replace a team of 15 people hammering on spreadsheets all day. The project was canned because we didn't want to put 15 people out of work. The way I see it, there were some distinct outcomes for possible decisions:

1: The automation project is canned. Nothing changes.

2: I get the automation done, 15 people are out of jobs. They have to seek new employment, or otherwise find out what to do with their lives. They are displaced in an unpleasant way, and the lives of many of them may be heavily damaged for at least the short term. The company saves over a million dollars a year.

3: I get the automation done, the company re-tasks the workers to other parts of the company. They aren't out of jobs, but they have to adjust and re-train and learn new things. Some are happy, most are annoyed, but nobody is seriously hurt. Some people are unable to adjust and maybe eventually get let go.

4: The automation is done, and the company continues paying the employees who now are being paid to do nothing at all. They can relax, or work on hobbies, or whatever they like.

The interesting part for me is that 2 has the greatest advantage for the company, 3 is a good compromise, 4 is the best scenario for the employees without even harming the company more than the cost of my time (which isn't really all that expensive in the big term; less than a day's pay for all the employees together), and 1 is the worst case scenario for everybody, but they chose 1 because the "job" is sacred. 1 and 4 are nearly the same from the perspective of the company, so rather than improving anything, they choose to be inefficient, wasteful welfare.

People want welfare to exist, but they don't like the idea of "freeloaders", so they force people to do useless, Sisyphean work. It's extraordinarily wasteful.

Maybe life could be better for everybody if losing your job couldn't completely destroy your life. We could automate things and improve things faster without having to hold back progress because "people could lose their jobs", we could dismantle destructive industries that currently are kept afloat because "Hey, that's 80 thousand jobs!" People could leave jobs that they feel are ethically wrong, rather than being trapped into doing something they think is evil because they need to feed their families.

wccrawford · on June 7, 2022

>4: The automation is done, and the company continues paying the employees who now are being paid to do nothing at all. They can relax, or work on hobbies, or whatever they like.

I'm not sure this is a great outcome. It soft-locks them into their current position without any great motivation to better themselves. It also creates resentment in those around them. When the company does eventually let them go, only the most forward-thinking will still have useful job skills and can find another job.

Sure, they could quit or request a different job, but how many people can recognize those mental problems coming ahead of time and avoid it? Most people are going to be fat and happy and do nothing to get ahead. I don't even blame them. It'd be incredibly tempting for me, too. In fact, since I've been at this company so long and basically stopped growing, I kind of already have fallen into that trap. It's a pretty comfy trap since I like my work and I get paid pretty well for it. It's just not forward-looking at all.

Firmwarrior · on June 7, 2022

Tangential: It looks to me like you've got some impostor syndrome creeping up on you. IMO you should do some game modding or addon development. It's a piece of cake, you'll look like a genius to your gaming buddies, and you'll have fun in the process

UnpossibleJim · on June 7, 2022

I mean this is the idea of Universal Basic Income, but with that also comes it's own set of problems.

[1] Who, then, programs and builds the next generation of automation... innovation would have to continue. These people would still go in for the grind, I guess it would be for more money but at some point would the tiered tax rates make it worth it?

[2] This would only be for certain sectors of jobs. Waitstaff is still going to exist, and a whole realm of the service industry. This would just lead to an exploited workforce, or striated (more striated) population filled in by immigrant labor and other "invisible" labor groups.

[3] Our dependence on machine infrastructure becomes ultimately vulnerable to attack from foreign intelligence and private actors and we are far from able to defend against it.

EDIT:

for the record, I have argued for UBI and am not against it in the past. I am still for it but not on a huge scale. The UBI that I have argued for would be a $1000/month UBI. This would be a replacement/supplement for SSI, disability, childcare tax credit, school supplies, food stamps, etc. It is not enough to live on but a support system for emergencies and savings.

sky-kedge0749 · on June 7, 2022

A big problem is that most people don't know how to handle that kind of freedom. The social world most people exist in won't accept it. I don't know your life but I think a lot of people who suggest things like your [4] imagine a sort of extended vacation, and maybe for three or six months or so, it can be. For the long or indefinite term a better analogy would be prison.

duggan · on June 7, 2022

A corollary to 1, however is:

Company B does the automation project. They can offer their services for a million dollars a year less than Company A. Company A is eventually outcompeted.

That's market capitalism!

ladyattis · on June 7, 2022

I think this is a good scenario to consider especially now that we're running into this issue just from trying to build passenger rail networks in California where the expertise for it isn't domestic at all.

otter-rock · on June 7, 2022

That's like claiming that the government services that still run on cobol should have been manual office jobs instead.

nightpool · on June 7, 2022

I mean, maybe they should have been? Just comparing apples to apples (it takes X amount of COBOL code to the same thing that Y office jobs would have done), maintaining the office job over a long period of time might be easier than finding and retaining COBOL engineers, because you get more active practice at it—people naturally want to streamline their work, new people always have to be onboarded, etc. I don't think things are ever so clear-cut apples-to-apples though—there are things you can do with COBOL that you couldn't feasibly do with any number of office workers, and there are things you can do with office workers that even the best COBOL probably isn't going to handle well.

fsociety · on June 7, 2022

No no no you can’t replace the COBOL because then the COBOL programmers won’t have jobs. Perhaps the government should offer a basic amount of support for its citizens so that this isn’t a problem.

dikei · on June 7, 2022

In some organizations, including mine, toil is sometimes "reduced" by saying "not my problem" and push it to other teams. It sucks to be on the receiving end of it.

kqr · on June 7, 2022

It depends in which direction it's being pushed. Backpressure is a useful signal that propagates the economic cost of a decision closer to the entity that judges its benefits, resulting in more coherent action.

Tossing the hot potato over to just about anyone else is bad.

noirbot · on June 7, 2022

I've certainly been on teams that are on the other side of this - we're staffed fairly low, and sensitive to toil work because of it, and thus end up pushing things to some other teams to try to help them reduce our toil. Those teams are often much larger, and have just been doing things manually for years for various reasons, and are fine with it. The only thing we can do is try to make it enough of a problem for them that they'll decide to help us out so we're not drowning in work that is normal for them.

basisword · on June 7, 2022

There's nothing worse than this. I had to chase a bug fix (that required a two character string change in the code) for about 6 months as i got a mix of 'not my problem' and 'not a priority'. I could have made the change but, no, the team that owned the product had to do it. Try explaining that to the customer affected.

HL33tibCe7 · on June 7, 2022

And in some organizations, the toil just never gets done at all

trhway · on June 7, 2022

It naturally applies not only to SRE. Toil is a great equalizer - if 80% of say development work is basically toil then a 10x developer is not really that distinguishable from nor useful more than an 1x (Amdahl's law so to speak :) Amount of toil (and not say failed projects/etc.) seems to be a one of the main factors separating the companies with revenues $2M/year/head like Google from the ones with mere $300K/year/head, and one of the best things a mid/low performing company can do is to reduce toil - though usually on practice any such attempt means something like MBA-style "efficiency improvement" measures and processes which add even more toil.

bibliographer · on June 7, 2022

I am somewhat skeptical of the claim that "one of the best things a mid/low performing company can do is to reduce toil". It feels that this principle is dependent on the context to the extent of not being useful guidance anymore.

For instance, if the company is pre-product-market fit reducing toil seems like the wrong investment; doing stuff manually can be the way to go until you find what works (unless the effort investment in toil reduction is trivial).

If the company has reached something approximating product-market fit, reducing toil still ought to be weighed against the other priorities. That (as all technical debt reduction) can do wonders to productivity, but alternatives (e.g. pushing for a new feature) may as well be the better call.

quickthrower2 · on June 7, 2022

I think it is market dominance/monopoly that gives the big revenue per head rather than lack of toil.

TimPC · on June 7, 2022

The difference between $300k/head revenue companies and $2M/head revenue companies is far larger than amount of toil and efficiency measures/overhead. Most notably, there tends to be a fundamental difference in who they hire, how they hire them and how they compensate them. It might even make sense for $100k/year engineers to do more toil than $400k/year engineers.

feteru · on June 7, 2022

That's interesting, I'd say that's one of the least likely fundamental differences, especially with these large 10k+ employee companies. Those differences in hiring should average out when compared to things like timing entering their markets, monopolistic advantages, regulatory capture that are external to the company.

VikingCoder · on June 7, 2022

I'm so glad that toil has been automated.

Now all I need to do is learn this new Domain Specific Language and find all of the exact configuration parameters to express my specific needs. Oh except this tool has leaky abstractions under it, and those tools also have their own DSLs and configuration parameters. And the tools under those do, too. It's all turtles, all the way down.

wjdp · on June 7, 2022

Was confused at first as toil is also 'time off in lieu'. AKA unpaid overtime, where you're not paid but get compensated with holiday.

stevenjgarner · on June 7, 2022

I had never heard that expression before [1]. Is 'TOIL' cultural, or is it just that I am inexperienced having never been an employee?

[1] https://factorialhr.com/blog/time-in-lieu-explained/

OJFord · on June 7, 2022

I assume it's widespread in more developed nations/places with familiar employee rights, since it's pretty basic really - you worked when you shouldn't, here's time off to compensate.

Unless you just meant the acronym - that I'm aware of but wouldn't say it's as common as the concept. To me 'toil' is firstly an English word, secondly an SRE term, and only distantly third 'time off in lieu'.

dontlaugh · on June 7, 2022

It's definitely common in the UK.

basisword · on June 7, 2022

Time off in lieu or the acronym? I've had time off in lieu but I have never seen that acronym before.

Invictus0 · on June 7, 2022

It's also a thing in canada

black_puppydog · on June 7, 2022

Good read, good reasoning.

Just a bit sad that someone at Google seems to have read this and focused on the "Automatable" part going "but that includes basically everything we do!"

cf youtube/contentId, cf account blocking, cf customer "support", ...

hoffs · on June 7, 2022

Content review is not SRE task

black_puppydog · on June 7, 2022

Which is why this kind of writing should have probably been kept in the SRE department, lest some MBA type sees it...

reedlaw · on June 7, 2022

This reads like a positive framing of Jacque Ellul's critique of technique:

> The characteristics of the technical phenomenon are Autonomy, Unity, Universality, Totalization. Technique obeys a specific rationality. The characteristics of technical progress are self-augmentation, automization, absence of limits, casual progression, a tendency toward acceleration, disparity, and ambivalence. [1]

Supposing the harm Google does (e.g. ambivalence towards individuals harmed by algorithms) is a direct result of this totalizing impulse, maybe it's time to question some of the fundamental assumptions present within.

1. https://ellul.org/themes/ellul-and-technique/

j7ake · on June 7, 2022

This article was nice. I wonder if it can be generalized to careers in general ?

Long, satisfying careers often involve proactive, design-oriented approach rather than purely reactive.

The only way to make grunge work an entire career would be if you’re constantly doing something for the first or second time, eg artists, novelists.

Even scientists, they can initially discover something significant, but they keep repeating the work on the same topic without more depth or breadth, the work will become tool.

factsaresacred · on June 7, 2022

Knew I recognized some of this writing before. This book is quoted in an annual letter[0] from Zack Kanter which is also worth a read:

> Eliminating toil allows people to focus on the inherent complexity of the difficult, interesting problems at hand, rather than the incidental complexity caused by choices made along the way.

> Toil can be eliminated...by drawing the system boundary a bit differently. When we use an external service instead of an external library, we’re moving the code outside of our system – thereby outsourcing the entropy-fighting toil to some third party. Not our entropy, not our problem

[0] https://www.stedi.com/blog/excerpts-from-the-annual-letter

natly · on June 7, 2022

Engineers automating themselves. This is why we should be kinda scared of software innovation stagnating. If we don't work on innovation we don't really have a purpose and a job.

krageon · on June 7, 2022

Things are always breaking, everywhere. The people analysing and fixing that are engineers too, but what they do is not innovation. It's maintenance. Nothing wrong with that.

jimbokun · on June 7, 2022

Having a job where I am required to innovate every day, almost by the job definition, is why I like being a software engineer.

lamontcg · on June 7, 2022

Coming from a software engineering perspective there is a certain amount of toil which is impossible to automate away. CI break-fix issues often depend on the surface area of your software as it interfaces with third parties, including the CI system itself. In some cases that surface area can be large and break-fix takes up a considerable amount of time, but that toil is not _repetitive_ and is _necessary_ table stakes based on the system.

And this is after having someone who is extremely aggressive with automation and empowered to do whatever they like to reduce that surface area working on the system. I've taken codebases and hacked out 60% of the lines of code in order to remove brittle external surface area along with unnecessary requirements and contain the project better within its own boundaries and stop repetitive issues. I've taken clever ideas that someone had 5+ years ago out behind the barn and shot them in order to reduce total surface area.

But people can walk into an area with a lot of toil going on and go "oh, I know all the strategies on how to reduce this, I will explain to these people who clearly aren't as clever as me how to do it" without realizing that there's often a minimum level of toil for a project which you can't effectively reduce. There's a nonzero vacuum expectation value of toil in any project, and in some cases it can be quite large. Inherently.

I don't know how many managers I went through who would come and decide to document all the different failures we were having and spreadsheet them and look for the patterns to address them. And every week there would be 2-3 that would come up and they'd struggle with the fact that there was really no pattern, other than that the project inherently touched many different third parties, because it really HAD to, and that those third parties would change, which would then force interrupt driven toil.

There's some point where you just have to hire more people and spread it out. There's no magical incantation to manage your way out of additional headcount.

And I don't think the OP article even touched on re-enginering to reduce surface area and brittleness. Automation isn't the only answer to toil. You can automate restarting a service if it crashes, but its always better to just fix the bug (which may involve fixing architectural issues) and make it stop crashing in the first place.

kubanczyk · on June 8, 2022

Is the "CI break-fix" the term you wanted to use? It has only a handful of hits on the Internet, and your post is the first one. It doesn't seem to be related the contractual https://en.wikipedia.org/wiki/Break/fix

But I don't have any term ready for the very familiar category of problems you've described, well maybe except CI business-as-usual :)

tmp_anon_22 · on June 7, 2022

> Eliminating Toil

With 150,000 employees.

hinkley · on June 9, 2022

And about 4 million servers, give or take a million.

zx8080 · on June 7, 2022

The amount of time needed for a process requiring a number of code reviews, approvals for code style and and architecture ones.

Eliminating toil costs lot of time from every engineer.