My workplace currently has a similar problem where a resource leak can be greatl...

DrBazza · 2024-11-13T12:10:34 1731499834

Sounds like process-level garbage collection. Just kill it and restart. Which also sound like the apocryphal tale about the leaky code and the missile.

"This sparked and interesting memory for me. I was once working with a customer who was producing on-board software for a missile. In my analysis of the code, I pointed out that they had a number of problems with storage leaks. Imagine my surprise when the customers chief software engineer said "Of course it leaks"

He went on to point out that they had calculated the amount of memory the application would leak in the total possible flight time for the missile and then doubled that number. They added this much additional memory to the hardware to "support" the leaks. Since the missile will explode when it hits it's target or at the end of it's flight, the ultimate in garbage collection is performed without programmer intervention."

https://x.com/pomeranian99/status/858856994438094848

eschneider · 2024-11-13T15:36:01 1731512161

At least with the missile case, someone _did the analysis and knows exactly what's wrong_ before deciding the "solution" was letting the resources leak. That's fine.

What always bothers me, is when (note, I'm not saying this is the case for the grandparent comment, but it's implied) people don't understand what exactly is broken, but just reboot every so often to fix things. :0

For a lot of bugs, there's often the component you see (like the obvious resource leak) combined with subtle problems you don't see (data corruption, perhaps?) and you won't really know until the problem is tracked down.

xelamonster · 2024-11-13T17:46:01 1731519961

That's super interesting and I love the idea of physically destructive GC. But to me that calculation and tracking sounds a lot harder than simply fixing the leaks :)

braggerxyz · 2024-11-13T13:53:06 1731505986

> It's shockingly stable. So much so that resolving the root cause isn't considered a priority and so we've had this running for months.

The trick is to not tell your manager that your bandaid works so well, but that it barely keeps the system alive and you need to introduce a proper fix. Been doing this for the last 10 years and we got our system so stable that I haven't had a midnight call in the last two years.

tfandango · 2024-11-13T14:55:43 1731509743

Classic trick. As a recent dev turned manager, these are the kind of things I've had a hard time learning.

pmarreck · 2024-11-13T14:35:34 1731508534

Heroku reboots servers every night no matter what stack is running on them. Same idea.

The problem is that you merely borrowed yourself some time. As time goes on, more inefficiencies/bugs of this nature will creep in unnoticed, some will perhaps silently corrupt data before it is noticed (!), and it will be vastly more difficult at that point to troubleshoot 10 bugs of varying degrees of severity and frequency all happening at the same time causing you to have to reboot said servers at faster and faster intervals which simultaneously makes it harder to diagnose them individually.

> It's shockingly stable.

Well of course it is. You're "turning it off and then on again," the classic way to return to a known-good state. It is not a root-cause fix though, it is a band-aid.

lanstin · 2024-11-13T18:34:44 1731522884

Also, it means you are married to the reboot process. If you loose control of your memory management process too much, you'll never be able to fix it absent a complete rewrite. I worked at a place that had a lot of (c++) CGI programs with a shocking level of disregard for freeing memory, but that was ok because when the CGI request was over the process restarted. But then they reused that same code in SOA/long lived services, but they could never have one worker process handle more than 10 requests due to memory leaks (and inability to re-initialize all the memory used in a request). So they could never use in-process caching or any sort of optimization that long-lived processes could enable.

pmarreck · 2024-11-13T20:21:48 1731529308

I never considered "having to reboot" as "introducing another dependency" (in the sense of wanting to keep those at a minimum) but sure enough, it is.

Also, great point about (depending on your architecture) losing the ability to do things like cache results

andrewf · 2024-11-14T01:27:28 1731547648

You kinda want some machines to reboot less frequently than others, so issues don't creep up on you.

You also want some machines to reboot much more frequently than others, so you catch boot issues before they affect your entire fleet.

lanstin · 2024-11-14T01:47:26 1731548846

Not rebooting after software upgrade is an oft repeated mistake.

netdevnet · 2024-11-13T09:11:44 1731489104

> It's shockingly stable. So much so that resolving the root cause isn't considered a priority and so we've had this running for months.

I don't know why my senses tell me that this is wrong even if you can afford it

Retric · 2024-11-13T13:56:37 1731506197

> I don't know why my senses tell me that this is wrong

The fix is also hiding other issues that show up. So it degrades over time and eventually you’re stuck trying to solve multiple problems at the same time.

pmarreck · 2024-11-13T14:37:34 1731508654

^ This is the problem. Not only that, solving 10 bugs (especially those more difficult nondeterministic concurrency bugs) at the same time is hideously harder than solving 1 at a time.

As a Director of Engineering at my last startup, I had an "all hands on deck" policy as soon as any concurrency bug was spotted. You do NOT want to let those fester. They are nondeterministic, infrequent, and exponentially dangerous as more and more appear and are swept under the rug via "reset-to-known-good" mitigations.

crabbone · 2024-11-13T10:57:26 1731495446

Guys might be looking to match the fame of the SolarWinds.

whatever1 · 2024-11-13T10:10:04 1731492604

I think this is a prime example of why the cloud won.

You don’t need wizards in your team anymore.

Something seems off in the instance? Just nuke it and spin up a new one. Let the system debugging for the Amazon folks.

chronid · 2024-11-13T12:31:00 1731501060

This has been done forever. Ops team had cronjobs to restart misbehaving applications out of business hours since before I started working. In a previous job, the solution for disks being full on a VM on-prem (no, not databases) was an automatic reimage. I've seen scheduled index rebuilds on Oracle. The list goes on.

braggerxyz · 2024-11-13T13:49:55 1731505795

> I've seen scheduled index rebuilds on Oracle

If you do look into the Oracle dba handbook, scheduled index rebuilds are somewhat recommended. We do it on weekends on our Oracle instances. Otherwise you will encounter severe performance degredation in tables where data is inserted and deleted at high throughput thus leading to fragmented indexes. And since Oracle 12g with ONLINE REBUILD this is no problem anymore even at peak hours.

xeromal · 2024-11-13T16:41:08 1731516068

Rebooting Windows IIS instances every night has been a mainstay for most of my career. haha

ComputerGuru · 2024-11-14T01:38:03 1731548283

I’ve got an IIS instance pushing eight years of uptime… auto pool recycling is disabled.

znpy · 2024-11-13T12:06:24 1731499584

Amazon folks won’t debug your code though, they’ll just happily bill you more.

snicker7 · 2024-11-13T13:47:57 1731505677

The point is not to spend time frantically fixing code at 3 AM.

Gud · 2024-11-13T16:23:00 1731514980

This is not exactly a new tactic, and not something that would have to have been implemented without any cloud solution. A randomized 'kill -HUP' could do the same thing, for example.

l33t7332273 · 2024-11-13T14:57:29 1731509849

Amazon needs wizards then.

rothron · 2024-11-13T10:56:28 1731495388

This fix means that you won't notice when you accumulate other such resource leaks. When the shit eventually hits the fan, you'll have to deal with problems you didn't even knew you had.

cryptonym · 2024-11-13T09:32:34 1731490354

People will argue you should spend time on something else once you put bandaid on a wooden leg.

You should do proper risk assessment, such bug may be leveraged by an attacker, that may actually be a symptom of a running attack. That may also lead to data corruption or exposure. That may mean some part of the system are poorly optimised and over-consuming resources, maybe impacting user-experience. With a dirty workaround, your technical debt increases, expect more and more random issues that requires aggressive "self-healing".

kenhwang · 2024-11-13T10:19:32 1731493172

It's just yet another piece of debt that gets prioritized against other pieces of debt. As long as the cost of this debt is purely fiscal, it's easy enough to position in the debt backlog. Maybe a future piece of debt will increase the cost of this. Maybe paying off another piece of debt will also pay off some of this. The tech debt payoff prioritization process will get to it when it gets to it.

cryptonym · 2024-11-13T10:41:54 1731494514

Without proper risk assessment, that's poor management and a recipe for disaster. Without that assessment, you don't know the "cost", if that can even be measured. Of course one can still run a business without doing such risk assessment and poorly managing technical debt, just be prepared for higher disaster chances.

Cthulhu_ · 2024-11-13T09:08:07 1731488887

There's nothing as permanent as a temporary solution.

netdevnet · 2024-11-13T09:12:17 1731489137

Production environments are full of PoCs that were meant to be binned

JSDevOps · 2024-11-13T07:43:59 1731483839

This sounds terrible

forkerenok · 2024-11-13T08:55:49 1731488149

If you squint hard enough, this is an implementation of a higher order garbage collection: MarkNothingAndSweepEverything.

There, formalized the approach, so you can't call it terrible anymore.

crabbone · 2024-11-13T11:00:12 1731495612

Oh no it isn't. Garbage collector needs to prove that what's being collected is garbage. If objects get collected because of an error... that's not really how you want GC to work.

If you are looking for an apt metaphor, Stalin sort might be more in line with what's going on here. Or maybe "ostrich algorithm".

zoky · 2024-11-13T11:45:53 1731498353

I think it’s more like Tech Support Sort, as in “Try turning it off and on again and see if it’s sorted”.

mech422 · 2024-11-13T19:29:03 1731526143

LOL - I like that one! :-)

nukethegrbj · 2024-11-13T12:35:45 1731501345

>Garbage collector needs to prove that what's being collected is garbage

Some collectors may need to do this, but there are several collectors that don't. EpsilonGC is a prime example of a GC that doesen't need to prove anything

crabbone · 2024-11-13T14:04:19 1731506659

EpsilonGC is a GC in the same sense as a suitable-size stick is a fully automatic rifle when you hold it to your shoulder and say pew-pew...

I mean, I interpret your comment to be a joke, but you could've made it a bit more obvious for people not familiar with the latest fancy in Java world.

rakoo · 2024-11-13T16:08:50 1731514130

To be fair this is what the BEAM vm structures everything on: If something is wonky, crash it and restart from a known ok state. Except when BEAM does it everyone says it's brilliant

ElevenLathe · 2024-11-13T21:54:21 1731534861

It's one thing to design a crash-only system, and a quite different to design a system that crashes all the time but paper over it with a cloud orchestration layer later.

rakoo · 2024-11-14T18:25:01 1731608701

I don't see the fundamental difference. Both systems work under expected conditions and will crash parts of it if the conditions don't happen. The scales (and thus the visibility of bugs) change, the technologies change, but the architecture really doesn't. Erlang programs are not magically devoid of bugs, the bugs are just not creating errors

ElevenLathe · 2024-11-14T23:44:02 1731627842

I understand this perspective but a BEAM thread can die and respawn in microseconds but this solution involves booting a whole Linux kernel. The cost of the crash domain matters. Similarly, thread-per-request webservers are a somewhat reasonable architecture on unix but awful on Windows. Why? Windows processes are more expensive to spawn and destroy than unix ones.

Jnr · 2024-11-13T16:12:38 1731514358

I am running an old statically compiled perl binary that has a memory leak. So every day the container is restarted automatically so I would not have to deal with the problem. It has been running like this for many many years now.

pronoiac · 2024-11-13T14:53:56 1731509636

I guess this works right up until it doesn't? It's been a while, but I've seen AWS hit capacity for a specific instance size in a specific availability zone. I remember spot pricing being above the on-demand pricing, which might have been part of the issue.

hughesjj · 2024-11-14T09:38:44 1731577124

Yup, don't want to get ICEd out of anything.

Also, sometimes the management API goes out due to a bug/networking issue/thundering herd

anal_reactor · 2024-11-13T07:47:08 1731484028

I've realized that majority of engineers have no critical thinking, and are unable to see things beyond their domain of speciality. Arguments like "even when accounting for potential incident, your solution is more expensive, while our main goal is making money" almost never work, and I've been in countless discussions where some random document with "best practices", whatever they are supposed to be, was treated like a sacred scripture.

MathMonkeyMan · 2024-11-13T07:57:53 1731484673

We are dogmatic and emotional, but the temptation to base your opinions on the "deeper theory" is large.

Pragmatically, restart the service periodically and spend your time on more pressing matters.

On the other hand, we fully understand the reason for the fault, but we don't know exactly where the fault is. And it is, our fault. It takes a certain kind of discipline to say "there are many things I understand but don't have the time to master now, let's leave it."

It's, mostly, embarrassing.

keeganpoppen · 2024-11-13T08:15:19 1731485719

"certain kind" of discipline, indeed... not the good kind. and while your comment goes to great pains to highlight how that particular God is dead (and i agree, for the record), the God of Quality (the one that Pirsig goes to great lengths to not really define) toward which the engineer's heart of heart prays that lives within us all is... unimpressed, to say the least.

raverbashing · 2024-11-13T08:54:51 1731488091

Sure, you worship the God of Quality until you realize that memory leak is being caused by a 3rd party library (extra annoying when you could have solved it yourself) or a quirky stdlib implementation

Then you realize it's a paper idol and the best you can do is suck less than the average.

Thanks for playing Wing Commander!

mech422 · 2024-11-13T19:33:46 1731526426

>> Thanks for playing Wing Commander!

captain america voice I got that reference :-)

c0balt · 2024-11-13T08:58:06 1731488286

> "certain kind" of discipline, indeed... not the good kind.

Not OP but this is a somewhat normal case of making a tradeoff? They aren't able to repair it at the moment (or rather don't want/can't allocate the time for it) and instead trade their ressource usage for stability and technical debt.

keeganpoppen · 2024-11-13T08:11:43 1731485503

that's because the judge(s) and executioner(s) aren't engineers, and the jury is not of their peers. and for the record i have a hard time faulting the non-engineers above so-described... they are just grasping for things they can understand and have input on. who wouldn't want that? it's not at all reasonable for the keepers of the pursestrings to expect a certain amount of genuflection by way of self-justification. no one watches the watchers... but they're the ones watching, so may as well present them with a verisimilitudinous rendition of reality... right?

but, as a discipline, engineers manage to encourage the ascent of the least engineer-ly (or, perhaps, "hacker"-ly) among them ("us") ...-selves... through their sui generis combination of learned helplessness, willful ignorance, incorrigible myopia, innate naïvete, and cynical self-servitude that signify the Institutional (Software) Engineer. coddled more than any other specialty within "the enterprise", they manage to simultaneously underplay their hand with respect to True Leverage (read: "Power") and overplay their hand with respect to complices of superiority. i am ashamed and dismayed to recall the numerous times i have heard (and heard of) comments to the effect of "my time is too expensive for this meeting" in the workplace... every single one of which has come not from the managerial class-- as one might reasonably, if superficially, expect-- but from the software engineer rank and file.

to be clear: i don't think it's fair to expect high-minded idealism from anyone. but if you are looking for the archetypical "company person"... engineers need look no further than their fellow podmates / slack-room-mates / etc. and thus no one should be surprised to see the state of the world we all collectively hath wrought.

resize2996 · 2024-11-13T16:29:08 1731515348

I dig your vibe. whaddya working on these days?

gorkempacaci · 2024-11-13T07:16:04 1731482164

How about the costs? Isn’t this a very expensive bandaid? How is it not a priority? :)

bratbag · 2024-11-13T07:22:32 1731482552

Depends what else it's solving for.

I've seen multiple issues solved like this after engineering teams have been cut to the bone.

If the cost of maintaining enough engineers to keep systems stable for more than 24 hours, is more than the cost of doubling the container count, then this is what happens

JSDevOps · 2024-11-13T07:45:12 1731483912

This. All the domain knowledge has left. This sounds like a Hacky work around at best which AWS will welcome you with open arms come invoice day.

kenhwang · 2024-11-13T07:24:18 1731482658

Depends on how long it takes for the incoming instances to initialize and outgoing instances to fully decommission.

x = time it takes to switchover

y = length of the cycles

x/y = % increase in cost

For us, it's 15 minutes / 120 minutes = 12.5% increase, which was deemed acceptable enough for a small service.

toast0 · 2024-11-13T07:33:10 1731483190

Shouldn't be too high cost if you only run 2x the instances for a short amount of time. A reasonable use of Cloud, IMHO, if you can't figure out a less disruptive bandaid.

dochne · 2024-11-13T08:59:07 1731488347

AWS charges instances in 1 hour increments - so you're paying 150% the EC2 costs if you're doing this every 2 hours

kenhwang · 2024-11-13T09:02:05 1731488525

AWS has been charging by the second since 2017: https://aws.amazon.com/blogs/aws/new-per-second-billing-for-...

rsynnott · 2024-11-13T10:34:57 1731494097

> So much so that resolving the root cause isn't considered a priority and so we've had this running for months.

I mean, you probably know this, but sooner or later this attitude is going to come back to bite you. What happens when you need to do it every hour? Every ten minutes? Every 30 seconds?

This sort of solution is really only suitable for use as short-term life-support; unless you understand exactly what is happening (but for some reason have chosen not to fix it), it's very, very dangerous.

jasonjayr · 2024-11-13T11:29:32 1731497372

Well that's the thing: a bug that happens every 2 hrs and cannot be traced easily gives a developer roughly 4 opportunities in an 8hr day to reproduce + diagnose.

Once it's happening every 30 seconds, then they have up to 120 opportunities per hour, and it'll be fixed that much quicker!

actionfromafar · 2024-11-13T10:43:17 1731494597

In a way, yes. But it's also like a sledge hammer approach to stateless design. New code will be built within the constraint that stuff will be rebooted fairly often. That's not only a bad thing.

bongodongobob · 2024-11-13T07:50:16 1731484216

"It's shockingly stable." You're running a soup. I'm not sure if this is satire or not. This reminds me of using a plug-in light timer to reboot your servers because some java program eats all the memory.

keeganpoppen · 2024-11-13T08:16:20 1731485780

or installing software to jiggle the mouse every so often so that the computer with the spreadsheet that runs the company doesn't go to sleep

Cthulhu_ · 2024-11-13T09:09:02 1731488942

Still infinitely cheaper than rebuilding the spreadsheet tbh.

HL33tibCe7 · 2024-11-13T11:44:00 1731498240

Sometimes running a soup is the correct decision