My workplace currently has a similar problem, where a resource leak can grow dramatically under certain unpredictable/unknown traffic conditions.
Our half-day workaround was the same thing: just cycle the cluster automatically on a regular schedule.
Since we're running on AWS, we just double the size of the cluster, wait for the instances to initialize, then rapidly decommission the old instances. Every 2 hours.
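In boto3 terms, the whole trick is roughly the sketch below (simplified and not our exact code: the ASG name is made up, the fixed sleep stands in for real health checks, and it assumes the group's MaxSize allows running at 2x):

    import time

    import boto3

    autoscaling = boto3.client("autoscaling")
    ASG_NAME = "leaky-service-asg"  # placeholder name

    def cycle_cluster():
        # Look up the group's current instances and size.
        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[ASG_NAME]
        )["AutoScalingGroups"][0]
        old_instance_ids = [i["InstanceId"] for i in group["Instances"]]
        original_capacity = group["DesiredCapacity"]

        # 1. Double the cluster so fresh instances come up alongside the old
        #    ones (assumes MaxSize permits 2x the desired capacity).
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=ASG_NAME, DesiredCapacity=original_capacity * 2
        )

        # 2. Wait for the new instances to initialize (crude fixed wait here;
        #    a real job would poll instance health or target-group status).
        time.sleep(600)

        # 3. Decommission the old instances, shrinking back to original size.
        for instance_id in old_instance_ids:
            autoscaling.terminate_instance_in_auto_scaling_group(
                InstanceId=instance_id, ShouldDecrementDesiredCapacity=True
            )

    if __name__ == "__main__":
        cycle_cluster()  # in practice a scheduler invokes this every 2 hours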
It's shockingly stable. So much so that resolving the root cause isn't considered a priority and so we've had this running for months.
Sounds like process-level garbage collection. Just kill it and restart. Which also sounds like the apocryphal tale about the leaky code and the missile.
"This sparked and interesting memory for me. I was once working with a customer who was producing on-board software for a missile. In my analysis of the code, I pointed out that they had a number of problems with storage
leaks. Imagine my surprise when the customers chief software engineer said
"Of course it leaks"
He went on to point out that they had calculated the amount of memory the application would leak in the total possible flight time for the missile and then doubled that number. They added this much additional memory to the hardware to "support" the leaks. Since the missile will explode when it hits it's target or at the end of it's flight, the
ultimate in garbage collection is performed without programmer intervention."
At least with the missile case, someone _did the analysis and knows exactly what's wrong_ before deciding the "solution" was letting the resources leak. That's fine.
What always bothers me is when people don't understand what exactly is broken, but just reboot every so often to fix things (note: I'm not saying this is the case for the grandparent comment, but it's implied).
For a lot of bugs, there's often the component you see (like the obvious resource leak) combined with subtle problems you don't see (data corruption, perhaps?) and you won't really know until the problem is tracked down.
That's super interesting and I love the idea of physically destructive GC. But to me that calculation and tracking sounds a lot harder than simply fixing the leaks :)
> It's shockingly stable. So much so that resolving the root cause isn't considered a priority and so we've had this running for months.
The trick is to not tell your manager that your bandaid works so well, but that it barely keeps the system alive and you need to introduce a proper fix.
Been doing this for the last 10 years and we got our system so stable that I haven't had a midnight call in the last two years.
Heroku reboots servers every night no matter what stack is running on them. Same idea.
The problem is that you've merely bought yourself some time. As time goes on, more inefficiencies/bugs of this nature will creep in unnoticed, and some may silently corrupt data before anyone notices (!). At that point it will be vastly more difficult to troubleshoot 10 bugs of varying severity and frequency all happening at the same time, forcing you to reboot the servers at shorter and shorter intervals, which in turn makes each bug harder to diagnose individually.
> It's shockingly stable.
Well of course it is. You're "turning it off and then on again," the classic way to return to a known-good state. It is not a root-cause fix though, it is a band-aid.
Also, it means you are married to the reboot process. If you lose control of your memory management too much, you'll never be able to fix it short of a complete rewrite. I worked at a place that had a lot of (C++) CGI programs with a shocking disregard for freeing memory, but that was OK because the process restarted when the CGI request was over. Then they reused that same code in SOA/long-lived services, but they could never have one worker process handle more than 10 requests due to memory leaks (and the inability to re-initialize all the memory used in a request). So they could never use in-process caching or any other optimization that long-lived processes enable.
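The usual stopgap there, if you can't actually fix the leaks, is to cap how many requests a worker serves before it exits and gets respawned (the same knob gunicorn exposes as max_requests and Apache as MaxRequestsPerChild). A toy sketch of that recycling pattern, with made-up names:

    import sys

    MAX_REQUESTS = 10  # the kind of hard cap the leaky services were stuck with

    def handle_request(request):
        # placeholder handler; pretend it leaks a little memory on every call
        return f"handled {request}"

    def worker_loop(get_next_request):
        # Serve a bounded number of requests, then exit so the parent or
        # supervisor respawns a fresh process with clean memory -- the "GC by
        # process death" that per-request CGI gave the old code for free.
        for _ in range(MAX_REQUESTS):
            handle_request(get_next_request())
        sys.exit(0)

    if __name__ == "__main__":
        requests = iter(range(100))
        worker_loop(lambda: next(requests))  # exits after MAX_REQUESTS requests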
> I don't know why my senses tell me that this is wrong
The fix is also hiding other issues that show up. So it degrades over time and eventually you’re stuck trying to solve multiple problems at the same time.
^ This is the problem. Not only that, solving 10 bugs (especially those more difficult nondeterministic concurrency bugs) at the same time is hideously harder than solving 1 at a time.
As a Director of Engineering at my last startup, I had an "all hands on deck" policy as soon as any concurrency bug was spotted. You do NOT want to let those fester. They are nondeterministic, infrequent, and exponentially dangerous as more and more appear and are swept under the rug via "reset-to-known-good" mitigations.
This has been done forever. Ops teams have had cronjobs to restart misbehaving applications outside business hours since before I started working. In a previous job, the solution for disks filling up on an on-prem VM (no, not databases) was an automatic reimage. I've seen scheduled index rebuilds on Oracle. The list goes on.
If you do look into the Oracle DBA handbook, scheduled index rebuilds are somewhat recommended. We do it on weekends on our Oracle instances. Otherwise you will encounter severe performance degradation in tables where data is inserted and deleted at high throughput, leading to fragmented indexes. And since Oracle 12c with ONLINE REBUILD, this is no problem anymore even at peak hours.
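For reference, the weekend job boils down to issuing an online rebuild per fragmented index, roughly along these lines (index names and connection details are placeholders, and the fragmentation check that decides which indexes to touch is left out):

    import cx_Oracle  # or the newer python-oracledb package

    # Placeholder index names; a real job would pick candidates by checking
    # fragmentation/statistics first, which is elided here.
    INDEXES_TO_REBUILD = ["ORDERS_PK_IDX", "EVENTS_TS_IDX"]

    def rebuild_indexes(user, password, dsn):
        with cx_Oracle.connect(user=user, password=password, dsn=dsn) as conn:
            cursor = conn.cursor()
            for index_name in INDEXES_TO_REBUILD:
                # ONLINE keeps the table available for DML during the rebuild.
                cursor.execute(f"ALTER INDEX {index_name} REBUILD ONLINE")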
This is not exactly a new tactic, and it's not something that requires a cloud solution either. A randomized 'kill -HUP' could do the same thing, for example.
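Something as small as the following would do it (the process name is a placeholder, and whether HUP restarts the worker or just reloads config depends entirely on the daemon):

    import os
    import random
    import signal
    import subprocess

    def random_hup(process_name):
        # pgrep -f prints matching PIDs one per line (assumes a Linux-ish box).
        pids = subprocess.run(
            ["pgrep", "-f", process_name], capture_output=True, text=True
        ).stdout.split()
        if pids:
            # Signal one worker at random; its reaction to SIGHUP is daemon-specific.
            os.kill(int(random.choice(pids)), signal.SIGHUP)

    if __name__ == "__main__":
        random_hup("leaky-worker")  # placeholder process name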
This fix means that you won't notice when you accumulate other such resource leaks. When the shit eventually hits the fan, you'll have to deal with problems you didn't even know you had.
People will argue you should spend time on something else once you've put a bandaid on a wooden leg.
You should do a proper risk assessment. Such a bug may be leveraged by an attacker, or may actually be a symptom of an ongoing attack. It may also lead to data corruption or exposure. It may mean some parts of the system are poorly optimised and over-consuming resources, perhaps impacting user experience. With a dirty workaround, your technical debt increases; expect more and more random issues that require aggressive "self-healing".
It's just yet another piece of debt that gets prioritized against other pieces of debt. As long as the cost of this debt is purely fiscal, it's easy enough to position in the debt backlog. Maybe a future piece of debt will increase the cost of this. Maybe paying off another piece of debt will also pay off some of this. The tech debt payoff prioritization process will get to it when it gets to it.
Without proper risk assessment, that's poor management and a recipe for disaster. Without that assessment, you don't know the "cost", if that can even be measured. Of course one can still run a business without doing such risk assessment and poorly managing technical debt, just be prepared for higher disaster chances.
Oh no it isn't. A garbage collector needs to prove that what's being collected is garbage. If objects get collected because of an error... that's not really how you want GC to work.
If you are looking for an apt metaphor, Stalin sort might be more in line with what's going on here. Or maybe "ostrich algorithm".
>A garbage collector needs to prove that what's being collected is garbage
Some collectors may need to do this, but there are several collectors that don't. EpsilonGC is a prime example of a GC that doesn't need to prove anything.
EpsilonGC is a GC in the same sense as a suitable-size stick is a fully automatic rifle when you hold it to your shoulder and say pew-pew...
I mean, I interpret your comment as a joke, but you could've made it a bit more obvious for people not familiar with the latest fancy in the Java world.
To be fair, this is what the BEAM VM structures everything on: if something is wonky, crash it and restart from a known-good state. Except when BEAM does it, everyone says it's brilliant.
It's one thing to design a crash-only system, and quite another to design a system that crashes all the time and paper over it with a cloud orchestration layer later.
I don't see the fundamental difference. Both systems work under expected conditions and will crash parts of themselves when those conditions don't hold. The scales (and thus the visibility of bugs) change, the technologies change, but the architecture really doesn't. Erlang programs are not magically devoid of bugs; the bugs just don't surface as errors.
I understand this perspective, but a BEAM process can die and respawn in microseconds, whereas this solution involves booting a whole Linux kernel. The cost of the crash domain matters. Similarly, process-per-request webservers are a somewhat reasonable architecture on unix but awful on Windows. Why? Windows processes are more expensive to spawn and destroy than unix ones.
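To make the contrast concrete, the cheap end of that spectrum is just a process-level supervisor loop restarting a crashed worker from a known-good state in milliseconds. A toy sketch (the worker body is made up, and real supervisors, BEAM's included, add backoff, restart limits, and supervision trees):

    import multiprocessing
    import time

    def worker():
        # placeholder workload that eventually misbehaves and dies
        time.sleep(5)
        raise RuntimeError("something went wonky")

    def supervise():
        # Restart the worker from a known-good starting state whenever it exits.
        while True:
            proc = multiprocessing.Process(target=worker)
            proc.start()
            proc.join()      # returns once the worker crashes or exits
            time.sleep(1)    # tiny pause before respawning

    if __name__ == "__main__":
        supervise()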
I am running an old, statically compiled Perl binary that has a memory leak. The container is restarted automatically every day so I don't have to deal with the problem. It has been running like this for many, many years now.
I guess this works right up until it doesn't? It's been a while, but I've seen AWS hit capacity for a specific instance size in a specific availability zone. I remember spot pricing being above the on-demand pricing, which might have been part of the issue.
I've realized that the majority of engineers have no critical thinking skills and are unable to see things beyond their domain of speciality. Arguments like "even when accounting for a potential incident, your solution is more expensive, while our main goal is making money" almost never work, and I've been in countless discussions where some random document of "best practices", whatever those are supposed to be, was treated like sacred scripture.
We are dogmatic and emotional, but the temptation to believe our opinions are grounded in some "deeper theory" is strong.
Pragmatically, restart the service periodically and spend your time on more pressing matters.
On the other hand, we fully understand the reason for the fault, but we don't know exactly where the fault is. And it is our fault. It takes a certain kind of discipline to say "there are many things I understand but don't have time to master right now, let's leave it."
"certain kind" of discipline, indeed... not the good kind. and while your comment goes to great pains to highlight how that particular God is dead (and i agree, for the record), the God of Quality (the one that Pirsig goes to great lengths to not really define) toward which the engineer's heart of heart prays that lives within us all is... unimpressed, to say the least.
Sure, you worship the God of Quality until you realize that memory leak is being caused by a 3rd party library (extra annoying when you could have solved it yourself) or a quirky stdlib implementation
Then you realize it's a paper idol and the best you can do is suck less than the average.
> "certain kind" of discipline, indeed... not the good kind.
Not OP, but this is a fairly normal case of making a tradeoff? They aren't able to repair it at the moment (or rather don't want to / can't allocate the time for it) and instead trade resource usage for stability and technical debt.
that's because the judge(s) and executioner(s) aren't engineers, and the jury is not of their peers. and for the record i have a hard time faulting the non-engineers above so-described... they are just grasping for things they can understand and have input on. who wouldn't want that? it's not at all unreasonable for the keepers of the purse strings to expect a certain amount of genuflection by way of self-justification. no one watches the watchers... but they're the ones watching, so may as well present them with a verisimilitudinous rendition of reality... right?
but, as a discipline, engineers manage to encourage the ascent of the least engineer-ly (or, perhaps, "hacker"-ly) among them ("us") ...-selves... through their sui generis combination of learned helplessness, willful ignorance, incorrigible myopia, innate naïveté, and cynical self-servitude that signify the Institutional (Software) Engineer. coddled more than any other specialty within "the enterprise", they manage to simultaneously underplay their hand with respect to True Leverage (read: "Power") and overplay their hand with respect to complexes of superiority. i am ashamed and dismayed to recall the numerous times i have heard (and heard of) comments to the effect of "my time is too expensive for this meeting" in the workplace... every single one of which has come not from the managerial class-- as one might reasonably, if superficially, expect-- but from the software engineer rank and file.
to be clear: i don't think it's fair to expect high-minded idealism from anyone. but if you are looking for the archetypical "company person"... engineers need look no further than their fellow podmates / slack-room-mates / etc. and thus no one should be surprised to see the state of the world we all collectively hath wrought.
I've seen multiple issues solved like this after engineering teams have been cut to the bone.
If the cost of maintaining enough engineers to keep systems stable for more than 24 hours is more than the cost of doubling the container count, then this is what happens.
Shouldn't be too high a cost if you only run 2x the instances for a short amount of time. A reasonable use of the cloud, IMHO, if you can't figure out a less disruptive bandaid.
> So much so that resolving the root cause isn't considered a priority and so we've had this running for months.
I mean, you probably know this, but sooner or later this attitude is going to come back to bite you. What happens when you need to do it every hour? Every ten minutes? Every 30 seconds?
This sort of solution is really only suitable for use as short-term life-support; unless you understand exactly what is happening (but for some reason have chosen not to fix it), it's very, very dangerous.
Well that's the thing: a bug that happens every 2 hrs and cannot be traced easily gives a developer roughly 4 opportunities in an 8hr day to reproduce + diagnose.
Once it's happening every 30 seconds, then they have up to 120 opportunities per hour, and it'll be fixed that much quicker!
In a way, yes. But it's also a sledgehammer approach to stateless design. New code will be built within the constraint that stuff gets rebooted fairly often. That's not entirely a bad thing.
"It's shockingly stable." You're running a soup.
I'm not sure if this is satire or not.
This reminds me of using a plug-in light timer to reboot your servers because some java program eats all the memory.