Scaling up the Prime Video audio/video monitoring service and reducing costs (primevideotech.com)
989 points by debdut on May 4, 2023 | 507 comments



My word. I'm sort of gobsmacked this article exists.

I know there are nuances in the article, but my first impression was it's saying "we went back to basics and stopped using needlessly expensive AWS stuff that caused us to completely over-architect our application and the results were much better". Which is a good lesson, and a good story, but there's a kind of irony it's come from an internal Amazon team. As another poster commented, I wouldn't be surprised if it's taken down at some point.


There was an article not long ago from AWS saying they'll be focusing on cutting costs for customers. Maybe the next step of that process will be pushing their clients off of AWS and telling them to just host on prem.


I know you're joking around, but no, as they also explained a benefit of cloud (and therefore using AWS) is that it can scale flexibly with their customers' businesses.

If your business invests in physical servers anticipating strong growth next year then later finds out actually we're going into a recession and those servers are no longer needed, then that's a sunk cost.

With cloud if demand drops you can scale up and down as needed. Helping customers cut costs during difficult times makes sense since those customers are more likely to survive and stay with you through good times.

So in context I think this article makes sense since long-term sustainable growth of AWS should be linked with the growth of their customers' businesses.


> If your business invests in physical servers anticipating strong growth next year then later finds out actually we're going into a recession and those servers are no longer needed, then that's a sunk cost.

Cloud vendors also mostly sell minimum use packages for discounts in the range of 20 to 80% (called e.g. "committed use discount" or "compute savings plan"). Lots of businesses use those, because two-digit discounts are real money, but they might find themselves in the same spot as with physical hardware they don't need...
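
A toy back-of-the-envelope illustration of that trap, with entirely made-up numbers (not any vendor's actual pricing):

    # Hypothetical: a 1-year savings plan commits you to an hourly spend
    # whether or not you still need the capacity.
    hourly_commit = 10.0                # USD/hour committed (made up)
    discount = 0.40                     # 40% off on-demand (plans range roughly 20-80%)
    on_demand_rate = hourly_commit / (1 - discount)

    hours_per_year = 24 * 365
    committed_cost = hourly_commit * hours_per_year        # paid no matter what

    # If demand drops and you only needed half the capacity, bought on-demand:
    actual_need_on_demand = 0.5 * on_demand_rate * hours_per_year

    print(f"committed: ${committed_cost:,.0f} vs on-demand for actual need: ${actual_need_on_demand:,.0f}")
    # committed: $87,600 vs on-demand for actual need: $73,000
    # Same shape of problem as owning servers you no longer use.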


Yup, and you are paying the premium of cloud forever, which over some vanilla compute & storage can be a lot.

And cloud proponents pretend data center / rack space / server leasing doesn't exist either, for those trying to avoid large up front costs.


I'm a cloud proponent because it means not having to sit through hours of meetings to deploy a $5/mo virtual machine.

It also means some poor fuck at AWS gets woken up in the middle of the night instead of me when things go to shit.

It absolutely comes at a cost, and might not be the right fit for an organisation that's absolutely on top of its hardware requirements and can afford to divert resources from new development work. For the rest of us it saves a lot of dev hours that would have otherwise been spent in pointless meetings or debating the best implementation of whatever half-baked stack has oozed its way out of the organisation in an attempt to replicate what's handed to you with a cloud solution.


> I'm a cloud proponent because it means not having to sit through hours of meetings to deploy a $5/mo virtual machine.

And endless orgies of "call for pricing" with hardware vendors and hosting. Shitty websites where you can buy preconfigured servers somewhat cheaply, or vendor websites where you can configure everything but overpay. Useless sales-droids trying to "value-add" stuff on top.

Cloud buys are a lot friendlier, because you only have the one cloud vendor to worry about. Entry level you just pay list price by clicking a button. If you buy a lot, you are big enough to have your own business people to hammer out a rebate on list price, still very easy, still very simple. But overall still more expensive unfortunately.


> I'm a cloud proponent because it means not having to sit through hours of meetings to deploy a $5/mo virtual machine.

I'd hope there aren't actually hours of meetings for a single $5/mo VM?

But I would hope there are reviews and meetings when deploying enough of these to amount to real money. Companies that don't do that soon enough find themselves with a million dollar AWS bill without understanding what's going on.

Spend is spend, it's vital to understand what is being spent on what and why.


> I'd hope there aren't actually hours of meetings for a single $5/mo VM?

Slightly exaggerated in the case of the $5 machine, probably 2-3 man-hours total, but it took 4 days for it to be deployed instead of ~5 minutes. We did spend tens of hours justifying why the business should spend ~$100 more per month on a production system where the metrics clearly indicated that it was resource constrained.

The same IT department that demanded we justify every penny spent did not apply any of that rigour to their own spending. Control over the deployment of resources was used as a political tool to increase their headcount.

> I would hope there are reviews and meetings when deploying enough of these to amount to real money. Companies that don't do that soon enough find themselves with a million dollar AWS bill without understanding what's going on.

I consider the judicious use of resources to be part of my job as a software engineer. A development team that isn't considering how they can reduce spend, tidy up, or right-size their resources is a massive red flag to me. Organisations frequently shoot themselves in the foot by shifting that responsibility away from the development team. The result is usually factional infighting and more meetings.


It's not really the same spot in that you're paying monthly rather than upfront. Devs tend to think about total $, the business/accountants do care about Opex vs Capex.

Also it's going to be simpler to provision your base (committed use) on the cloud and then handle bursts on the cloud, than it is to have your base on prem and burst to the cloud.


> It's not really the same spot in that you're paying monthly rather than upfront. Devs tend to think about total $, the business/accountants do care about Opex vs Capex.

You can buy physical servers on a lease, turning it into opex.

You can also rent them for a little bit extra via managed dedicated servers from vendors like OVH.


I think this point isn’t made often enough.

Not going with a big cloud provider def doesn’t mean that you need to buy physical servers and build an on-prem data center.


Another point I've also seen used to lie about cloud costs is the claim that you save so much on engineers.

...while forgetting that to have a sane on-call rotation for cloud you also need at least 3 people on that rotation who are clued in enough on cloud operations. Sure, they can be "developers", but if your app architecture requires so little maintenance and flea removal that they are not doing ops jobs much, chances are it would be the same in either a rented or dedicated server environment.


That is not really a difference, you may as well lease your server farm in the basement, practically the same cost as buying it, just as a monthly payment with the supposed "advantages" the business people might care about.


> If your business invests in physical servers anticipating strong growth next year then later finds out actually we're going into a recession and those servers are no longer needed, then that's a sunk cost.

Yes, but that sunk cost is probably still lower than what you paid AWS for the option to scale up and down.


This. And I think people tend not to understand how little actual hardware they are paying for when using AWS et al.

A really cheap server leasing deal will cost you yearly about as much as the purchase price of the server. With opaque AWS services it is probably more like a month of subscription to pay for the hardware that you are indirectly using.


I worked for a global company that maintained its own "cloud" of VMs that we'd use for development purposes.

They were entirely unusable.

Opening a relatively small file in notepad could take multiple minutes. OS click and typing response times were measured in seconds.

Despite wasting thousands of developer hours each year, they refused to upgrade their data center. Probably because doing so would have been a major budget fight that requires an executive to actually advocate for something instead of making their characteristic animalistic grunts of agreement.

For better or worse I haven't seen the same issue with cloud expenditure. It seems to be perceived as a necessary expense, rather than the engineering department getting ideas above their station.


Notepad? Clicking? I think I see the problem.


Mostly deployed on Windows VMs; the only access was to remote in.

Typing/executing in PowerShell was just as slow.


I spent the better part of two years advocating, pushing, and fighting to add new bandwidth to our datacenter.

Thankfully after they understood the problem it only took 8 months of procurement, techs going to the data center 10+ times with endless screw ups, and everyone pointing the finger at each other.

While the cloud sucks in many ways the traditional setup has big problems as soon as you hit a midsize company ime.


I counter your anecdote with mine!

A cloud vendor (who will remain nameless as I signed an NDA that specifically prevents me from disparaging them; but one of the big three) ran out of capacity for me and it was 3 months before they managed to fix it -- and that was with a couple million a month in spend.

Cloud is still servers; you just depend on someone else's capacity management skills and you hope that there isn't a rush to populate a location (like when a region goes down and everyone's auto-provisioners move regions to yours).


Barring exceptional circumstances, I don't have to fight that fight at the cloud provider though. Their business is more likely to be amenable to maintaining and expanding reasonable levels of capacity.

I have to deal with a grumpy finance guy that thinks my whole department is overpaid already, especially so if we might use the dreaded `CapEx` word.


3 months vs 2 years, I'll take it :)


I think the main point here is that there is no limit to incompetence. And sure, having your own servers allows for some goofs that won't happen with cloud (the opposite is also true). But your org had the means to fix the issue, and they chose not to. That has fundamentally got nothing to do with technology choice.


Or maybe there's a mid point. It's not datacenter or cloud. There are providers offering physical servers for rent for example. Lots of combinations in-between.


And you can also lease servers directly from vendors like OVH so you don't even need to bother with the "drive to datacenter and install it" part. It's more expensive but still far cheaper than cloud.


Most companies will pay way more for the engineers maintaining their on-premises infrastructure than they would for AWS. On-premises still makes sense once you reach a certain scale.


They kind of need to be there anyway; physically maintaining servers turns out to be a minuscule part of the whole maintenance. If you really care about uptime you still need people on-call who can intervene as necessary.


It's not a minuscule part at a small company. I made the point that on-premises makes sense after a certain scale.

Once you have on-premises you need people that know switches, routers, rackmount servers, hardware, virtualization, etc., plus keeping all of that properly maintained (security patches, IaC, periodic updates, analyzing performance, making sure it's properly architected, etc.).

I often see people saying it's the same cost or less but it's really not. Unless you have no idea what you should be doing.


I don't know, I worked at a few companies that did this early in my career (early 2000s), and it was just the devs or the sysadmin of the office IT that did this sort of thing. There are lots of people who know enough about switches and routers to get them up and running.

Virtualization, IaC, analyzing performance, right architecture etc is all for later, when you've grown enough to need that.


> Virtualization, IaC, analyzing performance, right architecture etc is all for later, when you've grown enough to need that.

Yeah, I think it might be a different perspective about when that all should be done.

I tend to do that right from the beginning because I often see it snowball later on and nobody ever fixes it or does it "properly" (in my opinion, possibly not the right one).

But that's a good point, no doubt.


You also have to account for the fact you're paying upfront the cost for the lifetime of the infra, vs paying monthly.


It's also possible to rent hardware in a rack and pay monthly, that's much cheaper than cloud services.


Using insurance to cover unexpected costs is always a gamble, one way or another. A business that invests in physical servers could sell off those servers if they later find out that there is a recession, which might cost more or less compared to a cloud solution.

If a business invests in a cloud infrastructure and creates a binding contract for 5 years, only to find out that they actually want to abandon that project a year later, that is also a sunk cost. Long term contracts tend to be cheaper, so it's a trade-off between saving money and risk.

It all depends on the risk analysis, how risk-averse one wants to be, and the economics/liquidity needs.


There's a middle ground between cloud and on-prem

You don't need to go all in on buying racks and other hardware, when you can rent servers at Hetzner at a cheaper cost than aws.


It depends on your timing. If you're extremely unlucky, you'll buy the new set of servers and the recession will hit right after you sign the PO. Probability says you're not likely gonna be that unlucky, so the recession will hit probably elsewhere in the physical servers' life cycle. A recession hits, and now there's a focus on cutting costs. With AWS, you don't have much choice - if you stop paying the bill, the servers evaporate into the cloud. Physical servers don't. You can change their replacement schedule and just wait a few more years to replace them. Hopefully the recession has passed by then and you can buy a whole new pile of servers.

Really though, it seems like a hybrid on-prem/cloud approach is one to consider. Software like Anthos eases this, though there are also pitfalls with this approach too.


Scale to zero has always been my favorite feature of serverless compute. If there was a fast way to do it with VMs, I'd certainly consider it.


> as they also explained a benefit of cloud (and therefore using AWS)

Although in this specific case, being a team at AWS, they are using their own company data centers, so it's essentially on-prem to them.


We have some DB servers that occasionally need to do very large batches of transactions. They run all month with a couple of CPUs and a small amount of RAM to make sure they are 'caught up' with production, and before the batches are run, get shut down and changed to 32 or 64 CPU monsters. An hour or two later, they go back to the 2 CPU servers again. In a non-cloud shop, we would have to size our hardware for that maximum batch size.
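
The monthly dance is roughly this shape (a boto3 sketch with placeholder instance IDs and types, not our actual tooling):

    import boto3

    ec2 = boto3.client("ec2")
    DB_INSTANCE = "i-0123456789abcdef0"   # placeholder instance ID

    def resize(instance_id, instance_type):
        """Stop the instance, change its type, and bring it back up."""
        ec2.stop_instances(InstanceIds=[instance_id])
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
        ec2.modify_instance_attribute(InstanceId=instance_id,
                                      InstanceType={"Value": instance_type})
        ec2.start_instances(InstanceIds=[instance_id])
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    resize(DB_INSTANCE, "r5.16xlarge")   # scale up to 64 vCPUs before the batch
    # ... run the big monthly batch ...
    resize(DB_INSTANCE, "r5.large")      # back down to 2 vCPUs afterwards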


You can sell the servers and partially return the money.


To be fair to AWS, they do work really hard (at least at an account level) to optimize workloads with you. They do this so overall you'll move more workloads to them.

It's quite simple: if workload x can be done 100% cheaper on-prem then it's an obvious move (probably); if AWS manage to get that closer to 30-40%, then the operational benefits of using AWS make more sense - more workloads, more total spend.


You mean 50% cheaper on-prem, right? 100% cheaper is 0.


Still waiting for Python 3.11 on Lambda, so it must not be that big of a focus.

(They finally delivered 3.10 last month at least)


If you're sensitive to a 25% performance improvement, why not switch languages and get a 1000% performance improvement?


An easily tested compatible upgrade that gets a free performance boost… vs lots of engineering effort to rewrite… yeah, that's just not going to fly with management, who are probably looking at the 25% performance boost as a 20% cost reduction, not a 20% speed increase.


It is possible[1]. You are better off this way than the AWS "native" way.

1. https://dev.l1x.be/posts/2023/02/28/using-python-3.11-with-a...


The only problem with that is that Docker lambdas boot slower than lambdas with the built in runtime (not ridiculously slow, but could be 2x or something). God help you anyway if you’re trying to do something latency sensitive on Lambda, but if you are then you probably don’t want to add more time for a docker pull.


I used to believe the same thing, but I've begun thinking it might now be false, yet often repeated. AWS claims that:

> Lambda also optimizes the image and caches it close to where the functions runs so cold start times are the same as for .zip archives.[0]

This[1] article shows almost no discernible difference in .NET cold start times between containerised and regular lambdas.

It's easy to imagine developers pushing up bloated images, slowing startup down and blaming docker/AWS for it.

[0] https://aws.amazon.com/blogs/compute/working-with-lambda-lay...

[1] https://www.kloia.com/blog/aws-lambda-container-image-.net-b...


The situation does keep changing - AWS does optimize things.

I'm not so sure it's a black/white true/false. It depends on what goes in the Docker image. It's something like: for larger deployments Docker is faster, but for small deployments it's the other way around.


We've actually observed the opposite at our company. Moving from the Python 3.8 built-in runtime to Docker-based changed our response times from about 40ms to 30ms on average.


> Maybe the next step of that process will be pushing their clients off of AWS and telling them to just host on prem.

And then charging them to use AWS anywhere and outpost!



> Which is a good lesson, and a good story, but there's a kind of irony it's come from an internal Amazon team. As another poster commented, I wouldn't be surprised if it's taken down at some point.

Why? Using the model they switched to (which uses a different set of AWS services) instead of the model they switched from is a recommendation that the AWS tech advisers that are made available to enterprise customers will make for certain workloads.

Now when they do that, they can also point to this article as additional backing.


Have you had AWS tech advisers advise teams in your company to go with this stack? Because I haven't.

AWS doesn't have an equally distributed interest in selling all of its products. Some AWS products exist because customers need/demand them and others exist because they provide higher margins and tighter lock-in to Amazon: the first type of product is great for customer acquisition; the role of their sales folk is then to convince people using the former to migrate to the latter.


I've never knowingly had an AWS solutions architect recommend something because it would make AWS more money in the short term. The most frequent advice I've seen from them has been on how to give AWS less money by making use of different features, or changing how particular services are being used.


You sound very confident in your estimation of other peoples' motivations and skills. Do you imagine AWS solution architects coming to you directly with "you should pay for this service because it makes more money for AWS"?

I've done the AWS solutions architect associate level cert and I can tell you from first-hand experience that in order to pass the exam you need to memorize a lot of AWS propaganda that was written primarily to optimize AWS profit, not to optimize customer satisfaction. How many of those solution architects take those materials with a grain of salt vs how many of them genuinely believe that crap, I don't know.


I guess it depends on what you mean by "AWS tech advisers" ... I've had two different AWS recommended partners advise us to go to this stack, one of the partners even tried to explain that despite more than doubling in costs we would "make it up in faster development".


> Some AWS products exist because customers need/demand them and others exist because they provide higher margins and tighter lock-in to Amazon

I'm still trying to figure out which one Aurora and Cognito fall under.


Aurora is I think pretty simple to move away from, since it's just fully compatible Postgres or Mysql. We even use a local postgres for development purposes against an Aurora solution.


Nope. AWS makes it dead simple to move from RDS to Aurora by clicking a button. There's no way to move data from Aurora to RDS short of doing a SQL dump and reloading everything that way. I found this out when my previous employer was looking at moving from RDS to Aurora.


> > Aurora is I think pretty simple to move away from, since it's just fully compatible Postgres or Mysql. We even use a local postgres for development purposes against an Aurora solution.

> Nope. AWS makes it dead simple to move from RDS to Aurora by clicking a button. There's no way to move data from Aurora to RDS short of doing a SQL dump and reloading everything that way. I found this out when my previous employer was looking at moving from RDS to Aurora.

I got a bit of a chuckle out of this. There's no way to move from Aurora to RDS short of... 2 minutes of actual work and a lot of waiting around due to the limitations of the hardware?

I get that it's not as easy as a literal button click, but this isn't vendor lock in.


If you can't stream changes and have to take downtime for a migration then you effectively have vendor lock in if you are serious enough about your database.

Physical replication should be something any database can offer, given that it's something cross-database migrations have used for decades with no problem.


What if you have a really big database? Still 2 minutes?


I'm intentionally ignoring any of the sarcasm in your comment. The time needed for a db dump is always dependent upon the amount of data. This is true regardless of the db software or where it's running.


I can't think of any project I've worked on where the main database could be backed up and restored to a different database in "2 minutes".

The sarcasm was warranted.


Isn’t the suggestion that the “active work” is very little and most of the time you’re just waiting?


Well, the elapsed time is how long I need production to be offline so there are a few more people "just waiting"


Sure, but you don’t really expect to swap databases with a huge data store without any downtime, do you? I’m not aware of any technology that makes that easy.


Then why are you defending the ridiculous "2 minute" thing?

If the CEO asked how long a migration will take would you respond "2 minutes of engineering time"?


If you were the CEO and your engineer said, "2 minutes of actual work and a lot of waiting around due to the limitations of the hardware", would you interpret that as "2 minutes of engineering time"?

Obviously I'm going to spend a lot more time communicating the details of the situation to the CEO of a company that is paying me, than I'm going to spend communicating in a Hacker News comment. But as it turns out, no amount of communication is going to be effective if people don't bother to read past the first opportunity they see to jump in with a correction, even if that means stopping reading mid-sentence.


In that situation I would guess the one-click tool doesn’t really handle everything you’d need either so I don’t get what the point of the comparison is.


I’m not the person would said it was two minutes.


Nor am I, but if we go back up, the claim was that they were making it much easier to get in than get out. Unless you're telling me that the one-click tool somehow solves all the issues of migrating a large production database that's in use, the difference between the two is a minuscule amount of active work.


Why would you take production offline to swap databases? Elapsed time is how long you need to run the new database alongside the old one.


This whole thread is about incompatible databases. If you know a 2 minute solution to a live transfer in that situation I’m all ears.


If only you had read the whole sentence, you might have saved yourself a bit of righteous anger.


You seem to be the one angry here.

I'm not sure what I was supposed to take away from your cryptic sentence. Is there a two minute solution to this problem that you are smugly keeping to yourself so you can mock people replying to you?


My guy, read this whole sentence, which has remained unchanged this entire conversation:

"There's no way to move from Aurora to RDS short of... 2 minutes of actual work and a lot of waiting around due to the limitations of the hardware?"

You seem to be having trouble getting past the word "and", so I've helpfully italicized the part you've repeatedly missed or ignored.

Now sure, that's a bit vague, but if you want more details it might have been advisable to ask a question rather than simply ignoring half the sentence because you don't understand it and jumping in with a correction.

And honestly, even if it's vague on some details, there's no universe in which "2 minutes and a lot of waiting around" = "2 minutes". Whatever vagueness you might accuse me of, that fact isn't vague.


Hey,

that is a fair answer! AFAIK, there are two things to consider:

* Aurora does have some vendor-locking features, if I'm not wrong?

* Moving from Aurora to Postgres will lead to downtime of 2 minutes + unknown waiting, whereas this is not the case when you convert from Postgres to Aurora.


I think you've failed to consider how impractical even doing a SQL dump can be on a large-ish database, forget about the reloading time.


I haven't failed to consider that, you've just failed to read where I explicitly mentioned that.

Consider: is that impracticality caused by Amazon creating vendor lock-in? Or is that impracticality caused by the fact that reading terabytes of data from storage, transferring it over the network, and writing it into storage is inherently slow because of the physical limitations of hardware, no matter what vendor you're using?

It's a bit odd for me to be in the position of defending Amazon here. I genuinely don't like them, don't use them, and generally do think they're guilty of creating a lot of vendor lock-in. But this is legitimately not an example of any of that.


I didn't miss where you mentioned it. I missed where you considered it. As in, you know, gave any thought to the practicalities involved.


You're really claiming you can read my mind to know what I have and have not considered right now, despite me bringing it up specifically. Alrighty then.


I'm simply going by what you've written here. If you have considered it, then you have not "shown your work," so to speak.


And if I were being graded that would matter.

As is, I've no responsibility to show you anything, and you're just making unwarranted assumptions about what I have and haven't considered, based on a pretty selective reading of what I've said.


With SSO multi-account and 2FA, it'd probably take longer than 2 minutes just to authenticate and configure AWS correctly to get started.


Yes, still "2 minutes of actual work and a lot of waiting around due to the limitations of the hardware" which is what I said. Perhaps try reading the whole sentence next time?


Kinda depends if you can run replication from Aurora to a regular database.


Aurora Postgres supports bog standard native Postgres logical replication.
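
For anyone curious, it's the standard publication/subscription setup; a rough sketch with psycopg2 (placeholder hosts and credentials, and the Aurora cluster parameter group needs rds.logical_replication enabled first):

    import psycopg2

    # Source (Aurora Postgres): publish everything.
    src = psycopg2.connect("host=aurora-writer.example.internal dbname=app user=admin password=...")
    src.autocommit = True
    src.cursor().execute("CREATE PUBLICATION move_out FOR ALL TABLES;")

    # Target (plain Postgres): subscribe. The schema must already exist there,
    # since logical replication does not copy DDL.
    dst = psycopg2.connect("host=plain-postgres.example.internal dbname=app user=admin password=...")
    dst.autocommit = True   # CREATE SUBSCRIPTION can't run inside a transaction block
    dst.cursor().execute(
        "CREATE SUBSCRIPTION move_out_sub "
        "CONNECTION 'host=aurora-writer.example.internal dbname=app user=repl password=...' "
        "PUBLICATION move_out;"
    )
    # Once the subscriber catches up, cutover needs only a brief write pause
    # instead of a full dump/restore window.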


Oh, cool, I wasn't aware you could replicate from Aurora RDS to Postgres RDS. That reduces the risk.


If you talk about no vendor lock-in, and you'd then want to take your database to a competitor, like Google Cloud, Azure, or on-premises, wouldn't you expect exactly to do a SQL dump and reload everything? To me, the one-click move from RDS to Aurora you describe is a nice shortcut, but it doesn't invalidate that you can still do the former if you wanted to move to the competitor. Vendor lock-in seems more like you've architected your application against a system that only exists on AWS, like SQS or S3 (although, I guess, competitors offer compatible APIs for some of those; I'm not entirely read up on the state of things there).


Seems a bit of a dark pattern to have a shortcut to onboard and not offer the same shortcut to offboard.

Just like easy-to-subscribe online publications that will have you spend 2 hours on a call with a rep pushing discounts or whatever at you before you can cancel the subscription.

Just not cool


I'd characterize this asymmetry (convenient shortcut IN, standard export OUT) as a predictable, transparent, minor annoyance -- not a "dark pattern" representing deceptive or unethical practices.


OK, yeah, maybe dark pattern was too strong a term for just the annoyance...

I'd just try to stay away from Amazon if I could.



Huh. I suppose you can. The pricing looks a little opaque though, and the fact that it took 3 hours after I wrote my comment for yours to show up kind of implies it's a bit of an obscure service.

I will also mention that the AWS team we were working with on this didn't mention DMS, and, when directly asked, literally told me there was no easy way to do an Aurora -> RDS migration.


Which is weird. This has been around for a few years, and, at least I thought, was a fairly popular service.

Previously it wasn't great, but since they started using Logical Replication for Postgres, it's gotten far better than it was in the past.


Aurora significantly modifies the internals of the DBs, particularly the storage layers. It also makes large changes to how memory is used for Postgres. Query plans can be quite different than with the vanilla version. Once you tune and create indexes based on Aurora's characteristics it's going to be a pain to retune for the unmodified version. Aurora also introduces nasty bugs that don't exist on the RDS version such as a memory leak I found was periodically restarting our master. The Postgres team produces highly reliable code, but I don't trust Aurora's hacks on top of it.


thank you! I needed to read this!


The "different set of services" is so basic it can be ran anywhere, and near-everywhere is cheaper than AWS


I feel like it’s an object lesson in using the right solution for a problem. Step functions do not appear to me to be something that you’d use for things that need to be executed multiple times per second.


Having occasionally looked at them for workflow-driven tasks, I'm not sure what the use case for Step Functions is. Unless your workflow is being called once an hour or something, they seem infeasibly expensive for what they offer, and somehow manage to be more complex than just writing some code to model the workflow.


It's a BPM product. If you have a highly regulated workflow that has to be changed a lot by multiple parties, these products start to make sense. The AWS Step Functions aren't that great at BPM and workflow automation either, but I imagine AWS just wants to have a first party offering.


Yeah, this is my takeaway too.

I'm pretty happy with the monolith that we run at our business and this seems to validate our decision to stick to that monolith, but I'm also pretty confident that where we use AWS Lambda, serverless is absolutely the right way to go.

For example, I've written a Lambda application to reply to webhook calls and send API calls whenever those come in. It costs maybe $2 per month to run in compute and requests. Would that make more sense to rewrite as a monolith and run on EC2? I really doubt it.
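
The whole thing is basically this shape (a sketch with a placeholder URL, not the real code):

    import json
    import urllib.request

    DOWNSTREAM_URL = "https://api.example.com/notify"   # placeholder

    def lambda_handler(event, context):
        # API Gateway proxy integration hands us the webhook body as a string.
        payload = json.loads(event.get("body") or "{}")

        req = urllib.request.Request(
            DOWNSTREAM_URL,
            data=json.dumps({"source": "webhook", "data": payload}).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=5) as resp:
            downstream_status = resp.status

        # Acknowledge the webhook caller.
        return {"statusCode": 200, "body": json.dumps({"forwarded": downstream_status})}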


In your example, you compare Lambda against a separate monolith for handling the webhooks, but with a monolith wouldn't the comparison be between lambda and just adding a route and controller (or equivalent) to the monolith?


I'm more thinking of rewriting the application (a bunch of Lambda functions + API gateway + random bits and bobs) as a monolith and running that separately on an EC2 instance (or any other VPS).

In this article, they didn't bolt the serverless architecture onto another existing monolith, but rather rewrote the Step Functions and Lambda functions to be a single ECS task.


If you already have those instances for something else, there's rarely a good reason not to cohost. Save $2 a month and the maintenance burden.


But background tasks are a thing. You could add a webhook endpoint to your monolith that writes a background job. Then your background worker (running on the same EC2 because its hardware requirements look pretty low at $2 a month in Lambda) runs the job. $2 a month is now $0 since it's running next to your monolith on the VM you're already paying for.
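
Something like this, to sketch it out (assuming a Flask monolith and an RQ worker on the same box; all names are placeholders):

    from flask import Flask, request, jsonify
    from redis import Redis
    from rq import Queue

    app = Flask(__name__)
    jobs = Queue("webhooks", connection=Redis())

    def forward_to_downstream(payload):
        # The same API call the Lambda used to make, now done by a worker
        # process that's already running next to the monolith.
        ...

    @app.route("/webhooks/incoming", methods=["POST"])
    def incoming_webhook():
        jobs.enqueue(forward_to_downstream, request.get_json(force=True))
        return jsonify({"accepted": True}), 202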


It really depends on the quantity & scale though right? If in my entire estate I have a single shell script that I run - wow lambdas / serverless are amazing.

When I have 200 things that cost $5/mo each to run but fit nicely on a single 8-core/32GB RAM server... then this Lambda stuff starts to seem crazy expensive, right?


For some Alexa integrations it is neat and convenient, but I went back to hosting such small interfaces as a service on another server. Not an EC2 instance, just another hosted unmanaged server. They are as cheap as they can get right now.


I'm not really interested in the operational overhead that brings for these small services. Cost-wise they might be just about neck-and-neck, but at least I don't need to worry about the server going down, or having outdated software. Lambda gives me scaling, load balancing and redundancy for that $2.


From experience I say that the operational overhead to host on Amazon isn't trivial at all.

I like AWS and I would still recommend it. It saves some work but also creates new stuff to do. Especially if you also want the costs to be manageable. Automatic updates, configuring a firewall + reverse proxy with automatic certificate renewals and your favorite deployment mechanism isn't more complicated or labor intensive than managing a small application with AWS. You need to interface it just like software you run on your server.

One of the services I host needed to be authenticated by IP. Happens. You easily get a static IP on AWS for incoming traffic. No problem and cheap too. Now try to get one for the other direction... Possible too, but maintenance just became at least as labor intensive as hosting your own machine. AWS just has to fit your scenario, and I think many people overestimate how much easier it is compared to hosting a server with feasible security today. Chances are your databases would be less public than if you skip the AWS documentation.


> these small services

I think you might be unintentionally arguing with a strawman, as everyone else here is talking about using monoliths instead of that.

Few people want to administer a bunch of micro services themselves, but running a single service on a box is pretty low effort, even if you duplicate it for fail over/redundancy


By "using monoliths", do you mean bolting all code you write into a single runtime, even if they are not the same service? Because that's not what was in this article. Instead, they took Step Functions and a bunch of Lambda functions, and created a brand new monolith from that.


> define monolith software

In software engineering, a monolithic application describes a single-tiered software application in which the user interface and data access code are combined into a single program from a single platform.

"mono" stands for one/alone/singular, so monolithic is kinda defined to be exactly that, yes.

You can still have multiple monoliths, but they wouldn't communicate with each other and would be entirely separate applications.


I didn't say that my Lambda application communicates with my monolith. Just that I have a monolith and a Lambda application.


I think it is fine. There are scenarios where you need distributed and there are scenarios where you don't.

IMO, distributed software is more practical for organizing development work than for technical reasons.

We all know from the basics that performant software comes from single structures that don't require packing and unpacking data. But scaling large applications is hard, and it was much more expensive back then. Now that we've overreacted to microservices we will overreact to monoliths again. And we will bounce many more times until AI takes our jobs and does the loop itself.


[flagged]


I do not care about the downvoting it is unfortunate that there is no comment back saying why the person disagrees


That's my point. Silent downvotes.


The cynic in me (so like 93% of me) reads this as a "Instead of abandoning AWS altogether, we changed how we use AWS, but most importantly we're still on AWS"


As an ex-AWS senior dude, we never looked at our service stack as a sell-at-any-cost, but as a continuum of service offerings that could be assembled, from more cost optimal at higher operational burden up to (mostly) ops-free at a higher premium. The goal was to provide a Lego kit of power tools and disappear-from-view tools. At least in my org we never tried to upsell or convince customers of architectures that accreted revenues at their expense; we tried to honestly assess their sophistication and desire for ops burden and complexity vs cost savings by building it themselves with the lower level kit. By our measure using AWS brought us business, and we were generally more motivated by customer obsession than by soaking them. I know Andy definitely had that view and drilled it into our collective heads. In many ways as an engineering-minded person I appreciated the sentiment, as I enjoy solving problems more than screwing people out of their money for sport.


> I wouldn't be surprised if it's taken down at some point ...

Why? They're still using "AWS stuff" - EC2 and ECS etc. Serverless is a fraction of the services AWS offers.

AWS actively promotes ways of reducing customers' bills. This article could be considered a puff piece for the AWS Compute Savings Plan:

https://aws.amazon.com/savingsplans/compute-pricing/


Exactly. You could easily frame it as "if AWS seems expensive, you're using it wrong". That an internal team could get it so wrong is testament to how difficult it is to get right, but of course, there's a consultant for helping with that.


The smoking gun is probably the box that was previously labelled "Media Conversion Service" (Elemental MediaConvert - easily 5-6 figures/mo. for a small amount of snappy on-demand capacity, or crippled slow-as-molasses reserved queues) now labelled "Media Converter" running on ECS. For example, vt1 instances are <$200/mo. spot and each instance packs enough transcode to power a small galaxy, for fine-grained tuning an equivalent CPU-only transcode solution isn't that much more expensive either.

At some point the industry will wake up to the fact that the AWS pricing pages are the real API docs; meanwhile dumb shit like this will keep happening over and over again, and AWS absolutely are not to blame for it, any more than e.g. a vendor of cabling is guilty of burning down the house of someone who plugged 10 electric heaters into a chain of double-gang power extension cords.


Half of the AWS certifications aren't about what's what but about what to use when, and using it for the right use case.


Exactly right. Most cloud victims are people who have faith instead of cost calculations. DHH & co. are the prime example. It seems even Amazon has such people. I guess hiring is much harder nowadays.


That was my reaction too. I know microservices don't equal cloud, but putting a big monolith on a big server is tangential to AWS interests, to say the least!


> but there's a kind of irony it's come from an internal Amazon team

Not at all. In my time working with AWS reps, they never pushed a particular way of doing things. Rather, they tried to make what we wanted to do easier. And the caveat was always to test and make decisions on what was important to us. This isn't an anti-AWS article. Rather, it's exactly the type of thing I'd expect from them. Use the right tool for the right job.


>Microservices and serverless components are tools that do work at high scale, but whether to use them over monolith has to be made on a case-by-case basis.

TL;DR: build the right thing.

>"AWS sales and support teams continue to spend much of their time helping customers optimize AWS spend so they can weather this uncertain economy," Brian Olsavsky, Amazon's finance chief, said on a conference call with analysts.[0]

Amazon isn't afraid of this trend, they're embracing it. Better to cannibalise yourself than be disrupted by someone else

https://twitter.com/DanRose999/status/1287944667414196225?s=...

[0] https://www.cnbc.com/2023/04/27/aws-q1-earnings-report-2023....



Yeah this article seems like heresy for someone at Amazon to have written about AWS, no way it lives long.


Around 2008 the idea of microservices was looked down on, until it wasn't.

The key is to look down on nothing, become competent with multiple architectures, and know which ones not to implement in a use case if the one to use isn't clear right away.


Maybe they'll publish the opposite results in 6 months


And someone else will get promoted.


I don't read it like that at all. Both solutions use the Amazon cloud. Only in one solution you distribute a lot of processes, just because it's possible, and easy to code. When they figured out that rampant distribution was costly, they put more thinking into keeping a lot of computation in the same place (so, "monolith", but still in the cloud). No surprise, they found great savings. If they hadn't, they would not have written about it. But they had to put some (most likely major) effort into redesigning the application.


I don’t really agree that this somehow exposes those tools as bad. It more shows that they weren’t that well suited for this particular use case.


It's been online for 2 weeks already.


Yep, expect the Lambda team to raise hell.


Probably an unpopular take and my experience is almost 10 years old, but I would be surprised to see the Amazon I worked at try to bury something like this. If the product isn't what the customer wants, it isn't what the customer wants - move on and build something the customer wants.

Yes, agreed there was some funny business like not selling Chromecast, but the guiding principle was generally to make things customers want...


Do you think the Lambda team want people to use as many of their services as possible even when it's not actually appropriate and there are better architectures and approaches available? I doubt that. They probably understand that Lambda is a good service for some things and not for others, and using it as a part of deploying things to AWS is a great idea but using it where it doesn't fit makes all of AWS look bad (in particular, hard to use and expensive.)


> Do you think the Lambda team want people to use as many of their services as possible even when it's not actually appropriate and there are better architectures and approaches available?

They most definitely want that, as it would most likely mean more money (and promotions) flowing there.


Especially now that it's on the frontpage of HN :-)))


Why would that be? Somebody figured out that AWS Lambda is not the answer to every single question?


But they migrated to AWS ECS, which is still expensive serverless AWS stuff, just fully managed by Amazon.


This is simply incorrect. ECS doesn't cost anything other than what you're paying for the EC2 instances that you place your tasks on. Fargate does, but that's not what they're using.


I'm pretty convinced that microservices are one of those things that make sense 5% of the time and the other 95% is cargo culting.


I agree; my intuition would put it at 1% vs. 99% (difficult to quantify of course).

I haven't yet seen a project/product which would need microservice architecture for technical reasons. If you need to scale, you can just scale monoliths (perhaps serving in different roles).

The use case for microservice architecture is IMHO organizational / driven by high-level architecture. I've worked in a big company (20K employees) which was completely redesigning its back-office IT solution, which ended up as a mesh of various microservices serving various needs (typically consumed by purpose-built frontends), worked on by different teams. There a monolith didn't make sense, because there was no single purpose, no single product.

But if I'm building a product, I will choose monolith every time. Maaaaybe in some very special cases, I will build some auxiliary services serving the monolith, but there needs to be a very good reason to do so.


I built a little microservice on the side of my monolith for PDF creation. It used headless Chrome and Ghostscript to render HTML to a nice PDF. The problem I had with having that code inside the monolith was that it increased my Docker image creation time for deploys by a lot. And that code pretty much never changed anyway.

I did feel a bit embarrassed having to make a microservice after having argued against them so much over the years. Hopefully I can stop producing PDFs soon so I can delete the entire thing :P


That honestly seems like a reasonable use case for (what I call) auxiliary services serving the monolith. As I imagine, there's no real business logic, no data storage / transactions, no authentication/authorization (besides the service being hidden in the private network probably).


That's exactly what my team did at a former company. We generated reports through a legacy document engine because some customers cannot switch/update their report templates and so we moved the logic out of the monolith into a service to get rid of a large portion of our dependencies.

Moved the monolith to .NET Core, kept the report service on .NET Framework. A win for everybody.


I feel like it's a semantics thing. To me the meaning of microservices has undergone semantic drift to the antipattern in the article where every little component or database table is its own service with an associated "pizza team".

It's fine to just have plain "services" to do things like this where you need to leverage another OS/framework/whatever and just hive off something like PDF conversion while your core application remains a monolith.


This is absolutely the way to go in my experience. Keep related functionality together, that'll probably result in a big monolith, with maybe a few smaller services orbiting it with very specific roles, or dramatically different traffic patterns. One project I worked on consisted of two monoliths, because we were at the intersection of two business domains, and it didn't make sense to attempt to slap those radically different concepts into one model.


You can produce multiple docker images from the same codebase quite easily. You can deploy and scale them separately. None of that requires separate repositories or expensive RPC instead of local function calls.


What's the difference if they are in one repo or two if they produce two artifacts that are separated? You *will* have network calls between the two, unless you are marrying yourself to a deployment/operational platform that can run the two artifacts together. (OK, there could be a few, but I really don't see how this is just using a "monorepo" instead of a "multirepo")


Who said I have separate repositories?


Doesn't Docker use cached image layers to solve that? Your PDF rendering could be in one layer that never changes, and the rest goes on top.


The problem is when you have more than one such service. Now when one of them changes, all of them need to be rebuilt. You can solve this with multi-stage builds, but those only work if your build result can be easily copied.


Is the issue here that images shouldn't be thought of as layers, but rather a tree of cached directory nodes? I don't quite follow what's meant by building here, are you referring to compiling or merging the resulting build artifacts into a final container image?

Non-copyable build outputs sound a bit wild - you're thinking of builds that encode absolute paths into the output binaries?


In theory yes, but that doesn't work well on github actions for example. The cache layering seems quite bad there for some reason.


I feel you - I worked on a project with pdf generation and the tool to generate it was quite heavy.



There's also developer push from two sides; developers want to do microservices because it gives them gratification (new problems to solve! new architectures! Rewrite!), and employers want to attract developers (we do microservices! Blockchain! IoT!). So much in software development these days is hype and self-gratification.


Seems like Conway's law applies:-

"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure." -- Melvin E. Conway

https://en.wikipedia.org/wiki/Conway%27s_law


Came here to say this - and it applies in the other direction. Microservices allow you to split work between teams without having to coordinate deployment and iteration cadence quite so tightly.

If you have a single team, you shouldn't be doing microservices.


I mostly see organizations with multiple internal dev teams who all have shared responsibility over all of the microservices (e.g. no-one is responsible for any service). The worst of both worlds: all of the complexity of microservices architecture, without the benefit of specialization and splitting work between teams.


The organizational structure doesn't have to be reflected in the built artifact(s), though. Just look at Chrome. It has who knows how many teams working on it, but gets built into a single giant DLL (or executable, depending on your platform). And newer languages like Go and Rust make it easier to link everything into one big artifact like this.


Depends if you're counting all the WordPress level projects. Then it'd be 0.01% Vs 99.99% I'd imagine.

But that just puts into perspective how silly this argument is because I have no idea what a project means to other people.


That organizational/architecture benefit is strong though.


It’s depends on how you slice services. For one micro-service per a team I see benefits. On another end of the spectrum a single team managing 10-20 micro-services with more than one service per developer. IMHO it creates more problems than solves. Also it is usually a waste of HW resources because a library call it is cheaper then a network request.


I think it depends on more than that.

A single team managing 10 microservices that actually make sense to be microservices (like the PDF renderer example above [1]) is kinda good and perfectly manageable.

A team with one single microservice that would actually work better if it was part of a monolith is already in the "creates more problems than it solves" territory.

[1] https://news.ycombinator.com/item?id=35812294


I would frame it as a necessity rather than a benefit. Having siloed teams (services) is usually a problem which is better to avoid as much as you can.


Having autonomous teams is great for scaling and allowing everyone to go fast, without teams constantly blocking each other.

Having hundreds of engineers work in a single monolith in a single repo without any kind of (enforced) boundaries is a one way ticket to a big ball of mud. You need to invest heavily in tooling to make it work, and e.g. Google does so.

Having a network in between teams is a relatively easy way to enforce boundaries.


It allows everyone to go fast as long as the work is constrained within one service. It goes very slow once service / team coordination needs to happen and one team alone is not able to deliver the feature. This then often leads to services duplicating logic, amassing responsibilities in order to do as much as possible within "my" service to avoid this coordination bottleneck.


Then you're either not setting up your team responsibilities right, or you're not allowing cross-team contribution, both are fixable mistakes.


I'm developing an ML-based sideproject. All the modern ML tools are written in Python, which is a really good language for it. However, it is an abysmal language for writing business logic and third-party integrations, and if I have some free time I will split the whole thing into one Python and one TypeScript service.


The problem is that we are now in the golden age of Web services, where everything is headless and controlled via API, so the business logic plugging all those APIs together needs to live somewhere.

Naturally it could be a single container taking care of all those integrations.


Microservices work better if you don't trust the other teams. While having trust seems like a basic thing, this is absolutely not the case for a lot of companies.

With microservices, it is easy to see services which are down or have a high error rate or latency, have a clear API contract and call out the team for breaking the API contract, and assign costs which the teams have an incentive to reduce, or at least not increase.


Another pain with monoliths is that they can only be deployed if the entire monolith is passing all tests. When you cannot deploy your changes because someone else on an entirely orthogonal team broke something in the monolith which is not related to you it gets old really quick.

Large monolithic repos with many independent targets for testing and deployment work the best at huge scales. If you are only a few hundred engineers, monorepo with monolithic deployments and tests work fine.


I'd ask why people are merging things that break the tests? I worked on a monolith with hundreds of devs and I can count the times the tests failed because someone force-merged something in an emergency on one hand. It was generally unacceptable to merge something when tests failed; you had to have a really good reason.


Yes, this seems weird; merging breaking code is not an option. The 'breaking team' will have to wait/fix on their side, not us waiting on them for deployment of our working and tested features.


There are some extreme circumstances where pushing broken tests to production makes sense. For example, if you push a simple change to simply `return false` and disable a feature in code. In this case, the tests using it will probably fail but the desired behavior happens in production. At this point, you have a bit more time to set the tests to 'skip' while the load is shed in production. Even if you break tests on purpose, you should fix them asap as you are blocking literally every other team in the company. Thus, you need a really good reason to do so (like if you didn't do it asap, global downtime would ensue).


Currently we have the problem of merges of two working branches occasionally resulting in a broken one. How does one solve that?


Don’t merge. Rebase only. Keep a linear history.

When committing, do a ff-only of ‘main’ to your branch. Yes, this forces everyone to rebase before “merging” but in practice, this results in the least amount of failures, tests being run after you resolved any conflicts, etc.

If you can use GitHub merge queues, that solves a ton of this, and you can run tests on the final merge before actually merging instead of relying on rebasing.


"Don’t merge. Rebase only. Keep a linear history."

This. It makes life so much simpler. With teams that don't have a lot of experience with git, however, I tend to use the "Squash and Merge" feature, coupled with forcing a linear history.


Requiring a passing integration branch before merging to master. Merging the integration branch then becomes a fast-forward merge.

Alternatively, if you have a low enough merge volume, requiring mergers (by policy) to squash and rebase (and re-run tests before attempting to merge) can work too, as others have already mentioned.


1. Add a test with a time-bomb (such as a test certificate with 365-day duration), wait a year, and now your test fails without having changed (a tiny sketch of this one follows the list).

2. Add a test with a network dependency, and when that dependency is slow / down / turned off, the test starts failing.

3. Add a dependency on a third-party Github repo that clones from `main`, and the next time some dev touches a file in that repo your test starts failing.

4. Add a test that allocates memory in proportion to size of the codebase (e.g. because it tries to build a giant in-memory tarball of all the .mp4 assets). Eventually it will get flaky when it starts scraping up against the build machine's limit. Extra fun if your builds run without defined memory limits on machines of different sizes.
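
A tiny hypothetical example of the first one, the kind of test that passes for a year and then starts failing on whoever happens to be around:

    import datetime
    import unittest

    class TestBundledCertificate(unittest.TestCase):
        def test_fixture_cert_still_valid(self):
            # Hypothetical: the checked-in test certificate was generated with a
            # 365-day lifetime when this test was written.
            cert_not_after = datetime.datetime(2024, 5, 4)
            self.assertLess(datetime.datetime.now(), cert_not_after)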

In a monolithic build, there's all sorts of ways for a single person to cause other teams' tests to fail, even months or years after they've left the company. Some of them can be prevented mechanically (such as by running tests without network access), but a lot come down to "tell them to stop doing that".

That's why big companies never run one build per repo.


Hmm. We never had those specific issues. For (1), we had time bombs for sure, but those usually highlighted coverage issues. Dev ops would disable your test and tell you to fix your code.

IRT (2), network dependencies were forbidden in-general. Over any long enough timespan, the rate of failure is 100%. If you wanted to use the network, you had to consider the failure case and handle it in your tests.

For (3), all dependencies were committed as part of the repo. All dependencies had to be reviewed for any issues before being used, so this made sense. You simply weren’t allowed to just randomly include a new dependency without a review/PR to add it.

For (4), our dev environments had less memory than build machines and the same as production. If you couldn’t build it in a dev environment, it wasn’t getting committed without special treatment from dev ops (and a really good reason).


Yeah, all of those mitigations don't work as well when you've got thousands of engineers whose work would be blocked if some intern's badly written test blows up.

A monolithic build means that your ability to develop and deploy your team's code is dependent on every other team. As the number of teams gets larger, that multiplier really hurts.


It's a learning experience. If everyone learns from it, it probably (most likely) won't happen again. Everyone learns how to write better tests. And, like I said, if you absolutely need to merge something right this exact second and it can't wait until someone disables the failing test (or you can't do it yourself for some reason), you can always merge it even with failing tests.


Average tenure at most companies is 2-3 years. You cannot rely on people learning from mistakes because there are always new people; you need to make it so they cannot make mistakes, or so that a mistake won't block the whole company from executing.


Organizations learn, it gets embedded in the culture, tooling, and automation. I shared some of the rules that were embedded in our organization via style guides and onboarding.

There were a thousand automated checks to prevent you from doing the same thing as someone else that caused downtime in the past. It was virtually impossible to commit code that deleted/truncated a table, for example.


>Another pain with monoliths is that they can only be deployed if the entire monolith is passing all tests. When you cannot deploy your changes because someone else on an entirely orthogonal team broke something in the monolith which is not related to you it gets old really quick.

You can just... not allow code that doesn't pass tests into the master branch. They can fuck around in their own branch, that's what branches are for


Tests can pass before they are merged but fail after merge. Tests can also regress over time and become flaky.


1. Monolithic app and VCS mono-repo are orthogonal and can be mixed in different combinations (with different tradeoffs).

2. How about an old way of making deployments block less: parts of the monolithic app are developed as libraries with stable APIs, and a new version of a library is released only after its own tests have passed. Then you can bump the dependency version in the monolith and run integration tests; if they fail, you can revert to the old library version and still go ahead with the deployment (as long as you don't depend on something added in the latest library release). If you do depend on that new feature and the component providing it is broken, microservices would not help you either.

> If you are only a few hundred engineers, monorepo with monolithic deployments and tests work fine.

And here lies a very important problem IMHO - many (if not most) organizations which do at least some software development have fewer than 100 software developers, but the industry best practices (which include microservice architecture) are defined by FAANG-sized organizations, and at least some of these practices are sub-optimal for small shops.


It's not only tests: a rollback because your PDF generation has a bug also means rolling back, for example, a customer-facing new API, slowing down the API team until the monolith is fixed or the change is reverted and rebuilt.


Tests passing should be a gate to merge into main. Tests can also be run in parallel.


Two green changes merged to a green main can produce a red main.

This is not a commonly known fact. Just to take an example of GitHub, this check is disabled by default:

> Require branches to be up to date before merging

> Whether pull requests targeting a matching branch must be tested with the latest code.
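
To illustrate, a contrived Python sketch of the classic way this happens (all names are made up): branch A renames a helper and updates its own callers and tests, branch B adds a new caller of the old name; each branch is green against the main it was cut from, but the merged result is red.

    # Merged main after both "green" branches land (no textual conflict):
    def total_price_cents(items):        # branch A renamed total_price -> total_price_cents
        return sum(items)

    def checkout(items):                  # branch B, cut before the rename
        return total_price(items)         # NameError once both changes are combined

    def test_checkout():                  # green on branch B, red on merged main
        assert checkout([100, 250]) == 350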


> Require branches to be up to date before merging

Because this really doesn't scale well with the number of developers.

For larger teams the solution is to use a merge queue, e.g. https://shopify.engineering/successfully-merging-work-1000-d...

I haven't tried it, but GitHub is now offering such feature in public-beta: https://github.blog/changelog/2023-02-08-pull-request-merge-...


> Two green changes merged to a green main can produce a red main.

I'm sure this happens occasionally, but I've never experienced it, and it seems to be rare enough that it's not that big of a concern. Especially since it'll be easily remedied by either just fixing the error or just reverting one or both of the changes.


It's definitely an issue if you're treating main as 'good to deploy'. Also with more devs the chance of this happening is pretty good. Definitely want to enforce a branch is up to date before allowing merge (though with a lot of devs this can become difficult) not sure if there is some sort of merge/test pipeline solution out there.


It completely depends on your organization's merge volume and your codebase's complexity. I've seen it happen many times working on a large monolith. But requiring a passing integration branch build before merging to master, or a merge queue, solves that.


IMO for these rare occasions it should be okay to unmerge PRs. Or what could also be done is you only deploy cherry-picked releases, granted that you have a way of tracking commits in the upstream such that no commit ever gets lost.


That is possible, but at that point you need an integration/release team to babysit the builds and make sure everything integrates back together cleanly.

The last thing I want is to have my build broken by someone on another team and then have to track them down and babysit the revert. That is easily an hour of my time wasted.


Aren't you just hiding the flaws in the overall system by confining the tests to your microservice?

That's my experience at least. Things still break, you just notice it later.


The problem is when the company re-orgs you can easily end up with many more microservices than there are teams. Then you’re really up a creek.


I think the trust issue doesn't really justify microservices. Assuming everyone is using interoperable languages, you can still have a monolith with clear API contracts and separate ownership by using traditional libraries.

That is a good point about reliability and cost though. I hadn't heard that before.


What if some team makes their part 10 times slower? This is not a theoretical scenario, but one I saw happen many times. Technically you could partition and monitor each part of the monolith separately, but then you are just reinventing the microservice architecture.


You just run a profiler. It's better and arguably easier than microservice profiling, since it doesn't only measure at the interface, can measure fine-grained memory usage, etc.

Google "continuous profiling".

I'm not sure why you would think that that reinvents microservices.

Microservices is just taking a monolith and moving the components into separate processes that communicate via RPC.


In my experience, profiling is hard and a lot of the time doesn't show the issue, e.g. for unnamed goroutines it is hard to see which goroutine the profiler is referring to. Or if some code change increased CPU usage without an increase in time/memory, it will affect the performance of the entire monolith. Yes, a good maintainer could pinpoint the issue, but remember my premise was that it is a low-trust environment, and saying "I think your code change increased CPU usage" involves talking to managers and sitting in two meetings. With microservices, they would have to deal with their own alerts so as not to miss their SLA.

> Microservices is just taking a monolith and moving the components into separate processes that communicate via RPC.

Microservice architecture divides the responsibility much more than that. Each service has its own Redis cache, local cache, and tests, and likely even its own DB, etc.


I haven't worked on a microservice architecture yet, but this is a very interesting idea that I hadn't heard before. That micro services can potentially give greater visibility of each team's performance, improving accountability.


It's way too easy to just add a lot of friction if you go too far.

I also think it would be way healthier if teams acted as "maintainers" rather than "sole developer" of a service.

For example, if team A wants a feature in a service team B manages, they should be free (after communicating it, so there is no conflict/work duplication) to just build that feature and submit a pull request to team B.

Then team B can make sure it's up to their standard but that's shorter work than getting the whole machine of "submit a ticket for team B to add feature, find manpower to do it, and schedule work" running.


It's incredibly easy to game that architecture in a low-trust environment, though. If a team owns the interface definition of their microservice, they can just declare all callers' problems an instance of "holding it wrong".


That's my experience too. Microservices are first and foremost a technology to dilute responsibility, and if you're clever about it, you can even let it fall through the cracks completely.


I absolutely agree, buuuut also realize we as programmers don't even have the same definition of what a microservice is.

A lot of people here say...one service per team. But to me that is, or can be, a monolith. Often a team is a product line, so you have one service for that product. Is that a monolith? I don't know either, I guess.

I -do- know most people who go around promoting that sweet microservice life end up being the worst. They seem to want every db table to be its own service, introduce tons of message passing and queues, etc, for absolutely no reason. I think we can probably all agree that is about the worst way to go about it.


Yes! It is about splitting your business into reasonable chunks/domains/boundaries. Pizza rules, rewrite rules... that is all nonsense.

But doing it right is nevertheless hard, because cutting your business into chunks is not as easy as it always looks.


Every service needs its own database. Then you need to handle all the DLQ errors between services as well.


Devil's advocate, but is it possible microservices need a shared, feature-ful message-passing layer, and good tooling, to work well? E.g. schema, auth, flow control, TTL, persistence, partitioning etc. in the message layer, a la Kafka? I mean, it's kind of implied that microservices can only work if they can talk to each other, and "talking" is a lot more nuanced than we tend to think.


Comm requires massive overhead versus simply 'calling a function'. Calling a function doesn't fail. Maybe the code in it does, but not the call to the function itself.

This is why the costs of microservices often far outweigh the benefits, but people rarely consider the cost in their crusade to 'break up the monolith'.


Absolutely. But in the cases where a monolith is not doable, such as highly heterogeneous hardware requirements for different "services", it seems like the messaging stack plays a more important role (even if that's only needed in 5% of use cases). Typical HTTP request-response, which is often fine for a monolith, is not enough for building, say, task queues. A strong messaging layer can reduce the need for ad-hoc wheel reinvention.


meh

In my org in Google we average over one microservice per engineer. I'll be adding two in the next couple weeks. With the right automation setup you don't notice them any more than you do server instances.


Once I was called to a meeting in a sibling department as a cloud advisor. They wanted to migrate to AWS cloud.

The conversation went as below.

- Does your app work fine?

- Yes.

- Do you have any problems?

- No.

- Why do you want to migrate then?

- Silence.


The CFO read an article in Forbes that said we would save money by migrating to the cloud, so we now have an unlimited budget for consultants to build us a cloud platform....


- Because we are innovative and it's the future! /s


- Because boss said we have to.


Customer doesn't know their own requirements?

Your job is to figure out what they actually need even if they don't understand it?

Seems pretty par the course.

There are dozens of reasons to migrate to the cloud. Do they apply to everyone? No. Are they always worth the cost? No. But the whole "cloud vs not cloud" argument that happened, got settled ("cloud"), and is now being restarted by the DHH-likes is not data-driven and is full of exaggerations and fear-mongering from both sides.

Then you add on top of that that the main product of moving to the cloud is "operations" which is typically measured in "hours of human capital being impacted outside of core working hours". When the market is booming, tech humans are expensive and fickle, and don't want to undertake more operations than they should have to, and companies are forced to pay cloud providers.

But in today's 2023 climate, any company looking around to decide how much to spend on cloud just says "Why would we pay for something when we can just ask our engineers to work more hours, and invite them to quit if they don't like it, oh wait nobody else is hiring anyway"

No cost calculator of $$$ saved considers that overtime is free in our industry.

tl;dr the cloud backlash is overblown, more companies/businesses would benefit from cloud than not.


because they want "cloud" on ther Resume/ CV


This right here is one of the reasons I got out of software development. Not micro services in particular, but just the unthinking application of some new pattern to everything.

Everyone wants to do the new cool thing. Everyone wants it on their CV. To be followed some years later by everyone saying how awful it is, and moving on to the next fad. Rinse, repeat, round and round we go with no actual intelligence being applied.


Microservices have lately seemed to me to be a buzzword for the ears of executives and stakeholders. To someone who isn't technical enough, it seems really "cool" from the outside, but on the inside it's more often than not a shitshow, with teams and managers messing around to get these services working with each other properly while wasting a lot of time.

If you ask me, if the time and focus is invested properly, it would be much more efficient to run a monolith instead. That's what some small number of great teams end up doing.


Not only that, if you spoke out against microservices you were labeled not a 'team player' and outcast to maintain the 'old' code that runs the entire company, while a bunch of hot new devs created a mountain of crap services only to quit when they got bored.


> lately

"lateley" as in "for the last 5 years"?



Yeah that seems closer to my experience. From my perspective, 2016ish was peak. At least thats when I had to to argue the most against trying to needlessly break up services.


As for me, I have been trying to discover that 5% that cannot be done without microservices.


There's a bunch of things I'd like to have them do. If they could span across machines like clusters that would be amazing.

If I could trivially package them up and deploy them locally with intrinsically less effort and wall-time than the old way, that'd be amazing.

If I could somehow get the horizontal scaling promises and redundancy as some kind of built-in, like I can with, say, memcache, that'd be cool.

If I could do these kinds of "hard" things with them more trivially, that'd be really nice.

There's a lot of things I want them to do but it's a god-damn bull-riding rodeo every time I try to get there.

And before you reply, I know you're an expert and can do all these things trivially. That's amazing. The vast majority of the industry creates a giant fragile spaghetti knot with them and I am not a full time k8s admin nor do I want this to be a career trajectory. It should be like you know, wine, ffmpeg, imagemagick, virtualbox, lua, qemu, redis, gnuplot, lvm2, gdb, ssh, sqlite; tools like that. It's pretty easy to get them to do really nice things. Those things deliver on their promises and potential pretty nicely.

It's nice that nobody feels a need to hype curl or squid. They just work. Isn't that nice? I mean look at gdb's website: https://www.sourceware.org/gdb/ it doesn't even have CSS animations --- in fact, it doesn't even have CSS.


Did you ever play with the Erlang VM?


Absolutely. It's pretty brilliant. I thought about mentioning it. I don't hate it. It's a bit exotic to trust with the likes who tend to fill the ranks of development teams, who try to force every language into looking like C++ or Java, but I dunno, send them off to a retreat in the mountains to microdose and take lessons from an Erlang yogi for 12 weeks and have them come back.


Your asks seem easily answered with docker + kubernetes. Actually, this is in fact the use case for kubernetes — a fault tolerant distributed system running arbitrarily, simply packaged code. This has to be what you’ve tried — what issue are you running into?


Kubernetes isn't something I'd put in the same sentence as 'easy'. Docker is a close contender for the same.

I still recall the day when my local Docker builds necessitated a new router to properly manage streaming traffic at home while I downloaded a few GBs of layer images. Or the time I wanted to set up a 'simple' hosted Kubernetes cluster of my own in my lab for testing, only to discover the nightmare that is networking on it. Then there was the grim discovery that Docker containers were much more sensitive to the hosting environment than I had assumed, resulting in some fun "but it worked and tested fine" moments.

Did they all work eventually for me? Yes. Was it simple? Not by my standards.

If you find all that simple, kudos to you.


When they say k8s is "easy" they mean "for developers", not people/automation running it.

Like, compared to implementing hitless rollback over bare-metal services, the k8s way is "easy": just set some stuff in YAML and have proper health checks in your app.


Trying to use the actual software to accomplish these actual tasks. You're right though, that is the promise of the software - it doesn't deliver.

I wish I had infinite time to document all the issues. This isn't a small, nuanced, detailed thing - it falls deeply, systemically, fundamentally short, and in practice you still get the magical monolithic system it tried to kill, but now with more obfuscation, complexity and a theatrical sleight of hand to convince yourself it isn't that.

Instead of the server being configured for the monolithic app, it's now extensively and carefully configured for the myriad of containers, hostnames, configurations and connections of the containers running the microservice app.

It's in practice the same problem with a different costume.

The other promise of it being a collection of smaller constrained services running on tcp ports talking to each other ... that's nothing new. You've reinvented the idea of computer networks.


The issue I encounter is the overhead in setting up a repeatable, easy-to-use dev environment, and working out the bugs locally before I push to a prod-like system.


Like any tool, there is nothing that cannot be done without microservices. However, that doesn't mean they never make sense. Microservices have certain costs and certain benefits, and I can believe there are certain situations where the benefits outweigh the costs. It's just not most situations, but that doesn't mean it never happens. I could believe it makes sense in extremely large apps with a huge number of different groups working on them, where the communication complexity outweighs the other complexities microservices bring.


That's very hypothetical. Building a distributed system just might always be way more resource intensive than a comparable monolith, regardless of the number of developers involved.


Removing reliability inter-correlations, so that, for example, your cart API and payment gateway stay up and keep collecting orders no matter what happens to the front-end services.

But then for perfect decorrelation you'd also need independent databases behind the microservices, and queues between them for horizontal communication, and few actually go all in with that, and so they fall into the 95% who go through the motions and the effort of splitting out microservices but reap no actual benefit from it.


I had to google what 'cargo culting' meant. But I laughed when I found out.

https://en.wikipedia.org/wiki/Cargo_cult_programming#:~:text....


Do you have a link to documentation on that method of highlighting text? Hard to search for due to the all the non-letter characters.


Text fragments: https://wicg.github.io/scroll-to-text-fragment/

Currently supported on all non-Firefox major browsers. https://caniuse.com/url-scroll-to-text-fragment


Thank you.

Seems Brave is an exception amongst Chromium browsers. They don’t implement it for privacy reasons.

https://github.com/brave/brave-browser/issues/22906



The problem I see in many projects is that they start out as - or start out implementing - a microservice architecture. I think this is backwards; you should start with a monolith and separate out concerns into microservices if it makes sense, not because it's "cool."


I agree, but aside from it being seen as "cool", what drives some engineers to go microservice-first is having experienced the inability of an organization to acknowledge that they actually do need to rewrite a monolith as two or more services, or to undergo a general re-architecting of the monolith. Getting buy-in from the business is extremely difficult, as clearly communicating the actual effort and impact that re-architecting the monolith would require is nearly impossible. This is usually due to poor separation of domains via lack of modules within the monolith, spaghetti code, circular or other strange dependency trees, tables with relationships or data that should never have existed in those tables, and a whole set of other bizarre issues caused by lack of planning and general discipline by engineers along the way.

If you have a microservice-first architecture, the perception is that it's easier to describe the effort to rewrite an individual service or split it into two services, as there is a clearly delineated body of work. Bizarre service-to-service dependencies may still exist, and a poorly implemented microservice architecture is still a potential challenge.

Point being, organizations incentivize bad economic decisions on the part of engineers through the inability to recognize that rework is a necessary aspect of developing software; by constantly eschewing rework in favor of feature delivery, they send a strong message to engineers about what to prioritize.


Yeah, but writing a big chunk of new code always involves either gambling or cargo culting, until you nail the actual requirements and the design. MSA is just a methodology to contain risks from the uncertainty, and it never says you must build everything in MSA. It's often better to migrate mature code into (semi-)monolithic services.


Microservices make sense for a lot more than 5%. In fact I think it is much closer to 80/20 - 80% working on serverless, 20% not. Video streaming is obviously not going to work on AWS Lambda to begin with.


They really don't, unless you routinely put out a bunch of tiny self-contained apps.

The "pain" of figuring out how to deploy your "normal app" quickly amortizes over just how much easier and more reliable code is.


I finally realised, after using Lambda for almost a decade (I started using it when it was released 9 years ago), that instead of thinking about apps that you map to Lambda functions, you should think about features instead.

A simple example: I have a SPA that has the following features: auth (login, logout), dashboard, feature A, feature B. I can write a few very simple Lambda functions and deploy them all the same way (IaC). What do we (my team) win? We can implement each function in whatever language we want. You have a feature that is too slow? Rewrite it in Rust. You have an amazing Python lib for feature A? Use Python. What else? We almost never touch auth, so if a feature has a bug it does not impact the entire application. Security is better because we can allow individual functions to access only the parts of the infra they really need. Lambda functions can call other Lambda functions as well.

The downside is that we cannot use a shared cache, which is easy with a monolith. People need to design the boxes well: which functionality goes into which Lambda function. We also have to use distributed trace IDs to track requests.
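
To make the "one Lambda per feature" idea concrete, a minimal sketch of a single such function, assuming the standard Python handler signature and an API Gateway proxy event (all names are illustrative, not from the comment above):

    import json

    def handler(event, context):
        # API Gateway proxy integration delivers the HTTP body as a JSON string.
        body = json.loads(event.get("body") or "{}")
        # ... feature-specific logic for, say, the dashboard goes here ...
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"feature": "dashboard", "ok": True}),
        }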


I kinda thought about making a "monolithic lambda" for some of my personal tools, where there is just an interface to get the request, respond to the request, logging, and maybe some queue to talk with other components.

Basically, cut down the cruft when deploying another small self-contained feature but still keep the code running (savings of a few MB of memory are meaningless if you just have a few dozen features that might run at the same time anyway).

Then I realized it's basically reinventing the ancient idea of "application server" like JBoss and EJB... which is kinda the case for lambda anyway.


I really feel like microservices primarily solve people/team organizing problems more than it solves any computing problems.


It's called Conway's law.


The problem is more that many people don't understand the tradeoffs and when to use microservices. This becomes even more obvious when you ask them what their current architecture is and what problems they hope to solve that require a transition to another architecture.


The reason for doing microservices I was once given by a two-person developer team that had created a 15+ microservice, single-server k8s monster was: 'this is how it is done today'. Yeah, IT is like the fashion industry.


Yes but can we also consider “3p APIs that should have been a library” as microservices? It feels like that model has sneaked in as common practice but it suffers the same (and more) problems as multiple (1p) microservices.


My team owns an API monolith that hosts several completely unrelated endpoints. I keep thinking this would be a good candidate for breaking into microservices, but I do wonder if I'm buying into the hype.


Massive architectural refactors are so attractive (at least to the kind of mind that likes Factorio) and so expensive. At least make sure you've got some concrete benefits that you think might arise from breaking apart the monolith, so you can do some semblance of a cost-benefit analysis!


That is probably the best use case for microservices (although I'd just call them "services" at that point).

That's assuming they are big enough to warrant it in the first place.

IMO if it is a dozen unrelated endpoints but all of that still takes less than a dev-month to manage, it's probably entirely wasted effort to "fix it".

And if any of them grows to warrant a dev-month or more... separate that one, and only that one, out.


When you say unrelated, are you sure? Do they share -- or should they share -- a common underlying relational data model?

My biggest grief with microservices is the fact that it's effectively become a war on having a coherent logical normalized relational data model inside an organization.


This is a good question to think about. Right now, they don't have any relationship beyond serving different parts of my group, but perhaps they could be redesigned in a more cohesive way.


The core pessimal problem presented by microservices is this: someday some stakeholder is going to ask for some information to be joined together or interlinked -- and that information will have been unwisely put into separate services ... and now you'll be doing joins manually via web service calls -- over the network -- and somewhere the ghost of E.F. Codd is spinning in circles and cursing you.

I think it's actually quite rare for companies to have data so actually autonomous and unrelated that it does not logically relate to anything else in the organization.


You are. People like you nuked all productivity we had at my last company.


Well, I've been sitting on this thought for two years now, so they probably weren't exactly like me :)


Yes.

However, I think there is something hiding inside the µservices movement that is actually much more generally applicable and useful: API-first development.

And of course good old OO.


The article is about Serverless, which is not necessarily microservices


> make sense 5% of the time

Micro-services were invented by an outstanding software outsourcing company to milk billable hours and offload responsibilities in large org.

If you want to save cost and are a not-that-large business, go monolith-first. Keep it modular.


Which company? Thoughtworks?


microservices make a lot of sense organizationally where each feature team can own their own feature service.


I would argue that in such cases those are not "micro" services anymore, they are services. In that case it makes sense to develop and deploy them separately, then find a way to make them talk to each other. Microservices is a different architectural decision.


My opinion is that there is a point where that is true... but it's at a really high scale. Each team owning a separate service introduces a lot of complexity in managing all the services. There is a point where the communication complexity of not using microservices overwhelms the complexity implicit in microservices, but I think it is at a really high scale.

You also have to consider things like it now being harder for people to see the system as a holistic whole (the tricky bugs are often in the composition of components) and a lot of subtle effects that brings. Even just increasing the friction for people to move between teams, or the friction for security people to apply consistent standards across all groups.


OR each team can write their features as a python package, rust crate, go module to be included in the main app. Libraries, versioned.


And then team 1 needs to upgrade pandas to 2.0, but team 2 is still on pandas 1, so when the main app pulls them in nothing works, so you need to start a cross-team committee to schedule the work to upgrade a single library...

Separate services aren't a silver bullet, but as an fyi to the younger software developers, we tried "just have all teams work on the same code base and deployable artifact" for a long while and it didn't work very well either.


All the teams will need to migrate sooner or later, so figuring out all of the potential problems in migrating and having everyone do it at once is more efficient than each team needing to figure it out separately.


That's not how it plays out in reality. Usually nothing gets done because "upgrade this package" is never on anyone's priority list. Or teams end up doing shit like JAR shading or forking and renaming a package with some _v2 or whatever suffix to be able to support both the old and new version simultaneously in the main code base. And then of course nobody ever updates the runtime (hello, enterprise monoliths still running on Java 6/7!). It ends up being a complete mess.


I'm just grateful a language like Java has any sort of namespace solution to dependency nightmares.

I have lost track of how many times I did a git pull on a Python based solution only to find I broke all the things when I tried to upgrade one package.


Imagine a solo developer, writing an app that is composed of packages/libraries/crates from the get go.

Now in one place such an engineer uses pandas 2 and in another place pandas 1, but it is just one single app. What does it say about the quality of engineering and mental focus of a solo developer who cannot accomplish the same thing with the same API - or cannot refactor the code already written for pandas 1 to pandas 2?

Sounds to me like more of an engineering discipline and engineering mindfulness problem.

The fix is simple, with a simple rule: everyone has to use the latest major version, always.

Microservices do not make any sort of people communication go away; they move it to different boundaries - from dependencies to the business layer/interfaces, which is a lot harder to navigate and negotiate.

Imagine needing a field in your downstream service. They refuse because they don't see it as their domain, so you cram it in on your side, and what not. Ask anyone working in a microservices environment and they'll tell you it is a recurring issue every quarter, if not more often.


That's easy: we'll make the ultimate build system! It will scale, and maintain packages, and compile all the things transitively. Just give me $xx million dollars and a few years, and I'll give you the perfect solution.

Just press this button to start the upgrade build and....boom! 10,000 services and their dependencies being built on a ton of hardware; we can practically guarantee your change in dependency will be checked... Whoops, turns out your one dependency change cascaded into about 1.5% breakage....no, I don't know who owns those packages; why do you ask? That's not my job!

/s


Yeah, I see a lot of things that could be libraries packaged as services, so now each invocation incurs network latency and every transaction needs a two-phase commit. And because each service needs its own replicas, deployment pipeline, and versioned internal API, production and deployment costs skyrocket.


Because, you see, if you surround shit with other shit, that original shit doesn't look quite so bad in comparison. So take your shit monolith, surround it by shit services that distributed it across a shitty network, and now your original self inflicted shit design is just 1/3rd of the shit you gotta deal with. Totally not as bad as it used to be!


Which brings along all kinds of headaches that greater segregation solves. You can go in circles all day about this stuff.


Until one team needs a feature in another team's service, their development grinds to a halt, and the other team is not prioritizing it.

I have only seen this from the business side (I'm not a developer), but I have seen teams start coding in another teams service just to be able to proceed.

It's not always good to create silos like this either.


Sounds like someone has been in the trenches of a certain online retail company.

As a developer, I have certainly seen the same. Pretty sure this very scenario is where I heard the term "away team" used in the industry: send your folks over to change things, and under our guidance they can check in the code.


I mean even then, it's still easier just to share one code-base and then shard service aspects if you have to.

You need truly gargantuan scale before things become logically separate code-bases.


Not in my experience.


AWS has a great business model of people over-"optimizing" their architecture using new toys from Amazon and being charged through the nose for it. It's amazing how clients that are doing a few requests per second will want a fully distributed, serverless, microservice + dynamodb + s3 + athena + etc + etc, in order to serve a semi-static web app and print some reports off throughout the day, and pay 10-50k a month when the entire thing could run on a few nodes and even a managed RDS instance for a thousand bucks a month. I would argue at this point that early optimization of architecture is astronomically worse than even your co-worker who keeps turning all of your non-critical, low-volume iterable functions into lanes to utilize SIMD instructions.

Some irony in my anecdotal experience is that most places that don't have the traffic to justify the cost of these super-distributed service architectures also see a performance penalty from introducing network calls and marshaling costs.


Yes, and it attracts just the wrong kind of dev/architects. At a previous shop, we hired a cloud architect to drive our "cloud adoption". He of course bet the farm on a set of new AWS services that were barely in version v0.9 to be the backbone of the system he architected.

It quickly became clear even he had no experience with the set of tools & services he had advocated, and the whole thing went off the rails slowly & surely.

Lo and behold, 100% of existing customers are still on the on-prem offering 2 years later, and if you throw in the new customers that were shoehorned onto the AWS offering, his team has captured 2% of customer use after 2 years of effort.


But it's fun to play around with new toys!


> AWS has a great business model of people over "optimizing" their architecture using new toys from amazon and being charged through the nose for it

I was back on AWS for the first time in a few years this week and the amount of new "upsell" prompts in the console is ridiculous. Spin up an RDS instance - "hey, would you like an Elasticache cluster too?". I think AWS are very aware of this behaviour and encourage it. Simplicity is not in their interest.


I mean that’s upselling 101.


It's honestly like a cult and a desire to want to "do it right" on AWS. The last few projects I've spent so much time setting up code deploy, load balancers, certificates, SES, route 53... This newest project, I've gone to heroku with everything being basically a few clicks to get setup.


So guys we need Lambdas + Step Functions + SES + SQS + SNS + MSK + AWS Batch + S3 + Lakeformation + Cloudformation + Athena + EMR + Redshift + Aurora + SageMaker + Cloudtrail + Codepipeline + maybe some EC2s to run AWS CLI on them.

Don't forget to configure Route53, VPC, IAM and an ELB.

Great - ready to start writing your app now?

Oh wow one of those components as configured with the other components isn't behaving as expected - time to contact AWS support!

Cynically I think CTOs see all this stuff and think they'll turn all their expensive on-shore devs into cheaper DevOps because AWS is magic and you don't need to write hard app code anymore.

I'd counter that AWS forces expensive on-shore devs into having to wear an entirely new hat and be half a DevOps engineer to figure out how to make their code work on this alphabet soup instead of a Linux server.


It seems like another case of the road to hell being paved with good intentions. Most places want/need redundancy and some managed devops, so a few EC2 instances and managed RDS are affordable enough and check a lot of boxes. But after people start down this path, it seems almost irresistible to start drilling down into managed Kubernetes and Spark jobs; to ingest some events we'll just introduce Glue, and that plugs right into S3, and look how easy it is to plug in Athena, and add some quick alerting with CloudWatch - and the next thing you know you're vendor-locked and having to hire a full-time devops person with AWS experience to configure, manage, and keep on top of it all.


It is even more amazing when the entire $10k AWS setup can be replaced by a single minimally optimized monolith running on one $20/month Hetzner server that responds several times faster to most requests due to no internal latency.


I haven't dealt with high traffic systems but isn't a few requests per second well within the capabilities of a $5 VPS?


There was a blog post and discussion here about this a few months ago: https://news.ycombinator.com/item?id=34676186


A $5 instance gets you something like 1 core on a slightly dated CPU. Aka approximately as much processing power as a top of the line desktop processor 15 years ago. Aka enough processing power to fill a 25 Gb/s port with TLS data (not that you have the port to go with that).

A few requests per millisecond should be well within the capabilities of this instance, depending on the complexity of each request of course.
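
Rough arithmetic behind that claim (both numbers below are assumptions, not measurements):

    rps = 5                    # "a few requests per second"
    cpu_ms_per_request = 10    # assumed CPU cost of a typical dynamic request
    print(f"~{rps * cpu_ms_per_request / 1000:.0%} of one core")   # ~5% of one core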


Tons of the pieces you mentioned are probably not that expensive to run for a small use case, given you're only charged on demand. The cost is really in the dev ops time and expertise to orchestrate the whole affair, and in the new ways it can break.


I'd note each of the things you mentioned costs $0 at zero scale and nominal $ at small scale. But you're right, engineers new to AWS try to flex all the kit together for not much benefit. For a semi-static website all you need is S3 + CloudFront + API Gateway + Lambda + DynamoDB for state. This would cost you basically $0 at small scale, and there would be nothing to monitor. It either works or AWS is down.


I kind of see the opposite. Relying heavily on stuff like lambda has scaling limitations but it’s fast to get up and running. Built-in interactions between AWS services can do a lot of the lifting for you. And then if you find out that’s not a great fit for what you’re doing you can put in more bespoke pieces.


I actually worked on an Azure based project recently and it was very similar.

It was a small semi static contact form that was deployed on 27 web apps (9 services x 3 environments) and used a NoSQL storage, redis, serverless stuff, etc.

Insanely complex deployment process, crazy complexity and all over the place.


The thing could run on a $99 Hetzner box just fine, but that looks terrible on your CV


This really is a click bait title. They are talking about their video quality monitoring service, not their video streaming service.

It’s something they use to check for defects in the video stream - hence the storing of individual frames in S3.

Original title: Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%


The subtitle is "The move from a distributed microservices architecture to a monolith application helped achieve higher scale, resilience, and reduce costs." And the article itself mentions the 90% cost reduction. So the title seems pretty much in-line with the original intent.


But, by omission, it reads as if Prime Video rebuilt their stack without serverless and got a 90% cost reduction.

This post is going to pick up a lot of traction and I suspect these comments are going to bikeshed monolith vs microservices for the next day.

On reading it, this is for a video quality monitoring system that needs to consume and process video - generally a compute- and time-intensive task, and something not always suited to serverless, particularly when it's not easy to parallelise.

The task at hand doesn’t sound ideally suited to serverless, but the existence of the post shows that’s not readily obvious. So it’s a valuable post to explain a scenario where a few big machines is the best call.

But the sensationalism of the headline would suggest all serverless is expensive and wasteful, when in reality the same is true for a non-ideal workload on a monolith.


Serverless has such bullshit insidious pricing that makes it seem like you're saving money only to figure out you're in shit once you're knee deep in it.

For example, you'll have to read the fine print to find out that a 256 MB Lambda will have the compute power of a 90s desktop PC, because compute scales with memory. And to get access to "one core" of compute you have to use something like 2 GB of memory.

Now you may say "serverless isn't geared towards compute" - but this kind of CPU bottlenecking affects rudimentary stuff - like using any framework that does some upfront optimizations will murder your first request/cold start performance - EF Core ORM expression compiler will take seconds to cold start the model/queries ! For comparison I can run ~100 integration tests (with entire context bootstrap for each) against a real database in that time on my desktop machine. It's unbelievably slow - unless you're doing trivial "reparse this JSON and manually concat shit to a DB query" kind of workloads.

You could say those frameworks aren't suited for serverless - or you could say that the pricing is designed to screw over people trying to port these kinds of workloads to serverless.


well, they have no incentive to not make you pay for CPU time on your application startup


The problem isn't paying for cold start - the problem is they make the low-RAM Lambdas very, very niche through CPU scaling. You can easily have a 256 MB web server that talks to a database - and that's their supposed selling point - but having it served on a ~300 MHz CPU is really, really limiting, and they should be upfront about that.

If you went to a car rental and they told you we have a cheap car that's slower when you add passengers - and then you drive it to pick up your wife and it turns out it only goes 20 km/h when your wife gets in - you would be rightfully mad. You could say "why didn't you ask for specifications" but you have certain expectations of what a car should behave like and what they gave you doesn't really qualify as a car no matter if their disclaimer was technically correct.


> and they should be upfront about that.

Do you need a screenshot and a red box around the text, or would you believe me if I tell you it is written on their Lambda pricing page near the beginning? It's also written in the docs about configuring Lambda functions, so at this point it is a PEBKAC/RTFM issue, not "them not being upfront".

And frankly it is done that way because they have standardized machines; scheduling CPU-heavy/memory-light and CPU-light/memory-heavy workloads is extra complexity. I mean, they should, but they have no real incentive to, as in most cases apps written in slower languages are also memory-fatter, so it fits well enough.

> If you went to a car rental and they told you we have a cheap car that's slower when you add passengers - and then you drive it to pick up your wife and it turns out it only goes 20 km/h when your wife gets in - you would be rightfully mad.

Getting the lowest tier is more like renting a 125cc bike than a car, if anything. You can do plenty within that limit in an efficient language too.


>Do you need a screenshot and red box around the text or would you believe me if I tell you it is written on their lambda pricing page near the beginning ? It's also written in docs about configuring lambad functions so at this point it is PEBKAC/RTFM issue, not "them not being upfront"

A simple CPU-time calculator on the pricing calculator page when you enter the RAM would be sufficient, linking to said docs. Trivial to implement, and it really cleans things up when planning resource costs.


All of the things you are complaining about are well-known facts that are clearly stated in the documentation.

I don't care what the equivalent computing power is in 90s-desktop terms, because you cannot replace a Lambda function with a 90s desktop, so it is pointless.

The right approach is: I have a problem A that I can implement using AWS Lambda, AWS EC2, or your favourite DHH-approved stack; how much do these cost compared to each other?


Can you point me to where this is clearly stated in the documentation? I only found one reference, as a passing note, when I went searching for it. This would be a value displayed in the pricing calculator with a link to an explanation if they were being honest.

The 90s CPU comparison is just to demonstrate how out of line it is with what people are used to, even on lowest-tier hosts with shared CPU cores. Low-RAM compute seems to be artificially limited, making low-RAM Lambdas useful only in very narrow use cases.

For reference, I have an in-company devops team that deployed and maintained several AWS projects, including some serverless ones; even they were surprised at the low compute available on low-RAM Lambdas.


Memory and computing power

Memory is the principal lever available to Lambda developers for controlling the performance of a function. You can configure the amount of memory allocated to a Lambda function, between 128 MB and 10,240 MB. The Lambda console defaults new functions to the smallest setting and many developers also choose 128 MB for their functions.

https://docs.aws.amazon.com/lambda/latest/operatorguide/comp...

CPU Allocation

It is known that at 1,792 MB we get 1 full vCPU [1] (notice the v in front of CPU). A vCPU is “a thread of either an Intel Xeon core or an AMD EPYC core” [2]. This is valid for the compute-optimized instance types, which are the underlying Lambda infrastructure (not a hard commitment by AWS, but a general rule).

If 1,024 MB are allocated to a function, it gets roughly 57% of a vCPU (1,024 / 1,792 ~= 0,57). It is obviously impossible to divide a CPU thread. In background, AWS is dividing the CPU time. With 1,024 MB, the function will receive 57% of the processing time. The CPU may switch to perform other tasks on the remaining 43% of the time.

The result of this CPU allocation model is: the more memory is allocated to a function, the faster it will accomplish a given task.

https://dashbird.io/knowledge-base/aws-lambda/resource-alloc...
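
A quick back-of-envelope check of the ~1,792 MB per vCPU rule quoted above (the constant comes from that quote, not an official guarantee):

    def lambda_vcpu_share(memory_mb, full_vcpu_at_mb=1792):
        """Approximate fraction of one vCPU a function gets at a given memory setting."""
        return memory_mb / full_vcpu_at_mb

    for mb in (128, 256, 1024, 1792, 10240):
        print(f"{mb:>6} MB -> ~{lambda_vcpu_share(mb):.2f} vCPU")
    # 128 MB -> ~0.07, 256 MB -> ~0.14, 1024 MB -> ~0.57, 1792 MB -> 1.00, 10240 MB -> ~5.71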

> For reference I have a devops team in-company that deployed and maintained several AWS projects, including some serverless

Same here.

> even they were surprised at the low compute available at low RAM lambdas.

I wasn't, because we measured it, and based on the measurements we calculated what we wanted. I think it is a good approach not to assume anything.


I don't get the microservice-to-monolith part of this blog post either.

It does look like they replaced the serverless implementation of a service with a hosted app because this service wasn't scaling.

They don't really communicate around the architecture of the whole Prime Video product but it doesn't look like a monolith.


Yes, this is ridiculous clickbait. For once the original title is not, and the poster had to make it so... Why is dang not changing it back?

PrimeVideo is very much based on a microservice architecture. Hell, my team which isn't client facing and has a very dedicated purpose has easily more microservices than engineers.


Well, it was probably the middle of the night.


I guess all titles are clickbait to some degree. That said, the OP should have used the original title. Dan G. often corrects this mistake after the fact.


"We built a video stream processor by splitting every 1080p+, multi hour long, 30-60fps video into individual images and copying them across networks multiple times."

Not surprising that didn't go well. This strikes me as a punching bag example.

Anyone who has worked with images, video, 3d models, or even just really large blocks of text or numbers before (any kind of actually "big data") knows how much work goes into NOT copying the frames/files around unnecessarily, even in memory. Copying them across network is just a completely naive first pass at implementing something like this.

Video processing is very definitely a job you want to bring the functions to the data for. That is why graphics card APIs are built the way they are. You don't see OpenGL offering a ton of functions to copy the framebuffers into ram so you can work on them there only to copy them back to the video card. And if you did do that, you will quickly find out that you can be 10x to 100x more efficient by just learning compute shaders or OpenCL.

You could do this in a distributed fashion though, but it would have to look more like Hadoop jobs. I predict the final answer here, if they want to be reasonably fast as well, is going to be sending the videos to G4 instances and switching the detectors over to a shader language.

In general, if the data is much bigger than the code in bytes, move the code, not the data.

IO is almost always the most expensive part of any data processing job. If you're going to do highly scalable data processing, you need to be measuring how much time you spend on IO versus actually running your processing job, per record. That will make it dead obvious where you should spend your optimization efforts.
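
A rough sketch (not the article's code) of that per-record measurement; fetch_frame, detect_defects, and frame_keys are hypothetical stand-ins:

    import time

    io_s = cpu_s = 0.0
    n = 0
    for key in frame_keys:              # hypothetical iterable of frame identifiers
        t0 = time.perf_counter()
        frame = fetch_frame(key)        # network / storage IO
        t1 = time.perf_counter()
        detect_defects(frame)           # the actual analysis work
        t2 = time.perf_counter()
        io_s += t1 - t0
        cpu_s += t2 - t1
        n += 1

    print(f"per-record IO:      {1000 * io_s / n:.2f} ms")
    print(f"per-record compute: {1000 * cpu_s / n:.2f} ms")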


To be fair, it is somewhat of a punching bag example, but I think what people are reacting to, though maybe not articulating well, is the presumption in favour of microservices by the powers-that-be.

Of course the only rational take on monoliths versus microservices is "use the right tool for the job".

But systems design interviews, FAANG, 'thought leaders', etc basically ignore this nuance in favour of something like the following.

Question: design pastebin (edit, I of course mean a URL shortener not pastebin)

Rational first pass but wrong Answer: Have a monolith that chucks the URL in the database.

Whereas the only winning answer is going to have a bunch of services, separate persistence and caching, a CDN, load balancing, replicas, probably a DNS and a service mesh chucked in for good measure.

I think this article shows that this training is producing people who can't even think of the obvious first answer, they have been so thoroughly indoctrinated.


I think the realtime requirement removes hadoop as an option. They might have considered using HDFS as the data store instead of S3, since putting lots of objects into s3 is expensive. Or just using a big EFS volume instead of S3.

It would be nice to know how much latency there was in the microservice version vs the monolithic version.


You never get "realtime" in data processing. Actual realtime systems are a totally different animal. Mostly done in the embedded space, the design of a realtime processing system involves setting up fixed time windows for each task that needs compute time and optimizing the code for each task until it fits into the time window for it, on every execution, every time. This is done in order to provide hard guarantees on how fast a system can respond to new data flowing in. It's usually only safety critical systems that actually have such responsiveness and delivery time constraints.

I point this out because how we talk about a problem determines which solutions we even acknowledge as being on the table. Saying it's a realtime system when it isn't, or thinking we need realtime processing when we don't, makes people throw out solutions prematurely - and the thrown-out solutions are often the right answers.

Once you acknowledge that your system will not be "realtime" and you actually don't have the time-boxing and specific time window delivery constraints that actual realtime problem spaces have, you can weigh all of your actual options with an eye for what will be fastest and most efficient given the budget and hardware you have to throw at this problem.


This is not a discussion of monolith vs serverless. This is some terrible engineering all over that was "fixed".

Some excerpts:

> This eliminated the need for the S3 bucket as the intermediate storage for video frames because our data transfer now happened in the memory.

My candid reaction: Seriously? WTF?

I am honestly surprised that someone thought it was a good idea to shuffle video frames over the wire to S3 and then back down to run some buffer computations. Fixing the problem and then calling it a win?

But I think I understand what might have led to this. At AWS, there is an emphasis on using their own services. So when use cases that don't fit well on top of AWS services come up, there is internal pressure to shoehorn them in anyway. Hence these sorts of decisions.


This is what L6 and L7 are building at Amazon, meanwhile in sys design interviews I’m being asked to design solutions for a gaming platform with 50M concurrent users.


> This is not a discussion of monolith vs serverless. This is some terrible engineering all over that was "fixed".

I feel that's like 95% of the "we migrated from X to Y and now it is better" stories; most of the improvement comes from rewriting the app/infrastructure after learning the lessons, with only a small part sometimes being the change in tech.


I wouldn't be surprised if the actual story underneath was that they got to a "works well enough" implementation and then forgot about the inefficiencies until someone looked at costs, connected the dots, and went "ok yeah we need to optimize this architecture."

I've seen some staggering cost savings realized because someone happened to notice that an inefficient implementation that wasn't a problem two years ago at the scale it was running at back then did not age well to the 10x volume it was handling two years later. The reason it hadn't fallen over was that horizontal scaling features built into the cloud products were able to keep it running with minimal attention from the SRE's.


To the contrary, from my time at Amazon, I felt that developers want to use more high-level AWS services. Unfortunately, the landscape of AWS services is evolving so rapidly that Amazon engineers themselves can't keep up and end up using the wrong service.

As mentioned in other comments, there are options such as Fargate that would still technically be "serverless" and still yield similar cost reductions. Not to mention that AWS also has Step Functions Express for "on host orchestration" use cases. This seems like a case where neither the original architecture nor the new one was very well researched.


> Fixing the problem and then calling it a win?

It is a win. Just not the win they're alluding to.


Next they will transition to on premises hardware from the cloud to save another 90%.... oh wait...


It turns out taking it offline has yet another 90% reduction in cost.


And vastly improves security!


From Amazon's PoV, AWS is on-prem ;-)


Tongue in cheek, but

> Amazon Web Services, Inc. is a subsidiary of Amazon

So it’s technically another company.

Another comment seems to confirm this akshually comment ^_^’

https://news.ycombinator.com/item?id=35812230


It’s still all Amazon, the single publicly traded company. Legal shenanigans/optimizations don’t change that. The other commenter was referring to AWS the org over Amazon Retail or Devices (other orgs).


Context: I worked at Amazon Retail for 10 years.

Amazon Retail and AWS are the same legal entity for stocks, but other than that they might as well be separate companies.

Retail uses AWS with all the same APIs and quirks as any other company. The only thing different is the negotiation on price (which many large companies also do).

Meanwhile, AWS is apathetic towards feature requests from Retail, and especially operational support for Retail.

In many ways Retail would be better off if it was a separate company and could threaten AWS with a multi-cloud diversification play.


GECKO all the way. :) I think AWS gave a reasonable price to Retail. The migration caused the biggest outage of the website but at the end there was some pretty nice cost saving on the YOY infra cost.


I worked on both sides. I mostly agree except there are cases of important projects including AWS (like some of the ML work), also the whole aws usage discount/pricing thing is pretty huge and clearly the value in being within the same company. Retail would have a pretty hard time existing nowadays if they weren’t connect to aws imho.


I imagine they will transition to bare metal as the next step.


Don't. There's no benefit to using metal as opposed to the largest virt (which will take up the entire server anyways) pretty much. Metal just tends to be somewhat less reliable. Source: I work here.


Sure, mine was a tongue-in-cheek comment, but there are cost benefits of bare metal in some use cases, especially if your workload is more or less predictable.


I wouldn't be surprised if AWS started an on-prem hardware leasing service. Some companies are providing "On-Premise as a Service" solutions.


They already do ^_^. AWS Outposts[1]

[1]: https://aws.amazon.com/outposts/


Isn’t that what AWS Outposts are? Leased hardware with a subset of AWS services running on them.


Could you name names? I am very interested in this!


Hetzner ?


What a delightful euphemism for "leasing".


lol. Amazon is literally where microservices became mainstream.


The title is editorialised to be clickbait. The original title is "Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%".

They changed a single service, the Prime Video audio/video monitoring service, from a few Lambda and Step Function components into a 'monolith'. This monolith is still one of presumably many services within Prime Video.


The subtitle is "The move from a distributed microservices architecture to a monolith application helped achieve higher scale, resilience, and reduce costs."

And the article itself mentions the 90% cost reduction.

So the title seems pretty much in-line with the original intent.


The title makes it sound like Prime Video abandoned microservices all-together, but in reality they only did so for a single service.


Prime Video has hundreds of teams, VQA is a tiny team that owns a very specific QA service. Omitting that distinction from the title absolutely is clickbait.


The worth here is that Amazon is writing about not going into AWS PaaS native programming (what Lambda is) because it is too expensive for them.

That has some newsworthiness and the title kind of reflects that.


> The worth here is that Amazon is writing about not going into AWS PaaS native programming (what Lambda is) because it is too expensive for them.

…and going to a newer AWS service (ECS), instead.


IMHO, ECS is closer to IaaS than PaaS, and therefore the lock-in is much less bad than with the serverless PaaS approach


> newer

A ~9 month difference almost a decade ago, so not substantially newer.


I wish this was a good condemnation of microservices in a general use case but it is very specific to the task at hand.

Honestly, the original architecture was insane though. They needed to monitor encoding quality for video streams so they decided to save each encoded video frame as a separate image on S3 and pass it around to various machines for processing.

That is a massive data explosion and very inefficient. It makes a lot more sense that they now look for defects directly on the machines that are encoding the video.

Another architecture that would work is to stream the encoded video from the encoding machines to other machines to decode and inspect. That would work as well. And again avoid the inefficiencies with saving and passing around individual images.


> Another architecture that would work is to stream the encoded video from the encoding machines to other machines to decode and inspect. That would work as well. And again avoid the inefficiencies with saving and passing around individual images.

No, that’s still a bad architecture. Bandwidth within AWS may be “free” within the same AZ, but it’s very limited. Until you get to very very large instance types, you max out at 30 Gbps instance networking, and even the largest types only hit 200 Gbps. A single 1080p uncompressed stream is 3 Gbps or so. There is no way you can effectively use any of the large M7g instances to decode and stream uncompressed video. (Maybe the very smallest, but that has its own issues.)

In contrast, if you decode and process the data on the same machine, you can very easily fit enough buffers in memory, getting the full memory bandwidth, which is more like 1Tbps. If you can process partial frames so you never write whole frames to memory, you can live in cache for even more bandwidth and improved multi core scalability.
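
A quick back-of-the-envelope check on that 3 Gbps figure (a sketch, assuming 1080p at 60 fps and 24 bits per pixel; different bit depths or chroma subsampling shift the number a bit):

    # rough sanity check: uncompressed 1080p60 at 24 bits per pixel
    width, height, bits_per_pixel, fps = 1920, 1080, 24, 60
    gbps = width * height * bits_per_pixel * fps / 1e9
    print(f"{gbps:.2f} Gbps")  # ~2.99 Gbps per stream, before any overhead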


Ah. I was thinking that the encoding machines were not bandwidth limited but rather cpu limited as they were doing expensive encoding algorithms. So I was thinking the streams were streaming out at less than real time. I figured this was better than the dual/multi encode method I think they are now relying upon when all the detection code doesn’t fit on the same machine as the encoder.


This is less an example of why serverless was bad and more an example of using unsuitable services for tasks they were not meant for.

In this case they were using AWS Step Functions, which is known to be expensive ($0.025 per 1,000 state transitions), and they wrote: > Our service performed multiple state transitions for every second of the stream

Secondly, they were using large amounts of S3 requests to temporarily store and download each video frame which became a cost factor.

They had a hammer - and every problem looked like a nail. In my experience this happens to every developer at a certain stage when they first pick up a new technology; it doesn't mean that the tech itself is bad - it depends on the scenario.
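
To put a rough number on the orchestration cost alone (a sketch; the 5 transitions per second is a made-up figure, the post only says "multiple"):

    # Standard Step Functions at $0.025 per 1,000 state transitions
    price_per_transition = 0.025 / 1000
    transitions_per_second = 5           # assumption; the post only says "multiple"
    cost_per_stream_hour = price_per_transition * transitions_per_second * 3600
    print(f"${cost_per_stream_hour:.2f} per stream-hour")  # ~$0.45, before Lambda or S3 costs

Multiply that by thousands of concurrently monitored streams and the orchestration line item alone gets painful fast.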


Sending video frames between services is expensive, and paying per state transition for something that transitions multiple times per second for a single stream is also expensive...

Like, did they even think about cost when designing this the first time?


Yeah completely insane original design. A design I would expect from a first year intern who is just trying to make his first project work and is picking random technologies to string together.


Considering they don't actually pay the bill for this and it is internal accounting, probably not. Belt tightening has probably pushed cloud providers to figure out if they're wasting stuff they could put to better use, and I assume when it launched and nobody was watching Prime Video, inefficiencies were both smaller and less noticeable.


Their team/org has resource budgets too.


Oh, absolutely. This makes the Prime Video team look more profitable on paper. But also all streaming services were pretty much launched with an expectation of taking losses for years, so Prime Video being expensive doesn't look unusual for a while. And since it's an internal cost, it's not actually Amazon paying someone, there's really not a significant reason for someone outside the Prime Video team to say "hey, you're too big of an AWS customer".

More than likely, Prime Video making their numbers look better makes AWS' numbers look (slightly) worse, because they're doing a little less business. In the overarching grand scheme of things, this will save Amazon some amount of physical computing resources they weren't getting paid by an outside customer for, but good luck figuring out how much that actual real world savings is.


>Like, did they even think about cost when designing this the first time?

Obviously no, only after managers complained.


It stinks of a lack of very basic engineering skills to me combined with a large dose of CV-driven-development.

The latter of course helping Amazon market "serverless" to the unwashed masses as a "solution".


why should they, they're richer than God!


You don't get (and stay rich) by wasting all your resources.


> The main scaling bottleneck in the architecture was the orchestration management that was implemented using AWS Step Functions. *Our service performed multiple state transitions for every second of the stream*(???), so we quickly reached account limits. Besides that, AWS Step Functions charges users per state transition.

This is so obvious in my head. I can't think of a single good reason why SFN makes sense here.


I'd be surprised if this doesn't get taken down as it casts AWS lambda in an unfavorable light (and rightly so). That's the impression I have of Amazon's leadership but maybe I'm wrong.


> We designed our initial solution as a distributed system using serverless components (for example, AWS Step Functions or AWS Lambda), which was a good choice for building the service quickly.

The message seems more that they outgrew AWS lambda but that lambda was a good choice at first.


The post literally says that they could hit only 5% of the expected workload with their serverless architecture, so IMO it is still quite negative.


> The post literally says that they could hit only 5% of the expected workload with their serverless architecture, so IMO it is still quite negative.

Emphasis on "their serverless architecture". Sometimes good tools are used poorly.

For example, they describe a high-throughput workload, with each workload spread across a bunch of Lambdas that handled bite-size bits of the workflow. Also, they managed the workflow with Step Functions. Just imagine the number of network calls involved to run a single job, let alone all the work pulling data to/from a data store like S3 into/out of a Lambda. I'd guess the bulk of their wall time was IO to set up the computation.

Of course you get far better performance if you get rid of all these interfaces.


5% of the expected workload for a Prime Video service is probably more than 99% of the workloads of the readers.


Well, they do work for Amazon, so they can't say Lambda sux. A monolith is way faster to develop, especially the CI/CD part, so no, if they had started with a monolith there would have been no downside.


> I’d be surprised if this doesn’t get taken down as it casts AWS lambda in an unfavorable light

“There are use cases where Amazon EC2 and Amazon ECS are a better platform than AWS Lambda” is…not actually a message that anyone involved in AWS has ever been afraid to put forward.

I mean, the whole reason that AWS has a whole raft of different compute solutions is that, notionally, removing any one would make the offering less fit for some use case.


The solution was using a different array of AWS resources so I don't see how anything is being cast in a bad light. Lambda is great for many use cases.


> I'd be surprised if this doesn't get taken down as it casts AWS lambda in an unfavorable light (and rightly so).

The article mostly lays the blame on Step Functions. Also, Lambdas are portrayed as event handlers that run relatively infrequently. This means long-running tasks that are run occasionally, or events that don't fire that often. Once throughput needs go up, or your invocation frequency gets closer to the millisecond, the rule of thumb is that you already need a dedicated service.


Storing individual frames in S3??? Insanity! Their initial distributed architecture is unbelievable.


Indeed, it does seem rather ridiculous at face value. On the other hand, I have coworkers that run CPU-IPC bound workloads inside x86-64 docker containers on M1 macs (incurring the overhead of both machine code emulation and OS virtualization). I have other coworkers sweating for hours whether to use 32-bit or 64-bit integers for APIs designed for microcontrollers running at 300Mhz. I have even more coworkers writing stuff in rust because it's "memory safe" and "so fast", but they have no idea that they're doing thousands of unnecessary heap memory allocations per second when I naively start asking questions in a code review.

Even really smart, capable people in general have really poorly calibrated intuition when it comes to the intrinsic overhead of software. It's a testament to the raw computational power of modern hardware I guess. In the case of AWS, it's never been easier to accidentally a million dollars a month.


> rust because it's "memory safe" and "so fast", but they have no idea that they're doing

That summarizes hype-based design very well.


> AWS Step Functions charges users per state transition

Apparently they didn’t know about the EXPRESS execution model, or the much improved Map state. The story seems to be one of failing to do the math and design for constraints rather than an indictment of serverless.

I have to agree with others - it is amazing this article saw the light of day.


Someone should save the article :-D


They should've serialised bitmaps to JSON and used SQS instead. /s


Over 15 years ago now, I was an intern at Toyota. We were working with an in-house python based framework for doing cool/terrible drive-by-wire things with test cars.

I had a project to work around a bottleneck of the framework. It could only process about 70 CAN frames per second before running out of CPU. The vehicle's CAN bus had several thousand per second, though. At the time I was able to fix the problem by adding filtering to the CAN adapter's kernel module.

A couple of years later, I worked on replacing the Python-based framework with C++. I discovered the underlying root cause of the bottleneck. Someone (cough, my manager) had figured out a very "pythonic" way to extract bit-packed fields from the 64-bit CAN frame payloads. They converted every 8-byte payload buffer into a canonical binary representation, i.e. ASCII strings of 1's and 0's. They then used string slicing syntax to extract fields. Finally, they cast the resulting substrings back to integers. Awesome!

I've since used python many times to process CAN frames in realtime, scaling up to thousands of frames per second without the CPU breaking a sweat. One trick is to use integer bit shifts and masks rather than string printing, slicing and parsing...
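
For the curious, a minimal sketch of the difference (the 12-bit field at bit offset 12 is made up for illustration; real CAN signals also bring endianness and scaling into it):

    payload = bytes.fromhex("123456789abcdef0")   # one 8-byte CAN frame payload
    value = int.from_bytes(payload, "big")

    # the slow way: format as an ASCII string of 1s and 0s, slice, parse back
    bits = format(value, "064b")
    field_slow = int(bits[12:24], 2)

    # the fast way: shift and mask the integer directly
    field_fast = (value >> (64 - 12 - 12)) & 0xFFF

    assert field_slow == field_fast == 0x456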


>They converted every 8-byte payload buffer into a canonical binary representation, i.e. ascii strings of 1's and 0's.

Honestly sounds like something I'd do but I've never programmed anything more dangerous than a toaster let alone a car.


Horribly inefficient code is a wonderful thing at a small scale. The faster you solve your problem, the sooner you can solve the next problem.

I once threw together a mylar balloon helium blimp in the shape of a Dragon space capsule. My goal was to fly it over the cafeteria crowd at SpaceX during the C2 launch. For control, I used the PCB of a travel wifi router. I soldered three small DC motors to its LED outputs. The embedded software consisted of something like:

nc -l -u -p 10000 | bash

I then connected my laptop to the access point and ran a python script that would send UDP packets containing shell commands to toggle the LED GPIO pins based on arrow keypresses.
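
Roughly this sort of thing on the laptop side (a sketch only; the router address, LED names, and key mapping here are placeholders, not the real values):

    import curses, socket

    ROUTER = ("192.168.1.1", 10000)        # placeholder address of the router running nc | bash
    LEDS = {curses.KEY_UP: "led0",         # placeholder LED names under /sys/class/leds/
            curses.KEY_LEFT: "led1",
            curses.KEY_RIGHT: "led2"}

    def main(stdscr):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            key = stdscr.getch()
            if key == ord("q"):
                break
            if key in LEDS:
                # each keypress ships a shell one-liner for the router's nc | bash to run
                cmd = f"echo 1 > /sys/class/leds/{LEDS[key]}/brightness\n"
                sock.sendto(cmd.encode(), ROUTER)

    curses.wrapper(main)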

The crowd really enjoyed the novelty. After the excitement was over, I flew it around some more in the cafeteria. Elon Musk walked up to it floating in the air, paused for a few seconds, then looked around the room trying to find the operator. I was just like any other employee hanging out at a table casually typing on my laptop, though.

Good times. On my last day there I still had a helium tank under my desk. So, I filled up a life-sized Elmo balloon (a left over prototype), then let it float up into the rafters of the office. It was presumably up there for a month or two.


> For control, I used the PCB of a travel wifi router. I soldered three small DC motors to its LED outputs. The embedded software consisted of something like:

> nc -l -u -p 10000 | bash

That's a neat idea. Did you have to flash it with a custom firmware or do they typically come with netcat etc installed?


The router had openWRT support. Once I flashed that on, it had minimal versions of all the usual cli tools through BusyBox.


You would probably do a little bit of research after seeing the performance of your code. It's one thing to code the prototype sloppily, it's another to push it to prod.


Also ctypes


Solution looking for a problem


How would you do it?


Lots of opportunities short of rearchitecting: use batching; use multi threading in lambdas; use S3 range requests; use the EXPRESS execution model; etc, etc


Dead horse and all that but please just stick to Boring Tech, it is better for your mental health, not to mention your business, development velocity, defect rate, etc.

Most importantly it's good for mental health though.


Not good for resume padding hype chasers. Especially the managerial types who never need to actual write the code.


Microservices are just an architectural pattern, and like all patterns there are places where they are highly appropriate, and others where they are inappropriate.

Same for cloud, same for <pattern>

If everything is a hammer you'll hurt your thumb/hand/arm.

At least now (for some time) the pattern is named, so broadly when talking about this sort of thing, the name conjures up the same/similar image in everyones heads.

There are all sorts of inputs to the choice of architectural patterns, including budget, scalability (up and down), criticality, security, secrecy, team skills and knowledge, preference, organisational layout, organisation size, vendor landscape, existing contracts, legal jurisdiction ....


As a former AWS employee I can almost guarantee that the person that made the original design got a promotion over it.


As a never-been AWS employee, I can almost guarantee you the original design was most likely simple, the use of Lambdas and Step Functions was a good choice and not expensive, but the functionality grew and the cost skyrocketed. This is just the normal evolution of a service.


They put individual video frames as images in S3. That’s absurdly dumb. It’s like putting a frame buffer on an HDD.


In any normal company, not one that profits from such dumbness :)


Clickbait title. The expensive part was passing around individual frames and the associated S3 operations. It's not clear if they could've kept a distributed architecture but made the work units be chunks of frames or even whole videos. Monoliths can inefficiently use S3 and other cloud services to rack up a huge bill.


I don't want to come off too harsh on this, but it sounds like the service didn't meet the initial design requirements?

Some of this would have been really easy to predict (eg. hitting account limits) if they simply took the time to calculate how many workflow transitions they'd need to execute for the load.


If you came to me with a design that included passing individual video frames through S3 instead of RAM I would honestly think you were joking. What a wild article.


I’m all for big, fast, monoliths - but I’m not sure I want to hear it from the team that saved video frames to s3 in their AWS Step Function video encoder.


AWS is truly a customer-first company. I was an AWS customer in its early days (2006-2012) and have been again recently (2022-now), and they have been consistent in being customer-first. In the last year, they have proactively helped us cut our AWS spend by multiples. I'm not surprised at all by this article coming from within Amazon. Kudos for maintaining such a culture.


The headline is a bit of a misnomer. This happens in large businesses all the time (which isn't to say it's "good", hardly is, but it suggests the causation is incorrect here, which then indicates the conclusion is entirely off-base):

1) We have sexy new product! Everyone use it so we have some use-case stories to tell and we look credible! Who cares if it's not the right tool for the job! We need a splashy way to use hackneyed business speak like "we're eating our own dog food" at the next user con so all the IT middle managers there will fight over early access and adoption. PROFIT! (Screams of technology teams in the background of "a knife is the most expensive, useless pry tool you can buy, but whatever, you are not listening, mmmkay").

2) A few quarters/years later (if you're lucky and you made it or someone with enough gravity in their title finally saw the light): Why is expense so high in this business unit? This is insane! Let's go back to a more sane architecture. (Screams of technology teams going back to what was working in the first place, but was not sexy nor necessarily new now that no one is watching and hype cycle is over)

Does this mean that serverless is useless? Dumb? Uneconomical? No way. For bursty, very short running workloads, it can be GREAT and INCREDIBLY economical.

What is useless and "dumb" is whomever thought that Prime Video's encoding workloads were going to do anything but increase cost and were somehow a fit for a system whose business case specifically necessitates bursty, shorter workloads that are primarily scale-to-zero for significant periods of the day/week/month.

It was a marketing stunt gone horribly wrong: intentional or not, but that doesn't repudiate the value of "serverless" for the right workloads, it just proves you better really understand the technology and the business case and the scale economics, and that goes for any technology.


What are the developers doing, though, if they’re not diagnosing why Reaper isn’t communicating with the Zanzibar service registry?


amazon product ditches amazon product for another amazon product?

feels very strongly like they just moved from one AWS platform to another.

delay between asynchronously communicating processes differs in these architectures, and I suspect they were unable to orchestrate microservices to match the RPC-"inside"-a-monolith model. Nobody can: it only matters if your IPC is causing delay you can avoid.

Most of us aren't in a room where the real cost is high: 90% of computers are more than 90% idle 90% of the time. Amazon is not in that cohort.


I think what most people are missing here is that they used AWS Step Functions in the wrong place. Part of the blame here is that, in its overenthusiasm to get more users, AWS doesn't properly educate customers on when to use which service. Worse, for each use case AWS has dozens of options, making the choice incredibly hard.

In this case, they probably should have used Step Functions Express, which charges based on duration as opposed to number of transitions, since they're looking for "on host orchestration": orchestrating a bunch of things that usually finish quickly and run over and over many times. Standard Step Functions is better when workflows run longer and exactly-once semantics are needed. Link for reading about the differences between Express and standard Step Functions: https://docs.aws.amazon.com/en_us/step-functions/latest/dg/c....

This also exemplifies something I learned while at Amazon & AWS: Amazon themselves don't know how best to use AWS. This is one of the great examples. I'll share one more:

- In my team within AWS, we were building a new service, and someone proposed building a whole new microservice to monitor the progress of requests to ensure we don't drop requests. As soon as I mentioned the visibility timeout in SQS queues, the whole need for the service went away. Saving Amazon money ($$) & time (also $$). But if I or someone else hadn't mentioned it, we would have built it.

I dont think serverless is a silver bullet, but I don't think this is a great example of when not to use serverless. It helps to know the differences between various services and when to use what.

PS: Ex Amazon & AWS here. I have nothing to gain or lose by AWS usage going up or down. I'm currently using a serverless architecture for my new startup which may bias my opinions here.


Worth mentioning, as noted in other comments, that moving video data around at that scale was a bad choice to begin with. They could have considered Fargate, avoided moving the data around so much, and realized similar reductions in cost. So the wins are not really coming from moving to a monolith as much as from optimizing away unnecessary data transfers.

If the article had said Fargate, which is technically still serverless, we could have avoided a whole microservices vs. monolith debate or serverless vs. hosts/instances debate.


I work in streaming video, specialise on AWS, and have enjoyed using Step Functions for certain (non-video) projects. I am _astonished_ that Step Functions + S3 was even considered as a starting point for defect detection in streaming video. Astonished.


> Moving the solution to Amazon EC2 and Amazon ECS also allowed us to use the Amazon EC2 compute saving plans that will help drive costs down even further.

So various parts of Amazon have to work through the same AWS pricing programs that the rest of us do?


There are internal discount rates per service (IMR), but there's no such thing as a free lunch

Also, Prime Video isn't part of AWS but the consumer / devices / other part of (retail) Amazon.

Source: worked there


Yes, to keep track of costs.


Also, they might actually be different legal entities.


Everything old is new again.


This. All trends are cyclical. Microservices have a purpose. Monoliths have a purpose. They are not mutually exclusive. One is the path to the other but there may also be resets along the way. I spent 10 years doing microservices and now I'm back to a monolith. It's a refreshing change but it's also a project in its infancy. Breaking that out over time will only happen as and when needed.


Not really - what they realized is that the billing model was not well aligned to what they were doing.


I quite like the idea of viewing run cost as architecture fitness function https://www.thoughtworks.com/radar/techniques/run-cost-as-ar....

If your architecture has a high cost to develop, test and run when a cheaper architecture meets your needs, it's a sign that you have overengineered. In my experience there is an order-of-magnitude increase in complexity by adopting microservices that only starts to pay off when your org and user base are huge.


I love how some developers jumped on the serverless bandwagon with some of the least "serverless" workloads first

"Let's make our entire website serverless now" erm, no?

It's cargo culting of the worst kind


It's the same story as NoSQL. "Let's migrate our transactional data that requires strict referential integrity to CouchDB... Oh, wait..."


Waiting for the next wave of tech misuse due to LLMs and ML!


There's a lot of room for "chinese-room" automation:

Write a verbal description of your cloud function and let the LLM simulate the execution.

Very cheap to develop. Very expensive to execute.


I think that would be done via code creation unless the function needs LLM qualities. But LLM TDD where both the tests and code are autogenerated could be a thing for sure. And it will be microservices so that each service is easy to generate by LLM!


Prompt: Is the following number even or odd...


> Prompt: Is Avogadro's number even?

> ChatGPT: Yes, Avogadro's number is even. The value of Avogadro's number is approximately 6.022 x 10^23, and since it ends with the digit 2, it is an even number.

Right answer, wrong reasoning.


Well, even outside of the world of computers, you have sheeple everywhere who will do what they're told without questioning anything.

Understanding this behavioralism will get you through many situations in life.


Somewhat interesting article, but this isn't a monolith, at least not by a microservice fanboy definition.

The product (Prime Video) is still built using many business oriented services. Furthermore, this service appears to be developed and operated by a single team.

That being said, there are some lessons here - there are good ideas in most design paradigms, but if you take them to the extreme, you're going to see some weird side effects. Understand the benefits and engineer a balanced solution.


I think serverless has its place, but this problem doesn't seem like a fantastic fit.

We are looking into serverless as a way to exhibit to our customers that we are strictly following certain pre-packaged compliance models. Cost & performance are a distant 2nd concern to security & compliance for us. And to be clear - we aren't necessarily talking about actual security - this is more about making a B2B client feel more secure by way of our standardized operating model.

The thinking goes something like - If we don't have direct access to any servers, hard drives or databases, there aren't any major audit points to discuss. Storage of PII is the hottest topic in our industry and we can sidestep entire aspects of The Auditor's main quest line by avoiding certain technology choices. If we decided to go with an on-prem setup and rack our own servers, we'd have to endure uncomfortable levels of compliance.

Put differently, if you want to achieve something like PCI-DSS or ITAR compliance without having to convert your [home] office into a SCIF, serverless can be a fantastic thing to consider.

If performance & cost are the primary considerations and you don't have auditors breathing down your neck, maybe stick with simpler tech.


Overall, as stated in the article, what to use is a case-by-case choice. My experience tells me it's always a good idea to start with a monolith, but I don't know enough about PII to tell you your idea is over-engineered. I feel there are better ways, though. Also, you don't need to use Lambda to avoid being on-prem; EC2 is enough.


I am big fan of django's apps model ... what I like to call a "Modular Monolith".

Being an early engineer at most of my stints, I have built and scaled multiple startups using the approach and it has never failed me; the pitfalls of microservices are not worth it unless absolutely necessary.

I always made it a point to group by business logic rather than split along whatever curveball "new tech" throws at me.


Two naive ideas that may be OK as a going-in position:

- granularity

- bandwidth negligibility

Breaking everything down to a gnat's ass might improve testability, but is testability the product? Do I really need a Java stack trace that reads like an Andrew Wiles proof?[1] Maybe I do, at scale.

Then there is the non-zero cost of the packet shuffling. Every edge in the architectural graph, not just the nodes, costs. But we just throw an await into the code and move on to the next line. No biggie.

What was most interesting was "It also increased our scaling capabilities." Granularity was supposed to let "serverless" absorb the entire universe, I thought.

At a higher level of abstraction, maybe The Famous Article is a map/reduce job: the requirements dissolved into solution, and a proper number of components precipitated out.

[1] https://en.m.wikipedia.org/wiki/Wiles%27s_proof_of_Fermat%27...


> The second cost problem we discovered was about the way we were passing video frames (images) around different components. To reduce computationally expensive video conversion jobs, we built a microservice that splits videos into frames and temporarily uploads images to an Amazon Simple Storage Service (Amazon S3) bucket. Defect detectors (where each of them also runs as a separate microservice) then download images and processed it concurrently using AWS Lambda. However, the high number of Tier-1 calls to the S3 bucket was expensive.

Taking "malloc for the Internet" [1] a bit /too/ literally there.

[1] https://aws.amazon.com/blogs/aws/eight-years-and-counting-of...


https://grugbrain.dev/#grug-on-microservices

> grug wonder why big brain take hardest problem, factoring system correctly, and introduce network call too

> seem very confusing to grug


Seems somewhat curious that they didn't at least include Fargate. Feels like they jumped all the way from the typical overengineered setup into using AWS in a way that's very close to just "I need virtual machines".


Absolutely. Neither fargate nor step functions express. Seems like they did not evaluate all the options before making the jump.


I've never seen successful microservices if the starting point is not a monolith. The most successful ones I've seen are hybrids where the parts that need to be scaled are refactored into microservices to run in parallel.


Bang on. A friend I work with used to say "microservices are for scaling teams, not tech" which I liked.

Even with monolith -> microservices I've seen it go wrong. In one Go application I worked on, it would take a senior engineer a week to add a basic CRUD endpoint because the code had been split into microservices along the wrong boundaries. There was a ridiculous amount of wiring up and service-to-service calls that needed to be done. I remember suggesting a monolith might be more appropriate, and was told it used to be a monolith but had been "refactored to microservices"...

This type of stuff can literally kill early stage companies.


Lambda, Step Functions, ... are just a pure marketing scam to me, because the price is ridiculously high for 99% of real-world use cases.

They're good enough to deliver an MVP quickly, though, but that's about it.


Shipping around individual video frames between components is really an astonishingly bad idea.

Microservices seem to be a decent idea with a terrible name. The idea of running services that are small enough that they can be managed by a single team makes sense - it enables each team to deploy their own stuff.

But if you break things down further, where you need multiple "services" to perform a single task, and you have a single team managing multiple services - all you do is increase operational & computational overhead.


The only time it makes sense to use edge/serverless anything is for lightweight APIs and rendering HTML to end users so they get the page loaded as quickly as possible. That's the only use case that's good for edge, plus any supporting infra that can help deliver rendered pages asap (like a KV store on the edge for storing sessions, a lightweight database on the edge for user profile data, queues, etc.). Anything that requires a decent amount of processing should not live on the edge/serverless. It defeats the purpose.


> The only time it makes sense to use edge/serverless anything is lightweight APIs and rendering HTML to end users so they get the page loaded as quickly as possible. That’s the only use case good for edge.

Serverless and edge aren’t the same thing.


Nope. Edge is just serverless that is closer to your user to reduce the number of network hops. Both are essentially the same when it comes to technical functionality. They run on limited resources and should not be used for intensive workloads.


Edge compute is serverless, but most serverless is not edge.


You are just getting unnecessarily pedantic here. I was talking about computing resource usage. Both are the same when it comes to resource consumption being limited.

It is like saying Oracle/Postgres/MySQL/MSQL. If you say they are different in some X functionality, yeah duh they are different in X functionality. However, they are all SQL databases.

Same way, Edge/Serverless is both running on limited compute resources (which is the point of the article and point I was making). Both differing in functionality X (of latency/closeness to your user) has nothing to do with either the article or my answer.


Am I right in understanding this is just their defect-detection system?


Yes, this is just the defect detector and not the actual video streaming service.


I'll launch a consulting business focused on migrations from microservices to monoliths and from the cloud to in-house. Pricing would be a % of the savings over the first year.


I'm happy to see -- in the discussion here -- the continued backlash against microservices and the deleterious effects they have had on software complexity and data modelling.

But I think it's interesting that if we took a time machine back to 2014 or 2015 the tone here would be quite different, and microservices were all the rage on this forum as I recall.

I like to hope that the industry learns from its failed trends, but I'm now old enough to see this is rarely the case.


These days, when project managers of new products seek my advice as a solutions architect, I tend to suggest they create a minimum viable product that is written modularly so it can scale, but deploy it very simply on a few servers just like we used to 15 years ago.

Scaling is definitely a good thing, microservices make scaling easier, no doubt about that. But an MVP rarely needs k8s level scaling, it just needs to be written well so it can scale in the future.


I've been having lots of thoughts lately about how you build a) a system that can respond to scale, b) at the most affordable price possible, and c) with infrastructure spend scaling with income.

I love the anecdotes about just buying a Hetzner server which can handle a surprising amount.

One of my ideas is a company that maintains an incremental infrastructure that can grow to handle extreme levels of traffic - the infrastructure itself mutates over time.


Breaking things into tiny functions and putting them on many different servers incurs tradeoff costs in both complexity and compute. There is a complexity cost in having to deal with the setup, security, and orchestration of those functions, and a compute cost because if the overall system is running constantly it will be less efficient and therefore more expensive than running on one box.


I agree on the tradeoffs you have to make. The main cost driver here was storage and traffic, though.


Good point. "communication" should also be on the list. I don't think storage is technically the tradeoff in this case even though it's S3. It's the traffic between those components that's costing them.


Rarely will I defend Amazon in anything, but I'll make an exception.

In my experience, AWS/Amazon people do not force you or even direct you to a particular architectural choice. They are relatively indifferent about it.

Instead, trend-driven architectures seem to come from the tech community themselves. It's the customers often making the wrong choice.


Two things:

- When people use the solution -> problem path instead of problem -> proposals -> cost analysis -> solution they get what they deserve.

- It is possible to optimize most infrastructures and code; how much depends on the case, obviously, but I have seen such percentages before

The real question is: why didn't they choose the right stack for their problem to begin with?


AWS Step Functions are bad for so many reasons. Scaling, pricing, developer experience, etc.

It is clearly made by people who don't really understand (or don't care) how distributed workflows work.

And the pricing is prohibitive for running it at scale. In my opinion it should be free to use, provided you glue together other AWS services with it.


Perhaps Amazon reached peak saturation for its video streaming services, so it no longer had to worry about unknown unknowns holding it back from using a more efficient monolithic architecture. Distributing services across multiple machines is certainly more scalable, but all those API calls can add up.


Overengineering at its best. I tend to see microservices as a double-edged sword, and in this case there was no need for them.

Also, the pricing of AWS quickly goes up as you go from EC2 -> Fargate -> Lambda. I don't know why on earth someone would build microservices at the Lambda level.


They basically underestimated the cost of moving millions of small files to and from S3; it kinda makes sense if they want to save those images for a long time, but in this case it was for semi-real-time error detection, which is much faster to do in-memory.
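
Back of the envelope (a sketch; the request prices are the usual published S3 Standard ballpark figures, and the frame rate and detector count are made-up assumptions):

    # rough per-stream S3 request cost for shipping individual frames around
    fps = 30                     # assumed frame rate
    detectors = 3                # assumed number of defect detectors fetching each frame
    put_price = 0.005 / 1000     # roughly the published S3 Standard PUT price per request
    get_price = 0.0004 / 1000    # roughly the published S3 Standard GET price per request

    per_hour = fps * 3600 * (put_price + detectors * get_price)
    print(f"~${per_hour:.2f} per stream-hour in S3 requests alone")  # ~$0.67

Small per stream, and invisible at prototype scale, which is probably exactly how it slipped through.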


Microservices are BS invented by cloud providers to solve problems you don’t have at 10x the cost.

The worst software systems I have ever seen were micro-services. One of them is more than 20 years old. The WTF count per minute is exponential.


It’s expensive to store individual video frames in s3 for no good reason? Go figure…


At this point, isn't the lesson to use a serverless stack for fast iteration, then switch to a custom solution once you know exactly what you want?

I have 0 experience with serverless/cloud. Just a thought.


I think the lesson ought to be that you should start by writing one computer program and running it on one computer.


I wonder if this is, in some way, a kind of signalling of where AWS wants to go – maybe they want to shift more towards dedicated hosting rather than all of these separate services?


You can read more of these sorts of posts at https://www.microservice-stories.com/


So one team at Prime for one specific application learned, that serverless was not the ideal compute model for their workload. Wow.


I think using "Monolith" for what they ended up with is badly chosen. Basically they just made a service less granular (or less micro, if you will).


Feels more like the initial version was a prototype not meant to scale

Wouldn’t have expected prime to be pushing around images on s3


Locality of reference matters.

It's fine to split things up, but we have to be careful how we do it + aware of the overheads.


This is astonishing coming from Amazon.


They could save millions by migrating to Digital Ocean or Hetzner (+Cloud66).


I'm waiting for the AWS Lambda team to talk to Marketing to get this taken down ...


Of all the video streaming services I have used, PrimeVideo is the one where the video/audio sync becomes terrible progressively.

It is pretty bad. It happens in 8 out of 10 movies. There is some misconfiguration in their AV transcoding pipeline.

And here, we have an article talking about Monolith vs. Microservices improving user experience.


Of all the streaming services that have irritated me, I can't recall any serious technical problems with prime. I suppose I have a vague memory of poor AV sync that could have been on prime, it was always a problem at the start of streaming that would work itself out after a few seconds.

Netflix's shiny new compression scheme a couple years ago didn't work on my Sony TV's buggy silicon. The only way I got that fixed was by knowing someone on the inside.

Hulu usually can't make it through an episode without the video freezing at least once. Sometimes it just refuses to work at all until I completely reboot the TV.

HBO Max's UI is just really cheesy and slow, but whatever it's fine.

Paramount+ is my new favorite to hate on. The UI is maddeningly glitchy and lethargic. I pay for no ads, but it plays ads anyway, on Star Trek episodes from 1996. It doesn't remember progress in a show more than once every week or two, just enough to remind you that it's supposed to be a feature. On my phone, it doesn't hide the typical menu overlays unless I do a complex sequence of finger taps. One time I tried to file a bug report from inside the logged-into app, and I got an email back claiming that they would love to consider my concerns but can't because they don't have an account associated with my email address.


For me, the sync is fine at the start of playback on PrimeVideo. It just gets progressively worse (which leads me to believe they used a video framerate that is ever so slightly different from the source and have keyframe insertion after a longer-than-optimal interval; similarly, a sample rate mismatch in the output audio relative to the input audio stream could be a potential cause).

And I use a FireStick, FWIW.

BTW, their own transcoder product MediaConvert seems to have this issue (it is possible that it could be user error too, in how they have used the product or set up the parameters). [1]

My guess is PrimeVideo dogfoods MediaConvert and they also have this issue. They could have fixed it for newer content, but previously transcoded content still has issues (which will remain until they are re-transcoded).

[1]: https://repost.aws/questions/QUGajgu4zKTlewlTg1M96i_Q/questi...?


Never had any issue with bittorrented AWS.


This article is going to keep me employed for some time yet


What! They changed the title. Tells you something


but .. but .. there's no buzz words in this solution. monolith? ew!


“container”


they just coded a step functions monolith...


Guess who happy DHH was reading this


how*


hahahahahahaha


I guess what AWS sells is not servers, but software to manage them automatically, to load balance, to replicate, etc. Once GPT can write such (pretty standard) software for you, which will happen soon, Amazon too will go down.


You're vastly oversimplifying this, imho. It's not just being able to write something and get AI to write terraform for you (it doesn't do it all that well atm in reality, for anything complex). You can't automate the people who you need to convince to make those decisions internally, on the whole, at least :)


Sure, ChatGPT will automate in a short time what tens of thousands of top engineers have built over a decade.


Of course not. It will help millions of small companies to write scripts so they won't need AWS anymore.


I wouldn't call it a monolith as the number of instances could be scaled up. Mono implies single instance. They just combined multiple microservices into a larger one.


I am not sure if you’re joking.


I don't see what's funny about my statement. Please elaborate on your definition of monoliths vs. scalable microservices.


To most people, "mono" refers to a single codebase, not a single deployed instance. I've worked on many monoliths that run multiple instances in production.

Microservices are no more or less scalable than a monolith. The main benefit of Microservices is allowing multiple teams to work independently from each other without everyone "stepping on each others toes". You can have scalable monoliths and unscalable microservices.


> Microservices are no more or less scalable than a monolith.

This is not fully true. A microservice architecture is more finely scalable than a monolith.

To take a very basic example, if you have a peak of users watching a video you can scale up the microservice dedicated to serving videos, but not scale up the service dedicated to users signups, which isn't having an increased load.


> you can scale up the microservice dedicated to serving videos, but not scale up the service dedicated to users signups, which isn't having an increased load.

No, splitting a codebase does not magically make it more scalable in production. You still have to prove that the authentication component would create significant unnecessary load if it was scaled up together with the video service.


> A microservice architecture is more finely scalable than a monolith.

Apologies, but I strongly disagree and I'm going to go on a bit of a rant here....

This is a myth, and one of the reasons people are making these ridiculous architecture decisions. If you have a monolith that serves videos and enables signups, you can deploy as many instances of it as you like based on the highest need. It doesn't matter if user signups are a fraction of video watches; it just means that your user signup endpoint is not getting called as much. Maybe you're deploying a larger codebase than you need to, but that's hardly a downside.

In your example, let's say we have 2 endpoints that are behind a gateway or L7 LB so that we can point them at different codebases if we like:

- videoservice.com/signup

- videoservice.com/watch

If I'm getting 100k rps to /watch, and 100 rps to /signup, I can just deploy loads of instances of my monolith behind the /watch endpoint. Maybe that monolith contains code for /signup, but it's not going to get called. So what.

I've seen this approach used in many places. You don't need to split the code to do this at all. Sure it might feel "cleaner" to you to do this, but it's not needed.

Now, you may get to a point where your deployment is really heavy and time consuming and you don't want to deploy everything just to scale up /watch - but again I'd argue that is not really anything to do with scalability, it's about being able to deploy things independently. Using a microservice doesn't make your service more scalable here, but it might make it easier to deploy.

Microservices are nothing to do with scalability. They are about how you organise code and teams to achieve better development velocity.


> Microservices are nothing to do with scalability. They are about how you organise code and teams to achieve better development velocity.

I don't think this is strictly true; even though microservices are usually used that way.

Scaling up everything even when not needed has its difficulties. You can have lots of unnecessary initialisation tasks, lots of unused caches warmed up, database and socket connections that are not needed, complexities in work-sharing algorithms, etc.


Thank you for clarifying. It seems my definition doesn't correspond to others'. Then do we lack a word for a monolithic application that handles everything and can only have one instance running?


"Singleton" sounds like a good word here (though I've never heard it used in that context)


I've always heard it as "single instance."


We could call it a polylith.


It’s also not really serverless to begin with, because at the end of the day code is being executed on a physical device that many of us might call a “server”



