> rust-query manages migrations and reads the schema from the database to check that it matches what was defined in the application. If at any point the database schema doesn't match the expected schema, then rust-query will panic with an error message explaining the difference (currently this error is not very pretty).

IMO - this sounds like "tell me you've never operated a real production system before without telling me you've never operated a real production system before."

Shit happens in real life. Even if you have a great deployment pipeline, at some point you'll need to add a missing index in production fast because a wave of users came in and revealed a shit query. Or your on-call DBA will need to modify a table over the weekend from i32 -> i64 because you ran out of primary key values, and you can't spend the time updating all your code. (In Rust this is dicier, of course, but with something like Python it shouldn't cause issues in general.) Or you'll just need to run some operation out of band -- that is, not relying on a migration -- because that's what makes sense. A great example is using something like pt-osc[0] to create a temporary table copy and add temporary triggers to an existing table in order to do a zero-downtime copy.
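To be concrete, these are the kinds of out-of-band statements I mean (MySQL syntax, made-up table and column names) -- and exactly the kind of change a strict "schema must match the code" check would then panic on:

    -- emergency index because a traffic spike exposed a table-scanning query
    CREATE INDEX idx_orders_user_created ON orders (user_id, created_at);

    -- widen an exhausted INT primary key to BIGINT (the i32 -> i64 case);
    -- on a huge table you'd do this via pt-osc rather than a blocking ALTER
    ALTER TABLE orders MODIFY COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;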

Or maybe you just need to drop and recreate an index because it got corrupted. Shit happens!

Anyway, I really wouldn't recommend a design that relies on your database always agreeing with your codebase 100% of the time. What you should strive for is your codebase being compatible with the database 100% of the time -- that means new columns get added with a default value (or NULL) so inserts work, you don't drop or rename columns or tables without a strict deprecation process (i.e. a rename is really add in db -> add writes to code -> backfill values in db -> remove from code -> remove from db), etc...
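To sketch the rename dance with made-up names (the code deploys happen between the SQL steps):

    -- 1. add the new column, nullable, so existing inserts keep working
    ALTER TABLE users ADD COLUMN display_name TEXT NULL;
    -- 2. deploy code that writes both user_name and display_name
    -- 3. backfill the old rows
    UPDATE users SET display_name = user_name WHERE display_name IS NULL;
    -- 4. deploy code that reads and writes only display_name
    -- 5. only then drop the old column
    ALTER TABLE users DROP COLUMN user_name;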

But fundamentally panicking because a table has an extra column is crazy. How else would you add a column to a running production system?

[0] https://docs.percona.com/percona-toolkit/pt-online-schema-ch...




It's a bummer that you've been downvoted, because it really does seem like people here have not operated databases at scale.

I will never claim that we were great at managing databases at Twilio, but often a schema change would take hours, days, or even a week or two to complete. We're talking about tables with hundreds of millions of rows, or more.

We'd start the change on a DB replica. When it would finish, we would have to wait for the replica to catch up with the primary. Then we would bring up new replicas, replicating from the replica with the new schema. Finally that replica would get promoted to primary, with all the old replicas (and the old primary, of course) removed from service, and the new replicas brought in.

Only then could we deploy code that was aware of and used the updated schema. The previous code of course had to ignore unknown columns, and if we ever wanted to drop a column, we had to first deploy code that would stop using that column. Any column type changes would need to be backwards-compatible. If that wasn't possible, we'd have to add a new column and backfill it. Adding indexes would usually be fine without preparatory code changes, but if we wanted to drop an index we'd first have to make sure there were no queries still depending on it.
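A backfill on a table that size also can't be one giant UPDATE; it has to be chunked so replication lag and lock time stay bounded. Roughly (made-up names, purely illustrative):

    -- run repeatedly over id ranges, pausing between batches
    UPDATE messages
       SET account_id_new = account_id
     WHERE id BETWEEN 1 AND 10000
       AND account_id_new IS NULL;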

Even for a "small" schema change that "only" took minutes or a few tens of seconds to complete, we'd still have to use this process. What, do you think we'd shut part or all of a real-time communications platform down while we do a schema change? Of course not.

The idea that the application could or should be in control of this process, or could always be in sync with the database when it came to its understanding of the schema, is impossibly unrealistic.


Yep, sounds like we have similar experiences! I first had to start thinking about this stuff at Hootsuite, back in the exciting 1million+ DAU days a decade ago. Before then, to me databases were just a thing that got deployed along with the application, and deploys only happened on a Friday night so who cares about downtime? By the time anyone tries logging into the app on Monday morning, the code and database will all be up to date. Going to a place where deploys were happening constantly and nonzero downtime was unacceptable was eye-opening.

> The idea that the application could or should be in control of this process, or could always be in sync with the database when it came to its understanding of the schema, is impossibly unrealistic.

These days my attitude is to treat databases as a completely separate service from the application code, which they effectively are. They're on a different set of servers, and the interface they provide is the columns/tables/views/etc, accessed through SQL. So yeah, no breaking changes, and the only thing application code should care about is if the queries it tries to execute return the expected sets of data, not if the schema itself matches. And certainly not about things like views, triggers or indexes.

This does end up being more overhead than migrations alongside the application code, which I know a lot of developers prefer because they're easier to use, but the approach just doesn't work after a certain scale.

(to be clear, I still use Liquibase etc to manage migrations, the process for applying those changes is just completely separate from deploying application code.)


> These days my attitude is to treat databases as a completely separate service from the application code, which they effectively are. They're on a different set of servers, and the interface they provide is the columns/tables/views/etc, accessed through SQL.

I've never thought of it this way, but I think this is really smart. If I have a service that exposes a REST API, I can, say, add a new field to a JSON object that's returned from an API endpoint without telling clients about it. Those clients can update later in order to take advantage of the information returned in the new field.

Same thing with a database: I can add a new column, and clients can learn about the new column later in the future, no problem. The database schema is just a part of the database's API, and it can be evolved in a backwards-compatible manner just like any other API.
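Concretely, the database equivalent of adding that JSON field is something like this (made-up names) -- old clients keep issuing the same queries and never notice:

    -- new, nullable column; nothing existing has to change
    ALTER TABLE accounts ADD COLUMN preferred_locale VARCHAR(16) NULL DEFAULT NULL;

    -- an old client's query is completely unaffected
    SELECT id, email FROM accounts WHERE id = 42;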

> to be clear, I still use Liquibase etc to manage migrations, the process for applying those changes is just completely separate from deploying application code.

Right, the schema needs to be managed and there needs to be a source of truth for it, with tooling to do migrations, but coupling that so closely with the application so the schema and application always must be in sync (like some others seem to think is the One True Way) is a mistake, and would be a complete non-starter for my past professional needs.


IMO it's a similar situation to the discussion here:

https://capnproto.org/faq.html#how-do-i-make-a-field-require...


I had to realize that, at least in the startup world, most (non-db-focused) devs think that while they might not be experts in SQL, they have a very solid understanding

... and then it turns out they don't know about a lot of very fundamental, important parts, and are blissfully unaware of that, too.

And to be clear I'm not saying they don't remember the exact details of something.

What I mean is that they don't even know there are things they have to look up, and they have no experience with or willingness to understand what they did wrong by consulting the official documentation instead of just randomly googling and trying out "solutions" until one seems to happen to work.

The most common example would be having so little understanding of transactions that they believe transactions just magically fix all race conditions, and then being very surprised when they don't. Or believing that transactions in SQL are fundamentally broken after realizing that their database somehow got corrupted.
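The textbook case is a read-modify-write race: wrapping it in a transaction at the default isolation level does not make the lost update go away. A rough sketch (made-up names):

    -- two clients run this concurrently; both read balance = 100,
    -- both write 90, and one withdrawal is silently lost
    BEGIN;
    SELECT balance FROM accounts WHERE id = 1;   -- app computes 100 - 10
    UPDATE accounts SET balance = 90 WHERE id = 1;
    COMMIT;

    -- one fix: take a row lock before reading
    BEGIN;
    SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
    UPDATE accounts SET balance = balance - 10 WHERE id = 1;
    COMMIT;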

And again, I don't mean junior devs, but people with 10+ years of backend or "fullstack" experience, i.e. people who should at least know when to consult the documentation and look up which protections transactions actually provide.

I have seen, more than once, the (final state of a) situation where people started out believing SQL transactions magically fix everything, then got "corrupted" data, then blamed SQL for being broken and moved to NoSQL.

The joke here is that all the concurrency problems are fundamental and independent of SQL vs. NoSQL.

SQL often gives you more powerful, easier-to-use (at small scale) tools to enforce synchronization, but at a cost. NoSQL often gives you harder-to-use primitives where you have to do much more outside of the database to guarantee correctness, but then at least you'll more likely blame your code instead of the db when things don't work.

The most ironic thing here is that I'm not a db expert; I just know where my knowledge stops and where I can look up the missing parts, and I can't even give you many tips about huge dbs in production. Luckily, surprisingly many companies have comparatively "small" db needs.

And honestly, while I see race-condition-related issues in SQL code quite often, I rarely don't see them in NoSQL code. Where these issues in SQL make me sad because they're usually easily avoidable, in NoSQL I often feel like giving up in resignation.

Granted, that experience is with "smallish" databases, not Twilio scale. But a surprisingly large number of companies have surprisingly "smallish" databases. Like, no joke, I have seen companies be very vocal about their "huge database" and then you realize it's like 5GiB ;=)

Honestly, I (metaphorically speaking) don't even want to know how db experts feel about this. I'm not a db expert and just have a solid enough foundation to know where my knowledge stops and when I have to look things up (which is all the time, because I'm not writing that much SQL).


> Even if you have a great deployment pipeline, at some point, you'll need to add a missing index in production fast because a wave of users came in and revealed a shit query.

This sounds more like a CI/CD and process issue.

There is no reason why adding a new index in code and deploying it into Production should be more complex or error prone than modifying it on the database itself.


Direct execution of `CREATE INDEX...` on a database table is always going to be faster than going through a normal deployment pipeline. Even if we assume your pipeline is really fast, which is probably not the case at most orgs, you are still comparing a single SQL statement execution, to a single SQL statement execution + git push + code reviews + merge + running through Jenkins/Circle/whatever. How long does that overhead take? How much money have you lost because your website won't load when your post is on the frontpage of HN? Seconds and minutes count. I don't want my code crashing because an unexpected index exists in this scenario.
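And when seconds count you also want the form that doesn't block writes while it builds -- on Postgres, for example, something like:

    -- built without holding a long write lock on the table
    -- (note: CREATE INDEX CONCURRENTLY can't run inside a transaction block)
    CREATE INDEX CONCURRENTLY idx_posts_created_at ON posts (created_at);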


You should be able to deploy end to end to Production in less than a minute.

Companies should be focused on solving that problem first before doing insanely short-sighted workarounds like skipping pushing to Git and code reviews.


> You should be able to deploy end to end to Production in less than a minute.

When I was at AWS (RDS), our end-to-end production deployment process was 7 days. We were also pulling $25 million/day or so in profit. I'm sure that number is much higher now.

There's a large difference between what the theoretical "right" thing is from a textbook perspective, and what successful engineering teams do in reality.

edit: besides, it doesn't even make sense in this context. I have 100 servers talking to the database. I need to create an index, ok, add it to the code. Deploy to server 1. Server 1 adds the index as part of the migration process, and let's say it's instant-ish (not realistic but whatever). Do the other 99 servers now panic because there's an unexpected index on the table?


I don't think I have ever seen a non-toy project where that was the case.


That's a lovely ideal, but in the real world there are relatively few companies that meet that metric.


I've worked at FAANG and enterprise companies and we managed to do it.

There are no technical reasons why it can't be done. Only process and will.


Yes, and that's exactly the point. The reality doesn't usually match the ideals, and many orgs do not have good process, and do not have the political will to get good process implemented. Part of being a professional is recognizing where reality falls short of the ideals (an all-too-common occurrence), and doing the best you can to successfully get your work done in that environment.

And of course I don't know which FAANGs you worked at, but I know folks at FAANGs who have complained to me about CI and deployment times. Hell, these are huge companies; while they try to harmonize tooling, deployment times (especially when test suites of varying quality are involved) can vary a lot across a company. I wouldn't be surprised if there were people at the companies you worked at that were upset with deployment times, even if the teams you worked on were in good shape.

Honestly, when someone suggests something like you've suggested (that everyone should be able to get their deployment times to under a minute), I really do wonder if they're intentionally arguing in bad faith or are trolling. I know for a fact that things are not that rosy, and are rarely that rosy, even at the companies you claim to have worked at, and it's hard to believe that anyone could genuinely think that this is a broadly-attainable target. That doesn't mean that no one can do it, but that does mean that designing tooling that assumes everyone can do it is... well, just kinda naive and not very useful.


You have two choices: (1) try and solve your deployment issues or (2) make unmanaged, untested, unreviewed changes directly in Production.

Now you may say I'm just trolling but option (1) seems better to me for the long-term health of the project/company. And I don't believe it's correct to say it is an unrealistic goal.


> I've worked at FAANG and enterprise companies and we managed to do it.

You have a very different experience from the rest of us, in that case.

The big AWS services all had deployments measured in days - or even weeks (depending on how many regions they are deployed across). Facebook's monorepo took upwards of an hour just to get a PR through the merge queue. Both were notorious for "hand-jamming" critical fixes directly to production.


There are lots of reasons to do slow rollouts. You should be rolling out in stages anyway.


You do code review in less than a minute?


You do code reviews/automated testing in lower environments before you make changes directly to Production.

And in this case if it's an emergency hot fix then it's still better to do this through a managed, tracked, tested pipeline.


This is the correct answer.


IMO "parse, don't validate" can apply to data coming out of the database too.



