> rust-query manages migrations and reads the schema from the database to check that it matches what was defined in the application. If at any point the database schema doesn't match the expected schema, then rust-query will panic with an error message explaining the difference (currently this error is not very pretty).

IMO - this sounds like "tell me you've never operated a real production system before without telling me you've never operated a real production system before."

Shit happens in real life. Even if you have a great deployment pipeline, at some point you'll need to add a missing index in production fast because a wave of users came in and revealed a shit query. Or your on-call DBA will need to modify a table over the weekend from i32 -> i64 because you ran out of primary key values, and you can't spend the time updating all your code. (In Rust this is dicier, of course, but with something like Python it shouldn't cause issues in general.) Or you'll just need to run some operation out of band -- that is, not relying on a migration -- because that's what makes sense. A great example is using something like pt-osc[0] to create a temporary table copy and add temporary triggers to an existing table in order to do a zero-downtime copy.
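To be concrete, these are the kinds of out-of-band statements I mean (MySQL syntax, made-up table and column names) -- and exactly the kind of change a strict "schema must match the code" check would then panic on:

    -- emergency index because a traffic spike exposed a table-scanning query
    CREATE INDEX idx_orders_user_created ON orders (user_id, created_at);

    -- widen an exhausted INT primary key to BIGINT (the i32 -> i64 case);
    -- on a huge table you'd do this via pt-osc rather than a blocking ALTER
    ALTER TABLE orders MODIFY COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;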

Or maybe you just need to drop and recreate an index because it got corrupted. Shit happens!

Anyway, I really wouldn't recommend a design that relies on your database always agreeing with your codebase 100% of the time. What you should strive for is your codebase being compatible with the database 100% of the time -- that means new columns get added with a default value (or NULL) so inserts work, you don't drop or rename columns or tables without a strict deprecation process (i.e. a rename is really add in db -> add writes to code -> backfill values in db -> remove from code -> remove from db), etc...
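To sketch the rename dance with made-up names (the code deploys happen between the SQL steps):

    -- 1. add the new column, nullable, so existing inserts keep working
    ALTER TABLE users ADD COLUMN display_name TEXT NULL;
    -- 2. deploy code that writes both user_name and display_name
    -- 3. backfill the old rows
    UPDATE users SET display_name = user_name WHERE display_name IS NULL;
    -- 4. deploy code that reads and writes only display_name
    -- 5. only then drop the old column
    ALTER TABLE users DROP COLUMN user_name;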

But fundamentally panicking because a table has an extra column is crazy. How else would you add a column to a running production system?

[0] https://docs.percona.com/percona-toolkit/pt-online-schema-ch...




It's a bummer that you've been downvoted, because it really does seem like people here have not operated databases at scale.

I will never claim that we were great at managing databases at Twilio, but often a schema change would take hours, days, or even a week or two to complete. We're talking about tables with hundreds of millions of rows, or more.

We'd start the change on a DB replica. When it would finish, we would have to wait for the replica to catch up with the primary. Then we would bring up new replicas, replicating from the replica with the new schema. Finally that replica would get promoted to primary, with all the old replicas (and the old primary, of course) removed from service, and the new replicas brought in.

Only then could we deploy code that was aware of and used the updated schema. The previous code of course had to ignore unknown columns, and if we ever wanted to drop a column, we had to first deploy code that would stop using that column. Any column type changes would need to be backwards-compatible. If that wasn't possible, we'd have to add a new column and backfill it. Adding indexes would usually be fine without preparatory code changes, but if we wanted to drop an index we'd first have to make sure there were no queries still depending on it.
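A backfill on a table that size also can't be one giant UPDATE; it has to be chunked so replication lag and lock time stay bounded. Roughly (made-up names, purely illustrative):

    -- run repeatedly over id ranges, pausing between batches
    UPDATE messages
       SET account_id_new = account_id
     WHERE id BETWEEN 1 AND 10000
       AND account_id_new IS NULL;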

Even for a "small" schema change that "only" took minutes or a few tens of seconds to complete, we'd still have to use this process. What, do you think we'd shut part or all of a real-time communications platform down while we do a schema change? Of course not.

The idea that the application could or should be in control of this process, or could always be in sync with the database when it came to its understanding of the schema, is impossibly unrealistic.


Yep, sounds like we have similar experiences! I first had to start thinking about this stuff at Hootsuite, back in the exciting 1million+ DAU days a decade ago. Before then, to me databases were just a thing that got deployed along with the application, and deploys only happened on a Friday night so who cares about downtime? By the time anyone tries logging into the app on Monday morning, the code and database will all be up to date. Going to a place where deploys were happening constantly and nonzero downtime was unacceptable was eye-opening.

> The idea that the application could or should be in control of this process, or could always be in sync with the database when it came to its understanding of the schema, is impossibly unrealistic.

These days my attitude is to treat databases as a completely separate service from the application code, which they effectively are. They're on a different set of servers, and the interface they provide is the columns/tables/views/etc, accessed through SQL. So yeah, no breaking changes, and the only thing application code should care about is if the queries it tries to execute return the expected sets of data, not if the schema itself matches. And certainly not about things like views, triggers or indexes.

This does end up being more overhead than migrations alongside the application code, which I know a lot of developers prefer because they're easier to use, but the approach just doesn't work after a certain scale.

(to be clear, I still use Liquibase etc to manage migrations, the process for applying those changes is just completely separate from deploying application code.)


> These days my attitude is to treat databases as a completely separate service from the application code, which they effectively are. They're on a different set of servers, and the interface they provide is the columns/tables/views/etc, accessed through SQL.

I've never thought of it this way, but I think this is really smart. If I have a service that exposes a REST API, I can, say, add a new field to a JSON object that's returned from an API endpoint without telling clients about it. Those clients can update later in order to take advantage of the information returned in the new field.

Same thing with a database: I can add a new column, and clients can learn about the new column later in the future, no problem. The database schema is just a part of the database's API, and it can be evolved in a backwards-compatible manner just like any other API.
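Concretely, the database equivalent of adding that JSON field is something like this (made-up names) -- old clients keep issuing the same queries and never notice:

    -- new, nullable column; nothing existing has to change
    ALTER TABLE accounts ADD COLUMN preferred_locale VARCHAR(16) NULL DEFAULT NULL;

    -- an old client's query is completely unaffected
    SELECT id, email FROM accounts WHERE id = 42;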

> to be clear, I still use Liquibase etc to manage migrations, the process for applying those changes is just completely separate from deploying application code.

Right, the schema needs to be managed and there needs to be a source of truth for it, with tooling to do migrations, but coupling that so closely with the application so the schema and application always must be in sync (like some others seem to think is the One True Way) is a mistake, and would be a complete non-starter for my past professional needs.


IMO it's a similar situation to the discussion here:

https://capnproto.org/faq.html#how-do-i-make-a-field-require...


I had to realize that, at least in the startup world, most (non-db-focused) devs think that while they might not be experts in SQL, they have a very solid understanding

... and then it turns out they don't know about a lot of very fundamental, important parts, and are blissfully unaware of that, too.

And to be clear I'm not saying they don't remember the exact details of something.

What I mean is that they don't even know there are things they have to look up, and they have no experience with or willingness to understand what they did wrong by consulting the official documentation instead of just randomly googling and trying out "solutions" until one seems to happen to work.

The most common example would be having so little understanding of transactions that they believe transactions just magically fix all race conditions, and then being very surprised when they don't. Or believing that transactions in SQL are fundamentally broken after realizing that their database somehow got corrupted.
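The textbook case is a read-modify-write race: wrapping it in a transaction at the default isolation level does not make the lost update go away. A rough sketch (made-up names):

    -- two clients run this concurrently; both read balance = 100,
    -- both write 90, and one withdrawal is silently lost
    BEGIN;
    SELECT balance FROM accounts WHERE id = 1;   -- app computes 100 - 10
    UPDATE accounts SET balance = 90 WHERE id = 1;
    COMMIT;

    -- one fix: take a row lock before reading
    BEGIN;
    SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
    UPDATE accounts SET balance = balance - 10 WHERE id = 1;
    COMMIT;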

And again, I don't mean junior devs, but people with 10+ years of backend or "fullstack" experience, i.e. people who should at least know when to consult the documentation and look up which protections transactions actually provide.

I have seen, more than once, the (final state of a) situation where people started out believing SQL transactions magically fix everything, then got "corrupted" data, then blamed SQL for being broken and moved to NoSQL.

The joke here is that all the concurrency problems are fundamental and independent of SQL vs. NoSQL.

SQL often gives you more powerful, easier-to-use (at small scale) tools to enforce synchronization, but at a cost. NoSQL often gives you harder-to-use primitives where you have to do much more outside of the database to guarantee correctness, but then at least you'll more likely blame your code instead of the db when things don't work.

The most ironic thing here is that I'm not a db expert; I just know where my knowledge stops and where I can look up the missing parts, and I can't even give you many tips about huge dbs in production. Luckily, surprisingly many companies have comparatively "small" db needs.

And honestly, while I see race-condition-related issues in SQL code quite often, I rarely don't see them in NoSQL code. Where these issues in SQL make me sad because they're usually easily avoidable, in NoSQL I often feel like giving up in resignation.

Granted, that experience is with "smallish" databases, not Twilio scale. But a surprisingly large number of companies have surprisingly "smallish" databases. Like, no joke, I have seen companies be very vocal about their "huge database" and then you realize it's like 5GiB ;=)

Honestly, I (metaphorically speaking) don't even want to know how db experts feel about this. I'm not a db expert and just have a solid enough foundation to know where my knowledge stops and when I have to look things up (which is all the time, because I'm not writing that much SQL).


> Even if you have a great deployment pipeline, at some point, you'll need to add a missing index in production fast because a wave of users came in and revealed a shit query.

This sounds more like a CI/CD and process issue.

There is no reason why adding a new index in code and deploying it into Production should be more complex or error prone than modifying it on the database itself.


Direct execution of `CREATE INDEX...` on a database table is always going to be faster than going through a normal deployment pipeline. Even if we assume your pipeline is really fast, which is probably not the case at most orgs, you are still comparing a single SQL statement execution, to a single SQL statement execution + git push + code reviews + merge + running through Jenkins/Circle/whatever. How long does that overhead take? How much money have you lost because your website won't load when your post is on the frontpage of HN? Seconds and minutes count. I don't want my code crashing because an unexpected index exists in this scenario.
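And when seconds count you also want the form that doesn't block writes while it builds -- on Postgres, for example, something like:

    -- built without holding a long write lock on the table
    -- (note: CREATE INDEX CONCURRENTLY can't run inside a transaction block)
    CREATE INDEX CONCURRENTLY idx_posts_created_at ON posts (created_at);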


You should be able to deploy end to end to Production in less than a minute.

Companies should be focused on solving that problem first before doing insanely short-sighted workarounds like skipping pushing to Git and code reviews.


> You should be able to deploy end to end to Production in less than a minute.

When I was at AWS (RDS), our end-to-end production deployment process was 7 days. We were also pulling $25 million/day or so in profit. I'm sure that number is much higher now.

There's a large difference between what the theoretical "right" thing is from a textbook perspective, and what successful engineering teams do in reality.

edit: besides, it doesn't even make sense in this context. I have 100 servers talking to the database. I need to create an index, ok, add it to the code. Deploy to server 1. Server 1 adds the index as part of the migration process, and let's say it's instant-ish (not realistic but whatever). Do the other 99 servers now panic because there's an unexpected index on the table?


I don't think I have ever seen a non-toy project where that was the case.


That's a lovely ideal, but in the real world there are relatively few companies that meet that metric.


I've worked at FAANG and enterprise companies and we managed to do it.

There are no technical reasons why it can't be done. Only process and will.


Yes, and that's exactly the point. The reality doesn't usually match the ideals, and many orgs do not have good process, and do not have the political will to get good process implemented. Part of being a professional is recognizing where reality falls short of the ideals (an all-too-common occurrence), and doing the best you can to successfully get your work done in that environment.

And of course I don't know which FAANGs you worked at, but I know folks at FAANGs who have complained to me about CI and deployment times. Hell, these are huge companies; while they try to harmonize tooling, deployment times (especially when test suites of varying quality are involved) can vary a lot across a company. I wouldn't be surprised if there were people at the companies you worked at that were upset with deployment times, even if the teams you worked on were in good shape.

Honestly, when someone suggests something like you've suggested (that everyone should be able to get their deployment times to under a minute), I really do wonder if they're intentionally arguing in bad faith or are trolling. I know for a fact that things are not that rosy, and are rarely that rosy, even at the companies you claim to have worked at, and it's hard to believe that anyone could genuinely think that this is a broadly-attainable target. That doesn't mean that no one can do it, but that does mean that designing tooling that assumes everyone can do it is... well, just kinda naive and not very useful.


You have two choices: (1) try and solve your deployment issues or (2) make unmanaged, untested, unreviewed changes directly in Production.

Now you may say I'm just trolling but option (1) seems better to me for the long-term health of the project/company. And I don't believe it's correct to say it is an unrealistic goal.


> I've worked at FAANG and enterprise companies and we managed to do it.

You have a very different experience from the rest of us, in that case.

The big AWS services all had deployments measured in days - or even weeks (depending on how many regions they are deployed across). Facebook's monorepo took upwards of an hour just to get a PR through the merge queue. Both were notorious for "hand-jamming" critical fixes directly to production.


There are lots of reasons to do slow rollouts. You should be rolling out in stages anyway.


You do code review in less than a minute?


You do code reviews/automated testing in lower environments before you make changes directly to Production.

And in this case if it's an emergency hot fix then it's still better to do this through a managed, tracked, tested pipeline.


This is the correct answer.


IMO "parse, don't validate" can apply to data coming out of the database too.



