Fully agree. At that level, the justification for using MongoDB usually boils down to not wanting to deal with table schemas or SQL. In both cases, there are better alternatives.
In most cases you're not even removing the concept of a schema, you're just moving it from your database into your application. Which is in a sense worse, especially if multiple applications need to access the same database.
As for SQL, I have yet to see something that solves the problem of querying a database in a simpler manner.
I’ve heard stuff like this from supposedly senior people - “if we use mongo we can just store anything, we don’t need to think about a schema” (also “you can only store short strings in an SQL database”)
But typically, if you do spend some time thinking about the schema and relationships, you only have to spend a small amount of effort to map that into the appropriate place, and then you usually don't have to mess with it again. It's worth the investment.
I don’t. How are new grads supposed to learn the ups and downs of different choices they make? Just being told they’re led astray in a blog post isn’t gonna work - it’ll backfire.
I used node as a new grad for things it wasn’t meant for and that’s how I learned what it is good at and what it isn’t.
Maybe it's just me, but I think it's purely rational?
A lot of these OSS projects provided a commercial offering in the form of SaaS. Yet AWS/GCP/Azure can just take the OSS project, not contribute anything, and reap all the profit.
AFAICS, these licenses are only intended to defend against the cloud providers, not against companies just using the product commercially and internally.
Amazon, Google, and Microsoft can do anything, snuffing out some mid company is the least of their crimes.
> not contribute anything, and reap all the profit
Yeah, it's called capitalism.
What's their stance on Lina Khan breaking up the monopolies? Have they written letters criticising Reid Hoffman for pressuring Kamala?
I'll give you one counter-argument, though: maybe it's hard to appreciate all of the problems a traditional SQL RDBMS solves until you try and solve them without an RDBMS... and crash and burn badly.
But, if they're actually building a real product with real funding money and they only know MongoDB... yeah, it's intervention time.
They haven't though. What's wrong with using a tool even if it might be bad? Especially as a fresh user. It's how we learn. From both good and bad experiences.
> They need help.
Sadly it's not the fresh grad who needs help, but the "experienced" who only keep to their old experiences. Is this comment from 2010? MongoDB has improved. Maybe not to the point of being the best, but definitely not unusable.
> Especially as a fresh user. It's how we learn. From both good and bad experiences.
I'd do that, but a superior strategy is letting other people make mistakes and then learning from them. It is best to always be making choices that seem like they could be optimal, with very rare exceptions.
I agree. But unless you are at least tangentially involved, you are likely not to hear such "gone bad" stories.
First, people are not nearly as self-reflective as you'd hope, or willing to admit failures at all. And second, a bad tech decision might not be directly observable (you will then often see people fighting symptoms rather than changing course and identifying the root cause).
> If this was a superior strategy that was so obvious no 1 would be making mistakes so how does this work?
Either people aren't aware of an optimal strategy (if one exists) or they ignore it for various reasons.
The latter is surprisingly common. People know they should exercise, get enough sleep, eat healthy, stay hydrated, tackle high priority tasks instead of procrastinating, etc. - and yet they still aren't doing those things (or as much as they should).
> Either people aren't aware of an optimal strategy (if one exists) or they ignore it for various reasons.
And my point was precisely that it often doesn't exist. What is an optimal strategy?
> People know they should exercise, get enough sleep, eat healthy, stay hydrated, tackle high priority tasks instead of procrastinating, etc. - and yet they still aren't doing those things (or as much as they should).
> Knowing something is not enough.
Your example doesn't even support this. People know they have to get enough sleep but they also know they have to <insert something else>. It depends on what they are optimizing for. i.e. there is no singular optimal strategy.
You've just proved my point rather than yours.
I don't know if knowing is not enough, but clearly the people in your example don't know what the optimal strategy is. E.g. eating healthy is NOT the optimal strategy if it makes them unhappy (e.g. they don't like the taste). It's not optimal unless it's strictly better.
> If this was a superior strategy that was so obvious no 1 would be making mistakes so how does this work?
I agree with duckmysick, and also please take note that having a strategy of not making mistakes will not avoid all mistakes. Outcomes and intent never match up perfectly. But that is why it is important to learn from others right from the start.
MongoDB salesdroids rely heavily on you having a low familiarity with other database tech to spin themselves as the only game in town. "Being led astray" isn't a passive, ambient occurrence, and it makes sense to push back against it.
On the other hand I know of multiple databases where all tables had attribute_1, attribute_2,..., attribute_5 columns Just in Case™
But more seriously the one feature I like in MongoDB is the pipeline API, where you can express a complex query with multiple filters/aggregations/transformations/joins as a list of simple steps.
There are some use cases where it is very ergonomic (even if I suspect that mongo can easily stop using indexes along the steps, so performance might not be super intuitive).
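For a concrete picture, here is a minimal sketch of such a pipeline using pymongo; the connection string, database, collection and field names are all made up for illustration:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder URI
    orders = client["shop"]["orders"]                  # hypothetical collection

    # Each stage consumes the output of the previous one: filter, join, group, sort.
    pipeline = [
        {"$match": {"status": "paid"}},                              # filter
        {"$lookup": {"from": "customers", "localField": "customer_id",
                     "foreignField": "_id", "as": "customer"}},      # join-like step
        {"$unwind": "$customer"},                                    # flatten the joined array
        {"$group": {"_id": "$customer.country",
                    "revenue": {"$sum": "$total"}}},                 # aggregate
        {"$sort": {"revenue": -1}},                                  # order the result
    ]
    for row in orders.aggregate(pipeline):
        print(row)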
I forgot about what I hate about mongo, so I will rant here about it: its data model is close enough but different enough from json to be both annoying and dangerous.
The empty string is a valid json key but not a mongo document key.
Mongo uses $operator keys to serialize its datatypes to json but does not sanitise the result: which means that {"foo":100000000000000000} and {"foo":{"$numberLong":"100000000000000000"}} will collide in the json export format. (Even if you choose the fully explicit Canonical format, as there is no $document operator to wrap ambiguous documents.)
So if your plan is to dump json into mongo you should plan for that (also, sometimes $operators are evaluated and sometimes they are not; it depends on the method, and the documentation does not tell you).
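To make the collision concrete, here is a small sketch using pymongo's bson.json_util helpers (the field name and value are arbitrary; this assumes a reasonably recent pymongo):

    from bson.int64 import Int64
    from bson.json_util import dumps, loads, CANONICAL_JSON_OPTIONS

    # A document holding an actual 64-bit integer...
    real_long = {"foo": Int64(100000000000000000)}
    # ...and a document whose value literally *is* the Extended JSON wrapper.
    looks_like_long = {"foo": {"$numberLong": "100000000000000000"}}

    # Canonical export of the first document produces the wrapper form:
    print(dumps(real_long, json_options=CANONICAL_JSON_OPTIONS))
    # {"foo": {"$numberLong": "100000000000000000"}}

    # Round-tripping the second document through JSON collapses it into the first;
    # there is no escaping for keys that happen to look like $-operators.
    print(loads(dumps(looks_like_long)) == real_long)  # True -> the distinction is lost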
The official client (both csv and json) is unable to export a collection if a field is sometimes an atomic value and sometimes an object, so a collection with two documents:
{a:1} and {a:{b:1}} will cause problems if you try to export it.
My colleagues have other issues with the json DSL and how most operators exist in 2-3 different forms with different syntax or how the syntax {$operator:{arg1:..., arg2:...}} is unintuitive but I actually sort of like it.
> You know exactly what your app needs to do, up-front
No one does. MongoDB still fits perfectly.
> You know exactly what your access patterns will be, up-front
This one also no one knows when they start. We successfully scaled MongoDB from a few users a day to millions of queries an hour.
> You have a known need to scale to really large sizes of data
This is exactly a great point. When data size goes to a billion rows, Postgres is tough. MongoDB just works without issue.
> You are okay giving up some level of consistency
This has been said about MongoDB for ages. Today, it provides very good consistency.
> This is because this sort of database is basically a giant distributed hash map.
Putting MongoDB in the same category as Dynamo is a big mistake. It's NOT a giant distributed hash map.
> Arbitrary questions like "How many users signed up in the last month" can be trivially answered by writing a SQL query, perhaps on a read-replica if you are worried about running an expensive query on the same machine that is dealing with customer traffic. It's just outside the scope of this kind of database. You need to be ETL-ing your data out to handle it.
This shows the author has no idea how MongoDB aggregation works.
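For what it's worth, that particular example is a one-liner against MongoDB; a hedged sketch with made-up database, collection and field names:

    from datetime import datetime, timedelta, timezone
    from pymongo import MongoClient

    users = MongoClient("mongodb://localhost:27017")["app"]["users"]  # placeholders
    a_month_ago = datetime.now(timezone.utc) - timedelta(days=30)

    # "How many users signed up in the last month" - a plain filter is enough,
    # and an index on created_at keeps it cheap.
    signups = users.count_documents({"created_at": {"$gte": a_month_ago}})

    # Or as an aggregation, e.g. broken down per day:
    per_day = users.aggregate([
        {"$match": {"created_at": {"$gte": a_month_ago}}},
        {"$group": {"_id": {"$dateToString": {"format": "%Y-%m-%d",
                                              "date": "$created_at"}},
                    "signups": {"$sum": 1}}},
        {"$sort": {"_id": 1}},
    ])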
I don't want fresh grads to use SQL just because they learn relations (and consistency and constraints and what not). It's perfectly fine to start on MongoDB and make it the primary DB.
I have more MongoDB experience than Postgres, but my impression is that a lot of the json handling I ended up doing in mongo would have been easier and more reliable in Postgres.
Yes, you obviously can fit 10 TB of data onto a developer laptop with 10 TB of external storage. However you’ll run out of disk before you run into Postgres issues.
> This is exactly a great point. When data size goes to a billion rows, Postgres is tough. MongoDB just works without issue.
Personally, I've not seen any application that seriously needs a billion rows in a single table. (except at truly massive scale, but then you're not using Mongo)
The real solution is implementing archiving to a file store like S3 and/or ship it off to a data warehouse. You don't need billions of rows in a `record_history`/`user_audit` table going back 5 years in your production database. Nobody queries the data.
Disclaimer: I have a lot more experience with postgres than mongo. I have worked with multi billion row databases in postgres. I have not on mongo.
> When data size goes to a billion rows, Postgres is tough. MongoDB just works without issue.
Joins are tough at a billion rows in Postgres. PK lookups and the kind of simple index queries mongo is good at, Postgres is generally good at too. The main thing mongo has over postgres is ease of sharding if one is looking to scale horizontally.
> We successfully scaled MongoDB from a few users a day to millions of queries an hour.
Uh, 1 query per second is 60x60 = 3,600 queries per hour... So, 1 million queries per hour works out to roughly 280 queries per second.
So that's still very low scale - well within what the smallest possible instance handles.
A consumer laptop does 20+k queries/second on postgres, mysql etc.; a raspberry pi usually still gets 1-3k read queries/s, depending on the SD card used.
(Or, in the same units, 70+ million queries per hour for the laptop.)
You're not instilling any kind of confidence quoting numbers like that
TBH we are at around 15M queries per hour. I am sure our customers don't want us to run on RPi. Btw, it's not only query but billion+ rows which are also there.
I expected as much; usually it's me pointing out that mongodb is a decent DB depending on the data you're ingesting/storing, and its built-in clustering is significantly better than what postgres offers at the moment.
But the number was so low I couldn't help but point out that it was more likely to convince me that mongo is a joke than a usable database.
The "SQLite is just a file" thing is actually an advantage. The example of a website is actually a pretty poor one, since any website that needs to scale beyond a single box has many options. The two easiest ones are:
- Mix static and dynamic content generation (and let's face it, most websites are mostly static from a server perspective)
- Designate a writer node and use any of the multiple SQLite replication features
But, in short, if you use an ORM that supports both SQLite and Postgres you'll have the option to upgrade if your site brings in enough traffic. Which might never happen, and in that case you have a trivial backup strategy and no need to maintain, secure and tweak a database server.
You don't need to maintain, secure and tweak postgres any more than you would with SQLite. Just install it and it'll work. Postgres backup is a single command. And actually you're supposed to create sqlite backups with a special command as well; if you're copying a live file, you're doing it wrong.
I really don't see any cons with Postgres over SQLite for server applications.
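To make the backup point concrete, a small sketch; paths, the database name and the dump format are placeholders, and the SQLite side uses the backup API rather than copying the live file:

    import sqlite3
    import subprocess

    # SQLite: a consistent snapshot via the backup API
    # (the CLI equivalent is `sqlite3 app.db ".backup app-backup.db"`).
    src = sqlite3.connect("app.db")
    dst = sqlite3.connect("app-backup.db")
    src.backup(dst)
    src.close()
    dst.close()

    # Postgres: one pg_dump invocation.
    subprocess.run(
        ["pg_dump", "--format=custom", "--file=app.dump", "postgres"],
        check=True,
    )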
> You don't need to maintain, secure and tweak postgres any more than you would with SQLite.
That's not true. Postgres is another standalone process, SQLite is a library. Even if you have your service and Postgres on the same box, you need to account for yet another process that can independently go down, that is competing for resources etc...
I've never had Postgres "go down". It might if you run out of disk space, but that is going to be a bad time with any database. It is not "competing for resources" when it is running the workload your app is sending it. You may as well say Sqlite is competing for resources in that case.
Major version upgrades are not automatic, you can't just install a newer binary/library version and start it as you can for SQLite. You need to shut down the DB and run `pg_upgrade`, or write manual full export-import scripts with `pg_dump`/`pg_dumpall`/`pg_restore`/`psql`.
And good luck deciding between the different format options, as some of them are unsupported across some of these tools, some cannot export and reimport the full database cluster, there's no idempotent "just import this snapshot" operation (point-in-time restore), lack of progress reporting, etc.
Here are some notes I have on the topic:
# Note on Postgres backups
#
# Unfortunately, postgres backup+restore is not straightforward.
#
# * Backups created with `pg_dumpall`, which create an .sql file,
# cannot simply be used for point-in-time recovery.
# They need to be restored with `psql` (not `pg_restore`),
# which errors if the data already exists.
# * You could probably tell it to ignore errors, but naturally it'll just
# run through `INSERT ...`, so it's not a proper point-in-time recovery,
# because it doesn't remove data newer than the backup as expected.
# * To use `pg_restore` (which can ignore existing data, re-creating
# everything with the `--clean` flag), you need to use `pg_dump`
# (not `pg_dumpall`), which cannot backup *all* databases,
# only a single given one.
# * Further, `pg_restore` does not accept `--format=plain` SQL backups
# (the default created by `pg_dump`). Only the non-plain backups are
# accepted, which are less readable for a human to determine whether
# a given backup is the one desired to restore based on the data.
#
# As a result, we aim for restoration using `pg_restore --clean`,
# backing up only the `postgres` database using `pg_dump -d postgres`.
# This works for us because we currently store all our tables in the
# `postgres` database.
# We use `--format=tar` because it is an uncompressed format whose members are plain text, which
# * deduplicates better than compressed formats, and
# * allows a human to `grep` in plain text for desired contents.
Why isn't there a mode with which I can just tell postgres to migrate my data automatically upon startup with a newer version?
And why can't I just have postgres-as-a-library to link into my binary, like I can do with SQLite?
You also can't just run postgres as root (e.g. in a container), and have to set up UNIX users to work around that, because postgres has it hardcoded to avoid running as root. This, too, you don't need to do with SQLite.
Also, postgres is harder to secure.
You need to either use TCP and ensure that other UNIX users on the same system can't just connect, or use UNIX Domain Sockets which have a 108 char path length restriction [1] (which is of course not documented in postgres's docs [2]), so it will suddenly break your CI when its path changes from
> if you use an ORM that supports both SQLite and Postgres you'll have the option to upgrade if your site brings in enough traffic
I'll never understand this idea that Postgres and SQLite are somehow interchangeable when the time is right.
My database and Postgres are _literally_ the core definition of everything that my application does. My app is written in Rust, but that doesn't matter because it's a _Postgres_ application. I use Postgres-specific features extensively. Converting the application to SQLite would be essentially a re-write, and it would be worse in every way.
Also, I generally just don't understand this fad of running production backends on SQLite. SQLite is great for what it is, a tiny little embeddable client-side database. But it is a _terrible_ database for non trivial business applications where ref integrity, real types instead of "everything is a string", and battle-tested scaling is essential.
I don't think people often switch from Postgres to SQLite; it's probably more common (and much easier) to prototype with SQLite first and then switch.
If by referential integrity you just mean FK constraints, you can turn that on in sqlite3.
I think SQLite is pretty good for a lot of use cases. An Axum/sqlite CRUD app should be able to handle at least a few hundred requests per second on a medium powered box, which is good enough for a lot of things.
Postgres is really powerful but I don't think it's actually that common to structure your app around its unique features.
Lately I've been using sqlite in this way on small projects I'm just hacking at, but after seeing pglite this week (https://news.ycombinator.com/item?id=41224689) I'll probably give that a try next time
The fact that you use Postgres-specific features extensively is a design decision that many people would never make, regardless of their trust in the engine.
In the enterprise space, it isn't. You often have to build software that will run on different database back-ends. Just because your worldview doesn't align with other people's doesn't mean you own the truth...
IMO a downside of SQLite that isn't discussed as often as it should be is the poor support for some table operations like ALTER COLUMN. Need to change a column to null / not null? Drop a foreign key constraint? Tough luck, in some cases the only way to implement a change is recreating the table.
Even without an ORM that supports both, as long as the DB layer is reasonably separated in your application it shouldn't be too much effort to switch. And if you've scaled to the point where it matters, you probably have the resources to do so.
>> as long as the DB layer is reasonably separated in your application
I find this is easy in retrospect but tricky when you’re building a system. It’s all shades of grey when you’re building:
Should I put my queue in my DB and just avoid the whole 2PC drama (saga is a more apt word but too much opportunity for confusion in this context)?
I probably should implement that check constraint or that trigger but should I add a plugin to my DB to offer better performance and correctness of special type X or just use a trigger for that too?
Should I create my own db plugin so that triggers can publish messages themselves without going through an app layer?
In retrospect it’s easy to see when you went too far, or not far enough. At decision time the design document your team are refining starts to head past the ~10 page sweet spot limit.
That's true, even with "perfect" abstractions, switching gets more complicated as you use more complex database features.
It's only really easy if you push most of your constraints and triggers to the application. In practice, I've only ever switched databases with really simple CRUD stuff and have otherwise been able to predict that I'll eventually want Postgres/RabbitMQ/etc and build it in from the start.
Agreed. For some situations, it might well be easier to take advantage of the static nature of the site and use SQLite compared to setting up a Postgres server. For others, setting up a server could be easier than the “easy” options.
It's not worth pointing out the technical flaws in the post[1]. It is obvious the author does not have a strong grasp of the tools he is criticising. A better example of this style of post is Oxide's evaluation[2] for control plane storage that actually goes over their specific needs and context.
[1] Ok, just one, Rick Houlihan is currently at MongoDB.
> It's not worth pointing out the technical flaws in the post[1].
It might help your argument if you pointed out a real technical flaw in the content of the post, and not an example of the author being mistaken about a stranger's first name.
Sure. 1. SQLite can have more than one file when using WAL mode. 2. You don't need to know your exact DynamoDB access patterns upfront; you can evolve the schema. Again, not worth the effort to point out more.
DynamoDB is not a distributed hash map, it is a distributed forest of B-Trees. And B-Trees are what PG uses for indices and MySQL for both tables and indices. You don't denormalize with it; rather, you build and maintain table- and index-like structures, which would contain a subset of ids or other data fields, but it is the same with normal DB indexes. The difference is that you have to maintain them manually. The upside is scale, serverlessness and maybe lower latency.
I believe what the author has in mind is the fact that you need to create your LSIs at the same time as you create the table; you cannot add them later (unlike GSIs). So there's some truth to what they are saying regarding access patterns.
On the MySQL vs Postgres topic: We migrated for two reasons.
The first is that I consider everything remotely owned by Oracle as a business risk. Personal opinion and maybe too harsh, but Oracle licenses are made to be violated accidentally so you can be sued and put on the license hook once you're audited, try as you might.
But besides that, Postgres gives you more tools to keep your data consistent and the extension world can save a lot of dev-time with very good solutions.
For example, we're often exporting tenants at an SQL level and importing them somewhere else. This can turn out very weird if those are 12 year old on-prem tenants. MySQL in such a case has you turn off all foreign key validations and whatever happens happens. A lot of fun with every future DB migration is what happens. With Postgres, you just turn on deferred foreign key validation. That way it imports the dump, and if something is off it eventually complains and throws it all away. No migration issues in the future.
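For illustration, a sketch of that import path with psycopg2; it assumes the foreign keys were declared DEFERRABLE and that the dump is plain INSERT-style SQL (DSN and file name are placeholders):

    import psycopg2  # assumed driver

    conn = psycopg2.connect("dbname=tenant_import")  # placeholder DSN
    with conn, conn.cursor() as cur:
        # Checks move to COMMIT time, so the dump's rows can arrive in any
        # order without switching foreign key validation off entirely.
        cur.execute("SET CONSTRAINTS ALL DEFERRED")
        cur.execute(open("tenant_dump.sql").read())
    # The `with` block commits here; a broken dump raises instead and the whole
    # transaction rolls back rather than leaving a half-imported tenant behind.
    conn.close()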
Or the overall tooling ecosystem around PostgreSQL just feels more mature and complete to me at least. HA (Patroni and such), Backups (pgbackrest, ...), pg_crypto, pg_partman and so on just offer a lot of very mature solutions to common operational and dev-issues.
MySQL has a very mature open source HA story with the flavors of group replication, as well as being able to replicate DDL. Not to mention Orchestrator and friends.
As a matter of fact, EnterpriseDB (the largest contributor to Postgres) has a paid multi master offering, so there's anti incentives in place to improve its HA story...
Yeah there's absolutely no reason to use MySQL. You should use MariaDB if you are stuck on a MySQL project to avoid Oracle. And you should use PostgreSQL for everything new.
If Postgres already had decent temporal table support (per SQL:2011 system time + application time "bitemporal" versioning) we never would have gone down the road of building XTDB. From the perspective of anyone building applications on top of SQL with complex reporting requirements in heavily regulated sectors (FS, Insurance, Healthcare etc.), "just use temporal tables" would be the ideal default choice. To get an idea of why, see https://docs.xtdb.com/tutorials/financial-usecase/time-in-fi...
When people say "Just use SQLite. It's almost as good as Postgres and you won't need anything more" I'm trying to understand why I shouldn't just use Postgres. It's not like it's hard to install or has any significant overhead. Please enlighten me.
Go with Postgres. SQLite is somewhat barebones: you are getting like 5 datatypes grand total with it (counting null as a separate type), one type of index, and somewhat unexplored tooling around it. Also column types are not strict and you can write strings into an integer column. With Postgres you are getting a very established and rich ecosystem, with various and VERY optimized data types, indexes, extensions (like postgis), tooling around it, and lots of tutorials on how to fix it if something goes wrong.
P.S. but if your project doesn't need any of that, e.g. it's a desktop audio player, just embed sqlite - that's one less subsystem to care about
Postgres isn't hard to use, but it requires maintenance. You need more scripts, more tooling, more knowledge of DBA, and that may not be necessary. When you're using a database to store a few thousand rows of data, Postgres quickly becomes overkill. Postgres is a V8 engine when all you may need to power is a lawn mower.
I personally prefer to use abstractions like ORMs for most of my database interactions, and direct SQL when those abstractions get in the way (by generating expensive queries and not finding an easy fix without a large refactor).
This way, starting out with sqlite (good enough for most websites I reckon, easy to backup) doesn't interfere with any necessary migration to postgres (like when the need for scaling arises). This also makes setting up tests easier (except for the manually written SQL) because starting an application with a temporary in-memory database is a lot faster than starting a full container.
Unless I'm doing native apps, I'll probably always want to reserve the ability to use Postgres. Sometimes that means hooking up a Postgres account and such, but often that just means sticking with sqlite and leaving my options open for when sqlite doesn't work anymore.
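As a sketch of that "keep the options open" setup with SQLAlchemy (URLs and the table definition are made up), the same code runs against in-memory SQLite in tests and Postgres in production simply by changing the URL:

    import os
    from sqlalchemy import (Column, Integer, MetaData, String, Table,
                            create_engine, select)

    # e.g. DATABASE_URL=postgresql+psycopg2://app@localhost/app in production
    db_url = os.environ.get("DATABASE_URL", "sqlite:///:memory:")

    engine = create_engine(db_url)
    metadata = MetaData()
    notes = Table("notes", metadata,
                  Column("id", Integer, primary_key=True),
                  Column("body", String(200)))
    metadata.create_all(engine)  # emits the right DDL for either backend

    with engine.begin() as conn:
        conn.execute(notes.insert(), [{"body": "hello"}])
        print(conn.execute(select(notes)).fetchall())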
> Postgres isn't hard to use, but it requires maintenance. You need more scripts, more tooling, more knowledge of DBA, and that may not be necessary.
I don't think Postgres needs to be maintained at all for small databases, which is usually the use case for SQLite. Their default configurations would take care of most things for trivial applications.
> Starting an application with a temporary in-memory database is a lot faster than starting a full container.
Starting a container might be way slower than SQLite, but I would still consider it fast for most, if not all use cases.
> hooking up a Postgres account
You can configure Postgres to start up in trust mode, which doesn't require a password for any user. This is basically the same as the unencrypted SQLite database file but with a fixed connection string: `postgresql://postgres@localhost`
> It's not like it's hard to install or has any significant overhead.
Depends on the environment or lack thereof, postgres is a pain in the ass on windows, and then you need support for software configuration so that it can talk to postgres, and then you have to take care of the features you're using.
If you're deploying a complex server-side system with lots of moving parts, then yes postgres is basically free. But if you're deploying client-side, or want to run it in a VPS, or whatever, postgres might go from not available to extra cost to a huge chore.
> Just use SQLite. It's almost as good as Postgres and you won't need anything more
Can't say I agree with that sentiment in any way though, every time I use it sqlite frustrates me in its limitations compared to postgres, and how weak the defaults are from a safety and consistency perspective.
> Depends on the environment or lack thereof, postgres is a pain in the ass on windows, and then you need support for software configuration so that it can talk to postgres, and then you have to take care of the features you're using.
This is why everyone uses docker and .env files. The problem has already been solved and you can copy/paste starter files from project to project to make it a non issue.
I would use SQLite if the embeddable part makes things considerably easier. For example, in a desktop or mobile application where a single application process is easier, because it fits the deployment model better.
Really just reads as an article reaffirming his own bias. For Mongo at least most of it is wrong.
- Secondaries are read replicas and you can specify if you want to read from them using the drivers selecting that you are ok with eventual consistency.
- You can shard to get a distributed system but for small apps you will probably never have to. Sharding can also be geo-specific so you query for french data on the french shards etc., lowering latency while keeping a global unified system.
- JSON schema can be used to enforce integrity on collections.
- You can join, but I definitely don’t recommend it if you can avoid it.
- I personally like the pipeline concept for queries and wish there was something like this for relational databases to make writing queries easier.
- The AI query generator based on the data using Atlas has reduced the pain of writing good pipelines. ChatGPT helps a lot here too.
- The change streams are awesome and have let us create a unified trigger system that works outside of the database, and it’s easy to use (see the sketch below).
We run postgres as well for some parts of the system and it also is great. Just pick the tool that makes the most sense for your usecase.
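Here is the change-stream sketch referred to above, using pymongo; it needs a replica set (or Atlas), and the names are placeholders:

    from pymongo import MongoClient

    orders = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")["shop"]["orders"]

    # Behaves like a database-side trigger delivered to application code.
    pipeline = [{"$match": {"operationType": {"$in": ["insert", "update"]}}}]
    with orders.watch(pipeline, full_document="updateLookup") as stream:
        for change in stream:
            print(change["operationType"], change["documentKey"])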
Okay, I am very sorry that I got Rick Houlihan's name wrong.
In my defense, I hadn't watched his talks _recently_ and we've all been Berenstain Bear'ed a few times.
But also the comparison of DynamoDB/Cassandra to MongoDB comes directly from his talks. He currently works at MongoDB. I understand MongoDB has more of a flowery API with some more "powerful" operators. It is still a database where you store denormalized information and therefore is inflexible to changes in access patterns.
You can store normalized information. What you’re saying is still wrong. With respect to schema you can use Atlas Schema. If you’re not really familiar you shouldn’t make these comparisons IMO.
> It is still a database where you store denormalized information and therefore is inflexible to changes in access patterns.
It is flexible and you don't need to know your exact access patterns upfront. It may not be as flexible as your chosen technology, but that doesn't make your statement true.
Depends on your workload, immutable tuples and vacuum can hurt a lot. (Although that got much better recently) Also the mad decisions about the query planner which you can't control. Often this doesn't matter, but it's worth being aware.
Oh the contortions I've gone through with the query planner to get it not to do things.
Also, if you're using JSONB, long strings or other toast entries, your query plan and your performance will be wildly divorced since the planner doesn't factor in a lot of the toast IO and associated memory management. The lesson for others here is if you have JSONB/long text fields, store them in their own table.
It’s because PHP’s popularity is old and Postgres used to be mid. MySQL was faster and better than Postgres.
But Postgres slowly improved and then got better than MySQL while MySQL stagnated. The most basic bugs persisted, basic features never got added and consistency never seemed to be a point of improvement.
Not sure about Apache, but Linux, MySQL and PHP is still the most common combo in terms of number of sites running it. Wordpress alone is enough to establish that.
Their reasoning is that some platforms like Heroku do not support SQLite.
Why use those then and not a platform that supports it, like Glitch?
I have used Postgres, MySql etc, but having the project storage in a single file is making things so much easier, I would never ever want to lose that again.
For MySQL, for smaller deployments, I've found Galera to really be a handy HA system to get going:
> Galera Cluster is a synchronous multi-master database cluster, based on synchronous replication and MySQL and InnoDB. When Galera Cluster is in use, database reads and writes can be directed to any node. Any individual node can be lost without interruption in operations and without using complex failover procedures.
> For MySQL, for smaller deployments, I've found Galera to really be a handy HA system to get going:
Well, at least if you don't value your data.
(https://aphyr.com/posts/327-jepsen-mariadb-galera-cluster; Galera failed Jepsen testing in 2015 and the bug is still open with a 2022 mention of basically “we have experimental support [for actually providing the data consistency we promise], but it's not clear if it's worth it because it will be very slow”)
SunOS is/was a BSD-based operating system, and Linux replaced Sun servers that were running SunOS with Intel servers, in the same way as Linux-based servers replaced most server operating systems in data centers (and today cloud providers), like Ultrix, IRIX, HP-UX etc. I meant "do" in this way.
I'd like to mention that CouchDB is really useful for one reason - a very robust sync story with clients, and a javascript version called PouchDB that can run on the browser and do bidirectional sync with remote Couch instances.
This can be done with sqlite by jumping through a few extra hoops, and now with in-browser WASM postgres, there as well with a few more hoops, but the Couch -> Pouch story is easy and robust.
> It's annoying because, especially with MongoDB, people come into it having been sold on it being a more "flexible" database. Yes, you don't need to give it a schema. Yes, you can just dump untyped JSON into collections. No, this is not a flexible kind of database. It is an efficient one.
I really like this sentence because it perfectly encapsulates a mistake that, I think, people do when considering using MongoDB.
They believe that the schemaless nature of NoSQL databases is an advantage because you don't need to do migrations when adding features (adding columns, splitting them, ...). But that's not why NoSQL databases should be used. They are used when you are at a scale where the constraints of a schema become too costly and you want your database to be more efficient.
Absolutely this. The author seems blind to Postgres shortcomings.
For example, the author notes MySQL has "features locked behind their enterprise editions." That is true for some features in MySQL, yes. But the same thing is true in Postgres for DDL logical replication and other HA-related features, which are only in EDB Postgres -- and yet those are features MySQL has had in open source for over two decades.
Every database has issues and quirks whether they be about how you design your application, how you need to scale, or how you need to maintain your database. You can play this game “just use XYZ and have no problems”, but it isn’t realistic. Production databases at scale require heavy dedicated infra to stay highly available and performant, and even out of the box solutions require you to understand what is going on and tune them else you run into “surprises” which are almost always that no one RTFM. Pretty much every mainstream database is capable of both highly available and highly consistent workloads at scale. The storage engine largely shouldn’t matter as much as the application tuning.
SQLite is easy to back up, especially if you are OK with write locking for long enough to copy a file. It now has a backup API too if you are not OK with that.
Lots of things do not scale enough to need more than one application server. A lot of the time, even though I mostly use Postgres, the DB and the application are on the same server, which gets rid of the difficulties of working over a network (more configuration, more security issues, more maintenance).
The main reasons I do not use SQLite are its far more limited data types and its lack of support for things like ALTER COLUMN (other comments have covered these individually).
It has never made sense to me why someone uses NoSQL, and then proceeds little by little to turn their implementation into one with relations.
It's way less work just to learn SQL or an ORM.
Nosql is great at being a document store.
I’ve used MySQL longer and it’s been a good default option, but the jump to how Postgres works and what it offers is too much to ignore.
Postgres can act as a queue, has many of the functions that a NoSQL store has, can handle being an embedding DB, and does so up to a decent volume. It can be the backbone of many low-code tools like Supabase, Hasura, etc. The only thing that’s different is there seems to be nice currents for MySQL but you get the hang of it pretty quick.
I can’t speak to the official decisions made by these camps/courses, but from my own experience as an undergrad, I was first introduced to MySQL, and the professors at my university did not teach using migration management tools for bringing a schema up in a database. You were either using a GUI to set up the tables, or running your own cobbled-together sql files. For class assignments this was fine. Then I had a professor introduce mongo to me. I was floored by the idea of having my schema live alongside the application code! No more messing around in SQL GUIs! Then of course over time I realized you still need to maintain a schema over time and provide some way to “upgrade” data when your schema evolves, and keep your data consistent. Then I discovered the tools around migrating mongo data are not nearly as mature as the ones you’ll find for SQL databases.
I find mongo alright at producing a short-lived prototype of an application (e.g. school assignments), but shipping it to production for a long period is too risky for the “benefit”.
I've found that key/value databases push for better architectural designs in enterprise environments. Especially in companies where different teams are responsible for a given business capability and it needs to scale above 1+ million users.
Postgres's flexibility enables designs that are hard to scale, both in terms of maintainability and performance. Enforcing K/V as the default database in one of my previous companies worked wonders.
I don't think there's anything unscalable about Postgres, or RDBMS's in general. I've seen even poorly tuned Postgres with unnormalised table designs work fine at a decent scale, to the point where I'm convinced that Postgres with a decent table design gets you very far.
As in: far enough that if you outscaled it, you'd be able to afford a team of excellent engineers to write an appropriate database system.
Almost all companies don't need the hyper scaling NoSQL databases supposedly promise. What they do often eventually realise is that they want the querying power and additional ACID guarantees of a typical relational database, so they end up developing a shitty relational database on top of a NoSQL database.
My current company uses Postgres by default and we have a lot of different usecases. Again, another million+ users company with 10+ countries. It does indeed scale.
The problem is how people think about SQL vs K/V. They fall into the normalization trap a lot and create complex procedures and read operations. This usage causes once-a-month DB CPU spikes and the occasional incident.
We are currently advocating for; de-normalized tables with K/V usage of Postgres and pushing the complexity to the application layer. Essentially, use Postgres at its bare minimums.
In short, to make Postgres scale; you essentially need to forget your "expert SQL knowledge" and use it as a K/V.
I think a balance is needed. Completely denormalised isn’t a good idea because if you have a single table with a large row, but a single column receives most the updates and it updates a lot, you’re going to have tons of dead tuples and write churn/overhead whenever updating a row in that table.
But I agree that some people go too far with normalisation. When done reasonably, with awareness of access patterns and application behaviour, I think it’s important though.
It mostly revolves around; understanding your primary business concept that write operations revolve around (aggregate root) and duplicating data for different read scenarios (view models).
For example imagine you have an "E-commerce" product which you can change details about. The "Product" would be a write-model that you store as K/V. It would accept operations such as; "change price", "change category" etc. Your key would be "product id" and the value would be the whole object represented as json etc.
For every write operation you would read the write-model from the database, deserialize, modify it, put it back. Changes to the write-model would trigger events and you could build different read-models to access the data.
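A minimal sketch of that read-modify-write loop on top of a plain Postgres jsonb table, using psycopg2; the table, column and key names are assumptions, not a prescription:

    import psycopg2                      # assumed driver
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=shop")   # placeholder DSN
    with conn, conn.cursor() as cur:
        # The whole aggregate lives in one row: key = product id, value = document.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS products (
                id  text  PRIMARY KEY,
                doc jsonb NOT NULL
            )""")

        # Read the write-model; FOR UPDATE serialises concurrent writers on the key.
        cur.execute("SELECT doc FROM products WHERE id = %s FOR UPDATE", ("p-1",))
        row = cur.fetchone()
        product = row[0] if row else {"id": "p-1", "price": 0, "category": None}

        product["price"] = 1999          # the "change price" operation

        cur.execute("""
            INSERT INTO products (id, doc) VALUES (%s, %s)
            ON CONFLICT (id) DO UPDATE SET doc = EXCLUDED.doc
        """, ("p-1", Json(product)))
    # COMMIT happens here; this is also where an event could be published so the
    # read-models get rebuilt.
    conn.close()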
* You know exactly what your app needs to do, up-front
But isn't this true of any database? Generally, adding a new index to a 50 million row table is a pain in most RDBs. As is adding a column, or in some cases, even deleting an index. These operations usually incur downtime, or some tricky table duplication with migration process that is rather compute + I/O intensive... and risky.
50M rows is really not that much, I’d guesstimate an index creation to take single-digit minutes.
None of these operations I’d expect to cause downtime, or require table duplication or to be risky
Edit: to be fair, you’re right there are footguns. Make sure index creation is done concurrently, and be careful with column defaults that might take a lock. It’s easy to do the right thing and have no problem, but also to do the wrong thing and have downtime.
Newer versions of Postgres also support dropping indexes concurrently. I recommend using the concurrently option when dropping unused or unneeded indexes on any table with active writes and reads.
https://www.postgresql.org/docs/current/sql-dropindex.html
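Roughly like this via psycopg2 (names are placeholders); the one footgun is that CONCURRENTLY refuses to run inside a transaction block, hence autocommit:

    import psycopg2  # assumed driver

    conn = psycopg2.connect("dbname=app")   # placeholder DSN
    conn.autocommit = True                  # CONCURRENTLY can't run in a transaction
    with conn.cursor() as cur:
        # Build the new index without holding a long write-blocking lock...
        cur.execute("CREATE INDEX CONCURRENTLY IF NOT EXISTS users_created_at_idx "
                    "ON users (created_at)")
        # ...and retire an unused one the same way.
        cur.execute("DROP INDEX CONCURRENTLY IF EXISTS users_old_email_idx")
    conn.close()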
Totally agree - I have tried many databases of all flavors, but I always come back to Postgres.
HOWEVER - this blog post is missing a critical point.... the quote should be:
---> Just use Postgres
AND
---> Just use SQL
"Program the machine" stop using abstractions, ORMs, libraries and layers.
Learn how to write SQL - or at least learn how to debug the very good SQL that ChatGPT writes.
Please, use all the very powerful features of Postgres - Full-Text Search, Hstore, Common Table Expressions (CTEs) with Recursive Queries, Window Functions, Foreign Data Wrappers (FDW), put JSON in, get JSON out, Array Data Type, Exclusion Constraints, Range Types, Partial Indexes, Materialized Views, Unlogged Tables, Generated Columns, Event Triggers, Parallel Queries, Query Rewriting with RULES, Logical Replication, Policy-Based Row-Level Security (RLS), Publication/Subscription for Logical Replication.
Push all your business logic into big long stored procedures/functions - don't be pulling the data back and munging it in some other language - make the database do the work!
All this stuff you get from programming the machine. Stop using that ORM/lib and write SQL.
EDIT:
People replying saying "only use generic SQL so you can switch databases!" - to that I say - rubbish!
I nearly wrote a final sentence in the above saying "forget that old wives tale about the dangers of using a databases functionality because you'll need to switch databases in the future and then you'll be stuck!"
Because the reason people switch databases is when they switch to Postgres after finding some other thing didn't get the job done.
The old "tut tut, don't use the true power of a database because you'll need to switch to Oracle/MySQL/SQL server/MongoDB" - that just doesn't hold.
your answer was fine and good until you classified ChatGPT's SQL generation as "very good" -- which it is not. I've had _all_ GPT models spit out monstrosities and slow queries of all kinds.
ORMs are not all bad. In fact, some ORMs generate better code for really complex joins (think hundreds of tables, each with hundreds of columns) than humans, and often ensure that trivial best practices (like indexes and consistent foreign keys) are followed.
Writing SQL is a great skill, but if you tie yourself to a single database engine's idioms then you're in for a shock when you switch platforms/jobs/environments.
"Push all your business logic into big long stored procedures/functions - don't be pulling the data back and munging it in some other language - make the database do the work!"
From my courses I had at university, I've been led to believe that the current trend is doing hexagonal architecture, as that allows for better modularisation of the project and helps keep code clean over many years with many software engineers coming in and out.
As a part of that I've been taught that the only part you could then trust is your internal modules - and even the database has to be treated as an external source, whose only job is to pull data in and out.
How does that work in what you're suggesting? Is it just a different way of approaching things that will work depending on what's your goal is?
I'm just curious about this as I'm trying to get myself to learn a bit more, just to clarify
The model described in the parent comment is essentially using the database (in this case PostgreSQL, but any RDBMS would do) as the hexagonal "core" in which adapters plug in to. This is a powerful pattern that works very well when you use the full features of the RDBMS like constraints, triggers, views, etc. This does require "coupling" to the RDBMS-specific features, which makes migrating to an alternative system difficult, but in practice this rarely happens if you choose a strong RDBMS from the beginning.
You can certainly use the database as a "dumb storage" tool in the hexagonal architecture, that is, as just another adapter. But most of the time you'll end up re-creating RDBMS features in poorly written/documented application code that has to interact with the database anyways. Why not just do it all in the database? With an RDBMS core, hexagonal adapters can be pure functional components, making them much easier to reason about and maintain.
For more on this idea, and how to avoid pitfalls with the hexagonal pattern, I recommend reading Out of the Tar Pit [1]. It's a short but highly influential paper on "functional relational programming".
I see it, that makes sense! Right, I don't think that you'd have a reason to swap RDBMS unless licensing issues come up, like with Oracle. Thank you for helping out.
> "Push all your business logic into big long stored procedures/functions - don't be pulling the data back and munging it in some other language - make the database do the work!"
This is one of the categories of opinions that I’ve heard, the proponents of which suggest that databases will typically be more efficient at querying and transforming data, since you’ll only need to transfer the end result over a network and will often avoid the N+1 problem altogether.
You probably don’t want some reporting or dashboard functionality in your app to have to pull tens or hundreds of thousands of rows to the back end, just because you have decided to iterate over the dataset and do some transformations there.
That said, I’ve worked in an app where the Java back end only called various stored procedures and displayed their results in tables and while it was blazingly fast, the developer experience was miserable compared to most other projects I’ve worked with - lots of tables with bad naming (symbol limits in that RDBMS to thank), badly commented (not) procedures with obscure flags, no way to step through anything with a debugger, no proper logging, no versioning or good CI tooling, no good tools for code navigation, no refactoring suggestions, no good tracing or metrics, nothing.
Sure, it might have just been a bad codebase, but it was worse than most of the ones where too much logic is in the back end, those just run badly, so I get the other category of opinions, which suggests that trying to use the DB for everything isn’t a walk in the park either.
There’s probably a good balance to be found and using tools in ways that both perform okay and don’t make the developer experience all that bad.
Ofc I reserve the right to be wrong, just wanted to share my subjective experience, that there can be tradeoffs and there probably aren't any silver bullets.
For the most part, I think that you should put any mass/batch processing in the DB (just comment/version/test/deploy your code like you would on the back end, as best as you can with the tools available to you) and don't sweat too much about handling the CRUD operations in your back end, through whatever ORM you use or don't use (regular queries are also fine, as long as parametrized to prevent injection).
For complex schemas, a nice approach I've found is making one DB view per table/list/section of your front end, so you only need 1 DB call to load a particular component, otherwise the N+1 risk gets far greater ("Oh hey, I got this list of orders, but each order needs a delivery status, so I'll just iterate over those and fetch them for each item, whoops, the DB is spammed with requests.").
You gain: Model consistency guaranteed by the database, your backend basically only acts as an external API for the database.
You lose: Modularity, makes it harder to swap out databases. Also, you have to write SQL for business logic which many developers are bad at or dislike or both.
I've seen a system running on this approach for ten years and it survived three generations of developers programming against this API. There's Python wx frontends, web frontends, Rust software, Java software, C software, etc. They all use the same database procedures for manipulating the model so it stays consistent. Postgres is (kinda, not very) heavy for small projects but it scales for medium up to large-ish projects (where it still scales but not as trivially). One downside I've seen in this project is that some developers were afraid to change the SQL procedures so they started to work around them instead of adding new ones or changing the existing ones. So in addition to your regular work horse programming language you also have to be pretty good at SQL.
With Pglite in a few years you'll be able to treat Postgres as a library, and merely allow selecting between in-process and remote via configuration.
The first job of a database is to be a data structure for persisting data, but you're allowed to extend said data structure in your own code. As long as you can come up with a way to keep all the code in version control, test it, etc., it's fine.
Great list of Postgres features called out that highlight the extensive feature set.
Most of these are covered in my book, for anyone that’s interested in learning them. The book uses a Ruby on Rails app with Postgres instances for examples and exercises. Hope the plug is ok here as some folks may be looking for learning resources for Postgres.
https://andyatkinson.com/pgrailsbook
You should use the most powerful and convenient language you have at your disposal, which is most likely the host language.
ORMs are funny things, it's like we got stuck in the idea of making the database object oriented. MongoDB just means we don't have to pretend anymore, not that it was a good idea.
It is perfectly possible to use relational concepts in a general purpose language. Tables, Columns, Foreign Keys, Records, Indexes, Queries etc. And you can build whatever Model abstractions you need on top of that; or not, for simple CRUD you don't really need a type system.
I usually build that layer along with the foundation of the application, it still evolves slightly every time around but the basics are very tried and proven by now.
Try to avoid the bespoke features of psql in favor of generic SQL unless cornered by circumstances into doing so, methinks.
If there's one complaint I have about pg, it's that it has too many features that encourage finding cute, non standard, non obvious ways of going about things.
> Try to avoid the bespoke features of psql in favor of generic SQL unless cornered by circumstances into doing so, methinks.
Why? To make migration to another database easier? I've never had the need to migrate any application away from postgres. I usually take full advantage of what the database can do.
I’m a proponent of vendor lock in is not a big deal - you’re not going to switch from AWS to Azure on a whim and if you do, the fact that you’re using ecs instead of k8s isn’t going to slow you down.
But data ownership is the one place I get iffy. What if your db does a rug pull and changes licenses? There’s certainly precedent in this space for that.
Stored procedures suck. First of all SQL is a dubious language to write business logic in because it doesn’t have static typing and other goodies we expect nowadays. But more importantly stored procedures tend to drift out of version control. So please don’t.
Still a lot more complicated to deal with in every aspect, obviously so to anyone who has long term experience with them, what are you gaining from pretending otherwise?
For everyone saying, “Just use SQLite”, how do you deal with pathological queries causing a denial of service? SQLite is synchronous, so you end up blocking your entire application when a query takes a long time. It’s a problem in Postgres, too, especially if the query involves table locks, but with Postgres your app can generally hobble along.
Why does it even matter? I know that I need multimodal search in my product, and that is why I need a vector DB. You're not saying anything interesting by saying "AI is a bubble". If you say something like I may not actually need RAG/multimodal/semantic search/a dedicated vector db then you may have my attention.
so like if the funding for AI disappears then somehow my requirement of multimodal search also disappears, and with it all the existing solutions, some NOT VC funded, like pgvector?
I wish postgres had a library only mode that directly stored to a file like sqlite. That'd make starting development a lot easier since you don't have to jump through the hoops of setting up a postgres server. You could then switch to a "proper" DB when your application grows.
It's literally 4 lines of Python code calling subprocess.Popen to start a PostgreSQL server for a given database directory and connecting to it via a Unix socket on the filesystem. However, you can't launch multiple concurrent instances like this.
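Something along those lines, assuming `initdb -D pgdata` has already been run and the postgres binaries are on PATH; the port, socket directory and the sleep (instead of a proper `pg_isready` poll) are placeholders:

    import subprocess
    import time

    import psycopg2  # assumed driver

    server = subprocess.Popen(
        ["postgres", "-D", "pgdata", "-k", "/tmp", "-p", "5433"]  # socket in /tmp
    )
    time.sleep(2)  # crude; poll pg_isready in anything real

    conn = psycopg2.connect(host="/tmp", port=5433, dbname="postgres")
    print(conn.server_version)
    conn.close()

    server.terminate()  # SIGTERM asks the server to shut down
    server.wait()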
eg on RDS, they'll give you instances with 1TB of RAM, eg a `db.r6idn.32xlarge`, at the nice price of $75/hr ($54k/mo). Not to mention that, in a microservices architecture, assuming you're not sharing a database, you might be multiplying that figure out a few times.
So just because it's possible for it to fit in RAM doesn't mean it's economical. RAM isn't exactly getting exponentially cheaper or more spacious anymore. The hope was flash memory would be the solution, but not sure how far that's getting these days.
Personally, most of the projects I do are on self-hosted servers. The traffic isn't big. In such cases sqlite has been way better than postgres. Many times I see postgres not used well. It's meant for big projects, not small ones.
I don't hate SQL and I agree for many applications it makes sense, but I disagree 100% with "default to a SQL database" (like Postgres). Instead, figure out what you need based on your app.
Recently I had the opportunity to rewrite an application from scratch in a new language. This was a career first for me and I won't go into the why aspect. Anyway, the v1 of the app used SQL and v2 was written against MongoDb. I planned the data access patterns based on knowledge that my DB was effectively document/key/value. The end result: it is much simpler. The v1 DB had like 100+ tables with lots of relations and needs lots of documentation. The v2 DB has like 10 "tables" (or whatever mongo calls them) yet does the same thing. Granted, I could have made 10 equivalent SQL tables as well but this would have defeated the purpose of using SQL in the first place. This isn't to say MongoDB is "better". If I had tons of fancy queries and relations I needed it would be easier with SQL, but for this particular app, it is a MUCH better choice.
TL;DR Don't default to anything, look at your requirements and make an intelligent choice.
I'm always cautious with a one-size-fits-all approach. If a team is working on a small project and SQLite works then great. You can use a SQLite database on something like a $4/month DigitalOcean droplet. Can't say the same for Postgres.
> AI is a bubble
Many say this but Generative AI and LLMs have gotten bunched up with everything else. There is a clear need for vectors and multimodal search. There is no core SQL statement to find concepts within an image for example. Machine learning models support that with arrays of numbers (i.e. vectors). pgvector adds vector storage and similarity search for Postgres. There was a recent post about storing vectors in SQLite (https://github.com/asg017/sqlite-vec).
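For example, a hedged pgvector sketch via psycopg2 - it assumes the extension is installed on the server, and the 3-dimensional vectors and names are toys (real embeddings have hundreds or thousands of dimensions):

    import psycopg2  # assumed driver

    conn = psycopg2.connect("dbname=app")   # placeholder DSN
    with conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("CREATE TABLE IF NOT EXISTS items ("
                    "id bigserial PRIMARY KEY, caption text, embedding vector(3))")
        cur.execute("INSERT INTO items (caption, embedding) VALUES (%s, %s)",
                    ("a red bicycle", "[0.9, 0.1, 0.0]"))
        # Nearest neighbours by L2 distance; the '<->' operator comes from pgvector.
        cur.execute("SELECT caption FROM items ORDER BY embedding <-> %s LIMIT 5",
                    ("[1.0, 0.0, 0.0]",))
        print(cur.fetchall())
    conn.close()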
> Even if your business is another AI grift, you probably only need to import openai.
There's much more than this. There are frameworks such as LangChain, LlamaIndex and txtai (disclaimer I'm the primary author of https://github.com/neuml/txtai) that handle generating embeddings locally or with APIs and storing them in databases such as Postgres.
Why can't you run Postgres on a 4 USD droplet? They seem to have 512 MB RAM, that is enough for a basic Postgres instance and a HTTP application server.
Yes, one very common use case where Postgres does not scale well is analytics. Snowflake, Vertica, and ClickHouse work much better. I say this after working on several projects where development teams hit a wall with row storage databases. Still, PostgreSQL is great, it's my default DB as well.
Here we go again...
Just use X, forever, in all cases, is misguided whatever X is - a database, programming language, ... a vehicle.
PostgreSQL is good for many things, and "default to PostgreSQL and use something else if clearly justified" is sound advice, but assuming there is no room for anything else but PostgreSQL is not.
> Missing sqlite comparison point: data types. SQLite is like JS with column datatypes, except even looser.
Also defaults (see the sketch after this list):
- sqlite has STRICT tables, you have to opt in, per table.
- sqlite does not check foreign keys by default, you have to opt in, per connection.
- sqlite has WAL mode, you have to opt in, per database. And even with that you may want / need to add a fair amount of work to ensure you're not upgrading connections lazily (fecking SQLITE_BUSY).
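A sketch of those opt-ins from Python's sqlite3 module (STRICT needs SQLite 3.37+; file and table names are placeholders):

    import sqlite3

    conn = sqlite3.connect("app.db")

    # Per-connection opt-in: FK enforcement is OFF unless you ask for it.
    conn.execute("PRAGMA foreign_keys = ON")

    # Per-database opt-in: WAL mode persists once set on the file.
    conn.execute("PRAGMA journal_mode = WAL")

    # Per-table opt-in: STRICT rejects e.g. strings stuffed into INTEGER columns.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS users (
            id  INTEGER PRIMARY KEY,
            age INTEGER NOT NULL
        ) STRICT
    """)

    conn.execute("INSERT INTO users (age) VALUES (?)", (42,))          # fine
    # conn.execute("INSERT INTO users (age) VALUES (?)", ("forty",))   # raises an error
    conn.commit()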
There is absolutely no reason you can't make SQLite go all the way. Starting with it is the only thing that makes sense to me.
It is certainly a higher performance solution in the fair comparison of a hermetically sealed VM using SQLite vs application server + Postgres instance + Ethernet cable. We're talking 3-4 orders of magnitude difference in latency. It's not even a contest.
There are also a lot of resilience strategies for SQLite that work so much better. For instance, you can just snapshot your VM in AWS every x minutes. This doesn't work for some businesses, but you can also use one of the log replication libraries (perhaps in combination with snapshots). If snapshots work for your business, it's the most trivial thing imaginable to configure and use. Hosted SQL solutions will never come close to this level of simplicity.
I personally got 4 banks to agree to the snapshot model with SQLite for a frontline application. Losing 15 minutes of state was not a big deal given that we've still not had any outages related to SQLite in the 8+ years we've been using it in prod.
Surprised to hear that losing 15 min of state is not a big deal in a banking context.
I haven't worked with banks before, genuinely curious, how do they recover from something like this? Wouldn't this potentially destroy all transactions made in that time period?
To be fair they said "frontline application", not a transaction processor specifically. Also in cases where you can't lose stuff it's common to use reliable stream processing, so maybe they can reprocess old events into the system running sqlite. (This is extra general info about "important systems"; I don't know what OP is running)
At best this will increase the load on customer support with an associated reputational hit. At worst you will need to deal with questions from financial regulators.
> There is absolutely no reason you can't make SQLite go all the way.
No reason you can't, but there are a few reasons to consider whether you should. If you're using libraries which make assumptions about your database layer, they may not like the sqlite model. Holding the write lock for too long is something I experienced in quite a few projects. For example paperless-ngx will block the web interface while batch importing documents, even though there's really no reason to do that.
It's less of an issue if you write all your own code and you explicitly target sqlite. But worth keeping in mind.
Why would you run Postgres on a different box in such a scenario? A docker run command to get a Postgres instance up and running isn’t any more complicated than linking in Sqlite, maybe even simpler. And you get proper ACID transactions for free.
Sure. I think the caveats I mentioned are real though.
Using SQLite locks you out of some platforms as a service and out of some application architectures. It also means you need to be doing stuff like snapshotting your VM every x minutes.
Given minimal certainty about the scale, future, and properties of your application and organization: Postgres is the better default.
I like this sentence way more than I should.