
I run a 100 billion+ row Postgres database [0] that is around 16TB; it's pretty painless!

There are a few tricks that make it run well (PostgreSQL compiled with a non-standard block size, ZFS, careful VACUUM planning). But nothing too out of the ordinary.

ATM, I insert about 150,000 rows a second, run 40,000 transactions a second, and read 4 million rows a second.

Isn't "Postgres does not scale" a strawman?

[0] https://www.merklemap.com/




People run Postgres at a scale that is much "larger" than what you are running, which isn't to say that your workload isn't substantial; it is. But there are folks who push it harder, both in terms of writes/rows and workload complexity. It's one thing to write a ton of rows into a single table; it's another thing entirely to write into many tables, index all of them, and then query them at scale. You didn't mention much about your workload complexity, so I'm not trying to suggest it isn't complex, but there are certainly plenty of folks on this forum who can attest that high-write workloads are tricky to scale in large organizations with many stakeholders and complex query loads.


> it's another thing entirely to write into many tables, index all of them and then query them at scale

Well, that’s pretty much what I am doing.


> PostgreSQL compiled with a non-standard block size

Do you think this could become less important for your use case with the new PG17 "I/O combining" stuff?

https://medium.com/@hnasr/combining-i-os-in-postgresql-17-39...


No, because the custom block size is about reducing zfs r/w amplification.


One of the things I find challenging is understanding the meaning of the word "scales". It is sometimes used differently in different contexts.

Can it be performant in high-load situations? Certainly. Can it elastically scale up and down based on demand? As far as I'm aware, it cannot.

What I'm most interested in is how operations are handled. For example, if it's deployed in a cloud environment and you need more CPU and/or memory, you have to eat the downtime to scale it up. What if it's deployed to bare metal and it cannot handle the increasing load anymore? How costly (in terms of both time and money) is it to migrate it to bigger hardware?


When it "scales", it usually means "scales up". A scalable solution is such that can withstand a large and increasing load, past the usual limitations of obvious solutions.

Being elastic is nice, but not always needed. In most cases of database usage, downsizing never happens, or expected to happen: logically, data are only added, and any packaging and archiving only exists to keep the size manageable.


You’re conflating things. The question was about scaling compute and memory up and down based on load and you’re commenting about never needing to downsize on storage.


Expecting a database to scale dramatically up and down /under load/ and perform the same as at steady state seems a bit weird, vs. a single, beefy database with a beefy ingest job and a bunch of read-only clients searching it?

Like, you're more likely to encounter two phases (building the DB in heavy-growth mode, and using the DB in light-growth, heavy-read mode).

A business that doesn't quite yet know what size the DB needs to be has a frightening RDS bill incoming.


Damn, that’s a chonky database. Have you written anything about the setup? I’d love to know more— is it running on a single machine? How many reader and writer DBs? What does the replication look like? What are the machine specs? Is it self-hosted or on AWS?

By the way, really cool website.


I'll try to get a blog post out soon!

> Damn, that’s a chonky database. Have you written anything about the setup? I’d love to know more— is it running on a single machine? How many reader and writer DBs? What does the replication look like? What are the machine specs? Is it self-hosted or on AWS?

It's self-hosted on bare metal, with standby replication, normal settings, nothing "weird" there.

6 NVMe drives in raidz-1, 1024GB of memory, a 96-core AMD EPYC CPU.

A single database with no partitioning (I avoid PostgreSQL partitioning as it complicates queries and weakens constraint enforcement, and IMHO does not provide many benefits outside of niche use cases).

> By the way, really cool website.

Thank you!


> A single database with no partitioning (I avoid PostgreSQL partitioning as it complicates queries and weakens constraint enforcement, and IMHO does not provide many benefits outside of niche use cases).

That's kind of where I'm at now... you can vertically scale a server so much now (compared to even a decade ago) that there's really no need, IMO, to bring in a lot of complexity for databases. Simple read replicas or a hot spare should be sufficient for the vast majority of use cases, and the hardware is way cheaper than a few years ago, relatively speaking.

I spent a large part of the past decade and a half using and understanding all the no-sql options (including sharding with pg) and where they're better or not. At this point my advice is start with PG, grow that DB as far as real hardware will let you... if you grow to the point you need more, then you have the money to deal with your use case properly.

So few applications need to go beyond a few million simultaneous users, and if you avoid certain pitfalls, it's not that hard. Especially if you're flexible enough to leverage JSONB and a bit of denormalization for fewer joins, you'll go a very, very long way.
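
As a hedged illustration of that JSONB-plus-denormalization idea (the table and columns are made up):

    -- Hypothetical example: keep rarely-joined attributes in a JSONB column
    -- so the common read path needs no extra joins.
    CREATE TABLE orders (
        id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        customer_id bigint NOT NULL,
        created_at  timestamptz NOT NULL DEFAULT now(),
        details     jsonb NOT NULL DEFAULT '{}'::jsonb
    );

    -- One table, no joins, for the common "show my orders" query.
    SELECT id, created_at, details->>'shipping_method' AS shipping_method
    FROM orders
    WHERE customer_id = 42;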


> you can vertically scale a server so much now

And you often don't really need to.

Just last week, for some small application, I was checking the performance of some queries and had to generate random data on a dev setup. That's a dockerized Postgres (with no tuning at all) in a VM on a basic Windows laptop. I inserted enough data to represent what could maybe be there in 20 years (some tables got half a billion rows; it's a small internal app). Still no problem chugging along.
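
For context, a minimal sketch of how that kind of synthetic data can be generated (table and column names are made up):

    -- Bulk-generate half a billion synthetic rows with generate_series.
    -- (Takes a while and plenty of disk at this row count.)
    CREATE TABLE events (
        id      bigint PRIMARY KEY,
        payload text NOT NULL
    );

    INSERT INTO events (id, payload)
    SELECT g, md5(g::text)
    FROM generate_series(1, 500000000) AS g;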

It is crazy when you compare what you can do with databases now on modern hardware with how other software doesn't feel like it has benefited as much. Especially on the frontend side.


Considering the frontend today is an amazingly flexible client app platform, with flexible rendering, styling, and accessibility, compared to a VB app a few decades ago... it's kind of amazing.

The load time of my favorite websites in the late 90s was 15 seconds, because that's how long people would wait for a webpage to load at the time. Things have improved dramatically.


> accessibility

I'd like to see those accessible frontends. The majority is not usable keyboard-only.


I have done a lot of work for eLearning, govt and banking which required good usability. Also, mui is very good out of the box.

https://mui.com/material-ui/all-components/


I'm gonna doubt you when, on their own website, navigating is really bad.

Only tab and shift-tab work. Arrow keys are a bust. And the only visible shortcut is ctrl-K for the search input, and I think that's because it comes as an Algolia default.

For something better I only have to look around the page at the browser itself: underlined letters in the menus tell me which alt+letter will open each menu. Then I can navigate using arrow keys, and most menu items are shown with a key-combination shortcut.


If I could show you some of the apps I've built with it, it would probably change your mind. A few had to go through testing and validation for accessibility. That, and I'm pretty firm on keyboard navigation for all things. I had to tweak the corporate colors a little bit to meet WCAG compliance on the contrasts.

One thing that was crazy was having to go through verification for blind usability, when the core function (validating scanned documents) requires a sighted user.

I won't say MUI is perfect... it isn't... but the only real point is that you can definitely go a lot farther in a browser than you can with what's in the box with most UI component libraries.


> I'll try to get a blog post out soon!

Please do.

> It’s self-hosted on bare metal, with standby replication, normal settings, nothing “weird” there.

16TB with nothing weird is pretty impressive. Our devops team reached for Aurora way before that.

> 6 NVMe drives in raidz-1, 1024GB of memory, a 96-core AMD EPYC CPU.

Since you're self-hosted, I'm assuming you aren't on AWS. How much is this setup costing you, if you don't mind sharing?

> A single database with no partitioning (I avoid PostgreSQL partitioning as it complicates queries and weakens constraint enforcement, and IMHO does not provide many benefits outside of niche use cases).

Beautiful!


> Since you're self-hosted, I'm assuming you aren't on AWS. How much is this setup costing you, if you don't mind sharing?

About 28K euros of hardware per replica IIRC + colo costs.


Yearly, 28k Euros I presume.

Damn. I hope you make enough revenue to continue. This is pretty impressive.


No, one time + ongoing colocation costs.


So that's about 467 eur per month per server assuming a 5 year term. Anyone know what it would be on AWS with Aurora? I had a quick go with https://calculator.aws/ and ended up with a 5-figure sum per month.


I tried for fun:

https://calculator.aws/#/estimate?id=cfc9b9e8207961f777766e1...

Seems like it would be 160k USD a month.

I could not input my actual IO stats there, I was getting:

Baseline IO rate can't be more than 1000000000 per hour.


The CPU itself is around $8-10k for a top-end AMD Epyc; $15-20k for the rest of the server, including memory and storage, is probably about right. There are still $100k+ servers, but they tend to be AI equipment at this point, not the general-purpose stuff, which is sub-$30-50k now.


I mean no disrespect, but it is stunning how hard the idea of owning your own hardware is to grasp for a large percentage of the tech population.

You can just… own servers. I have five in a rack in my house. I could pay a colo a relatively small fee per month for a higher guarantee of power and connectivity. This idea also scales.


> 16TB with nothing weird is pretty impressive. Our devops team reached for Aurora way before that.

Probably depends on the usage patterns too. Our developers commit atrocities in their 'microservices' (which are not micro, or services, but that's another discussion).


I continue to find horror shows during incidents; sometimes not even the cause, merely a tangential rabbit hole I wandered down.

“Are you… are you storing images in BLOBS?”

“Yes. Is that bad?”


Your replies are really valuable and informative. Thank you so much.

Question - what is your peak utilization % like? How close are you to saturating these boxes in terms of CPU etc?


I’d say 60-70% overall cpu usage, including database, ingest workers, web app and search.

> Your replies are really valuable and informative. Thank you so much.

Thank you!


I'm also self-hosting Postgres, and the project is getting to the point where a standby would be a good idea to ensure higher availability.

Did you use any particular guide for setting up replication? Also, how do you handle failover/failback to/from the standby, please?


Not OP, but I managed a ~1TB Postgres install for years. You should use something like pgbackrest or barman to help with replication (replicas can pull WAL from your backups when catching up), backups, and failovers.

At least for pgbackrest, set up a spool directory, which allows async WAL push/fetch.
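
Not part of the parent's setup, but as a sketch, the archiver stats are a handy sanity check that async WAL push is keeping up:

    -- failed_count should stay at 0 and last_archived_time should be recent.
    SELECT archived_count,
           last_archived_wal,
           last_archived_time,
           failed_count
    FROM pg_stat_archiver;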


> 6 NVMe drives in raidz-1

Did you benchmark I/O rates with different ZFS layouts?

6 NVMe drives in mirrored pairs would probably give substantially better performance (lower latency, higher throughput).

Though you'd probably need more pairs of drives to match your current storage size. Or get higher capacity NVMe drives. :)


> It's self-hosted on bare metal, with standby replication, normal settings, nothing "weird" there.

I can build scalable data storage without a flexible scalable redundant resilient fault-tolerant available distributed containerized serverless microservice cloud-native managed k8-orchestrated virtualized load balanced auto-scaled multi-region pubsub event-based stateless quantum-ready vectorized private cloud center? I won't believe it.


+1 as I'm hoping this is sarcastic humor.


from riches to RAG.


And here I am with an ERP system at hand whose database I "should not query" with a few (3-6) joins and likely 100,000s of records in total in the larger tables, because "the DB server will go down".


PeopleSoft?


Progress


"Postgres does not scale" means that you can't just give it more machines, which is true. At some point you've got the fastest available machine, maybe you've played with the settings, and that's it. Then you can embark on the kind of manual sharding journey the article describes.

But most of the time, an RDBMS is the right tool for the job anyway, you just have to deal with it.


> "Postgres does not scale" means that you can't just give it more machines, which is true.

Well, it’s only true for writes.


It's still true for reads if you need them to be fully consistent, but yeah, read replicas are the answer sometimes.


Having some issues with your numbers, but I’m probably just missing something…

If you insert 150K rows per second, that's roughly 13 billion rows per day.

So you’re inserting 10%+ of your database size every day?

That seems weird to me. Are you pruning somewhere? If not, is your database less than a month old? I’m confused.


Well, that's why I said "ATM"; it's not a sustained rate all the time. And yes, there's a bunch of DELETEs too.


Regarding MerkleMap: Ouch! I never realized Let's Encrypt is leaking all of the subdomains I'm creating. :-/ I thought the names were giving me a bit of extra security through obscurity.

Why does it do that? I thought only revocations need to be published?


It does that because it's required to. The system is called Certificate Transparency[1] and browsers require certificates to be in CT logs to be accepted.

If you want to hide what subdomains you have you can use a wildcard certificate, though it can be a bit harder to set up.

[1]: https://developer.mozilla.org/en-US/docs/Web/Security/Certif...


Thanks for explaining!


> ATM, I insert about 150,000 rows a second, run 40,000 transactions a second, and read 4 million rows a second.

Silly question, but is this all at the same time? Are these regular daily numbers, or is that what you've benchmarked?


I used to run a bunch of Postgres nodes at a similar scale. The most painful parts (by far) were restoring to a new node and major version upgrades.

Any tricks you used for those parts?


> were restoring to a new node

ZFS send/recv or replication.

> I used to run a bunch of Postgres nodes at a similar scale. The most painful parts (by far) were restoring to a new node and major version upgrades. Any tricks you used for those parts?

Replication makes this pretty painless :)
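
Not necessarily the exact setup here, but a minimal sketch of the standard streaming-replication plumbing and how to watch it (the slot name is made up):

    -- On the primary: create a physical replication slot for the standby.
    SELECT pg_create_physical_replication_slot('standby1');

    -- Once the standby is streaming, check its state and lag.
    SELECT client_addr, state, sync_state, replay_lag
    FROM pg_stat_replication;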


So run a replica? Is there more literature on that?



You have a decent documentation portal (https://www.merklemap.com/documentation) with related articles and API examples; great work! By the way, did you use a markdown-to-HTML converter to build the developer portal, or is this just a simplified Redocly version of it?


It's MDX, with Next.js.

Thanks for the kind words!


Super interesting that you compile PG, I assume, with the same block size as ZFS! It was always on our todo list to try, but we never got around to it. If possible, what block size did you end up with? Have you tried ZFS direct I/O in 2.3.x? If so, could you share any findings? Thanks for sharing, and cool website!


I don’t think Postgres will be able to benefit from direct io? I might be wrong though!

I use Postgres with 32K BLKSZ.

I am actually using the default 128K ZFS recordsize. In a mixed workload, I found overall performance nicer than matching at 32K, and compression is way better.
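
As an aside, the compiled block size can be confirmed from any session:

    -- Reports the compile-time block size (32768 for a 32K build).
    SHOW block_size;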

> Thanks for sharing - and cool website!

Thank you!


Curious. At that scale and transaction rate, are you deleting / offloading rows to another storage solution after some amount of time? I’m assuming you’re not just letting 150,000 rows a second accumulate indefinitely.


> I insert about 150,000 rows a second

That's amazing! I would love to know whether you have done careful data modeling, indexing, etc. that allows you to get to this, and what kind of data is being inserted?


I am not optimizing too much around insertion speed. I avoid GIN, GIST and hash indexes.

The schema is nicely normalized.

I had trouble with hash indexes requiring hundreds of gigabytes of memory to rebuild.

Postgres B-Trees are painless and (very) fast.

Eg. querying one table by id (redacted):

    EXPLAIN ANALYZE SELECT * FROM table_name WHERE id = [ID_VALUE];

    Index Scan using table_name_pkey on table_name  (cost=0.71..2.93 rows=1 width=32) (actual time=0.042..0.042 rows=0 loops=1)
     Index Cond: (id =  '[ID_VALUE]'::bigint)
    Planning Time: 0.056 ms
    Execution Time: 0.052 ms

Here’s a zpool iostat 1

    # zpool iostat 1
    operations     bandwidth 
     read  write   read  write
    -----  -----  -----  -----
     148K   183K  1.23G  2.43G
     151K   180K  1.25G  2.36G
     151K   177K  1.25G  2.33G
     148K   153K  1.23G  2.13G


Have you measured _transactions_ per second? For both “insert only” and “mixed insert/update/delete?” This might be a more telling metric in terms of the stress it puts on the system. There’s a big difference between 1 transaction with 150k inserts, and 150k transactions with 1 insert each – especially when the workload shifts to include updates in the middle of those.

If you’re doing insert only, you might benefit from copying directly to disk rather than going through the typical query interface.
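
A minimal sketch of the COPY path being hinted at (table and file path are hypothetical):

    -- Server-side bulk load, bypassing per-row INSERT overhead.
    -- (psql's \copy does the same from the client side.)
    COPY measurements (device_id, recorded_at, value)
    FROM '/var/tmp/measurements.csv'
    WITH (FORMAT csv, HEADER true);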


thank you for the response and stats! very cool


150k rows/second is pretty ordinary if you do batching, especially for a machine of that size. I was able to get 250k+ row inserts on a 16vCPU on a table with 16 columns, a third of which are strings >30 bytes. Pretty sure you can push that much higher with COPY.
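
For reference, a rough sketch of that batching approach on a hypothetical table (multi-row VALUES per statement instead of one INSERT per row):

    -- One round trip inserts many rows; a few thousand tuples per
    -- statement is a common starting point.
    INSERT INTO samples (sensor_id, ts, reading)
    VALUES
        (1, now(), 0.42),
        (2, now(), 0.57),
        (3, now(), 0.13);  -- ...and so on, many tuples per batch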


that's true, you can get very far with COPY


Curious as to why it's that many inserts? The whole CT ecosystem has about 30-40,000 new certs/precerts a minute (IIRC). Love MerkleMap though!


There's a backlog I still need to process :)


Because no one else cares to ask, are you running FreeBSD? I ask because you use ZFS.


Debian!


Damn, a stud! ZFS on Linux.


Ahaha :)


Can you please share some tips and tricks for achieving such high throughput?


That's amazing! Could you elaborate more on your VACUUM planning?


Mostly, be careful with long-running transactions (hours long), and modify your autovacuum settings to be as aggressive as possible based on your workload. Be careful with the freezing threshold too.

I ran into

    multixact "members" limit exceeded
quite a bit when starting out :)
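
Two monitoring queries that go with that advice (a sketch, not necessarily what is used here):

    -- Spot long-running transactions that keep vacuum from cleaning up.
    SELECT pid, now() - xact_start AS xact_age, state, query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_age DESC;

    -- Track how close each database is to xid/multixact wraparound.
    SELECT datname,
           age(datfrozenxid) AS xid_age,
           mxid_age(datminmxid) AS multixact_age
    FROM pg_database;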


Can you recommend a good rule of thumb for autovacuum settings given a large workload like yours?


Here's mine:

- autovacuum_max_workers set to my number of tables (only do so if you have enough I/O capacity and CPU...).

- autovacuum_naptime 10s

- autovacuum_vacuum_cost_delay 1ms

- autovacuum_vacuum_cost_limit 2000

You probably should read https://www.postgresql.org/docs/current/routine-vacuuming.ht... it's pretty well written and easy to parse!
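
If you'd rather not edit postgresql.conf by hand, a sketch of applying the reloadable ones of those settings:

    -- These three take effect on reload; autovacuum_max_workers
    -- changes require a restart, so it is omitted here.
    ALTER SYSTEM SET autovacuum_naptime = '10s';
    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '1ms';
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
    SELECT pg_reload_conf();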


Do you ever need to VACUUM FULL?


It's too slow at that scale; pg_squeeze works wonders though.

I only "need" that because one of my tables requires batch deletion, and I want to reclaim the space. I need to refactor that part. Otherwise nothing like that would be required.
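
A rough way to see which tables actually accumulate enough dead tuples to need that treatment:

    -- Tables with the most dead tuples are the first candidates for
    -- pg_squeeze or more aggressive autovacuum settings.
    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 20;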


Yes, I was asking because you had mentioned deletes. Thanks for the answer, cool stuff!



