> To be clear—you want Postgres to manage the data (so that it gets replicated through Postgres replication), but you also want the data to exist somewhere other than the DB heap?
> As far as I can tell, these two conditions together form a bit of a bind: for Postgres to manage the data in a way that would enable replication under MVCC, it has to have said data mmap(3)ed and indexed and have it participate in the WAL log and so forth. Just the space overhead of this management metadata will then be prohibitively costly in memory, if you have "tens of billions of files" to keep under management.
There's really not much of a technical problem here. What the sub-op is complaining about is mainly just work that needs to be put into parts of Postgres that were designed long ago. It's not entirely trivial, due to compatibility concerns, but it's also not super hard. The first step would be a separate type of toast table with 64-bit IDs (support for 32-bit IDs can't be dropped without making in-place upgrades impossible), along with per-table toast IDs. Then the same infrastructure could be reused for the LOB piece.
Btw, postgres doesn't mmap() data. On some systems it uses mmap(MAP_ANONYMOUS) to allocate its buffer cache, but that's just a configurable size.
Right, I didn't mean to suggest that Postgres currently mmap(3)s all data; rather, I meant to get across that, in the scenario where you really want to keep all this data lying around as files rather than as LOBs, Postgres would need some way to guarantee that the data sitting in those files is always in the state Postgres thinks it's in, if the corresponding file streams are going to join the same MVCC transactions as everything else. From what I can tell, that would require Postgres to do for each of those files whatever it does for each table backing-file it keeps: both at the OS level (at least fopen(3) + flock(3)), and in terms of bookkeeping in the system catalog.
Let me put it this way: how much overhead do you think there would be if you split your Postgres data across "ten billion" tablespaces? Because that's essentially what's being talked about here—little tablespaces containing one {oid, LOB} table, with one row, with a "storage engine" that represents that tablespace as a file.
...of course, if you do just (fix and then) use LOBs rather than insist on externally-visible files, none of that matters. :)