Native table partitioning relieves some of the headaches of inheritance, but mostly just the administrative ones - the foreign key issues remain.
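Roughly what I mean (PostgreSQL 10 syntax; the table names here are just for illustration):

    -- declarative partitioning routes rows just fine...
    CREATE TABLE events (
        id      bigint NOT NULL,
        created date   NOT NULL,
        payload bytea
    ) PARTITION BY RANGE (created);

    CREATE TABLE events_2017 PARTITION OF events
        FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

    -- ...but in 10 you still can't put a primary key or unique constraint
    -- on the partitioned parent, so nothing else can REFERENCE it, and
    -- foreign keys declared on the partitioned table itself are rejected too
    ALTER TABLE events ADD PRIMARY KEY (id);  -- fails on a partitioned table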
FDWs are great, but they don't participate in replication - which is a pretty big issue. Unfortunately, the "proper" way to do this would be with a custom type, but I can't find any way to write one with the extension API that wouldn't be stored in the database heap (you could create some native functions similar to the existing LOB functions, but then you run into atomicity issues again).
Honestly, this would be easy to write if the extension API supported off-heap storage of custom types - though I understand there's a ton of technical complexity in implementing such a feature.
To be clear—you want Postgres to manage the data (so that it gets replicated through Postgres replication), but you also want the data to exist somewhere other than the DB heap?
As far as I can tell, these two conditions together form a bit of a bind: for Postgres to manage the data in a way that would enable replication under MVCC, it has to have said data mmap(3)ed and indexed and have it participate in the WAL log and so forth. Just the space overhead of this management metadata will then be prohibitively costly in memory, if you have "tens of billions of files" to keep under management.
Personally, I think the pragmatic solution would be to replicate the files themselves outside of Postgres, and have Postgres just hold opaque references to their paths, which you would join through an FDW to fill a column with their contents.
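Schematically, something like this (the Multicorn wrapper class named below is hypothetical - you'd write a small Python FDW that returns a file's bytes given its path):

    -- Postgres stores and replicates only the opaque path...
    CREATE TABLE files (
        id   bigserial PRIMARY KEY,
        path text NOT NULL
    );

    -- ...while a foreign table exposes the bytes on demand
    CREATE EXTENSION multicorn;
    CREATE SERVER blob_fs FOREIGN DATA WRAPPER multicorn
        OPTIONS (wrapper 'blobfs.BlobFdw');  -- made-up wrapper class

    CREATE FOREIGN TABLE file_contents (
        path     text,
        contents bytea
    ) SERVER blob_fs;

    SELECT f.id, fc.contents
    FROM files f
    JOIN file_contents fc USING (path);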
As you say, the files are Write-Once-Read-Many—so you (presumably) aren't worried about how Isolated or Consistent writes to the file data would be without Postgres. Plain filesystems provide Atomicity and Durability guarantees all by themselves. It's less convenient to do things this way—you have to manage e.g. "WAL-E + a separate rsync" rather than just WAL-E—but it's not particularly bad in terms of the engineering trade-offs.
> To be clear—you want Postgres to manage the data (so that it gets replicated through Postgres replication), but you also want the data to exist somewhere other than the DB heap?
> As far as I can tell, these two conditions together form a bit of a bind: for Postgres to manage the data in a way that would enable replication under MVCC, it has to have said data mmap(3)ed and indexed and have it participate in the WAL log and so forth. Just the space overhead of this management metadata will then be prohibitively costly in memory, if you have "tens of billions of files" to keep under management.
There's really not too much of a technical problem here. What the sub-op is complaining about is mainly just some work that needs to be put into parts of Postgres that were designed long ago. Not entirely trivial due to compatibility concerns, but also not super hard. The first step would be a separate type of TOAST table with 64-bit IDs (we can't drop support for 32-bit IDs without making in-place upgrades impossible), along with per-table TOAST IDs. Then use the same infrastructure for the LOB piece.
Btw, Postgres doesn't mmap() data. On some systems it uses mmap(MAP_ANONYMOUS) to allocate its buffer cache, but that's just an allocation of a configurable size.
Right, I didn't mean to suggest that Postgres currently mmap(3)s all data; rather, my point was that—in the scenario where you really want to keep all this data lying around as files rather than as LOBs—Postgres would need some way to guarantee that the data sitting in those files is always in the state Postgres thinks it's in, if the corresponding file streams are going to join the same MVCC transactions as everything else. From what I can tell, that would require Postgres to do, for each of those files, whatever it already does for each table backing-file it keeps: both at the OS level (at least fopen(3) + flock(3)) and in terms of book-keeping in the system catalog.
Let me put it this way: how much overhead do you think there would be if you split your Postgres data across "ten billion" tablespaces? Because that's essentially what's being talked about here—little tablespaces containing one {oid, LOB} table, with one row, with a "storage engine" that represents that tablespace as a file.
...of course, if you do just (fix and then) use LOBs rather than insist on externally-visible files, none of that matters. :)
A filesystem may guarantee atomicity and durability on its own, but then you're in a bind trying to keep it in sync with the database - potentially ending up with either orphans or missing data. In theory you could use a 2-phase commit approach, but that comes with its own headaches.
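Roughly, the flow would be (made-up table; needs max_prepared_transactions > 0, with the actual file write happening in the application between the two phases):

    BEGIN;
    INSERT INTO blobs (id, path) VALUES (42, '/data/blobs/42');
    PREPARE TRANSACTION 'blob-42';   -- phase 1: durable on the DB side, not yet visible
    -- application writes and fsyncs /data/blobs/42 here
    COMMIT PREPARED 'blob-42';       -- phase 2: expose the row
    -- on failure: ROLLBACK PREPARED 'blob-42' and unlink the file

And the headaches are things like: if the application dies between the phases, the prepared transaction just sits there holding locks and blocking vacuum until someone resolves it.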
Consistency is actually a huge issue for DR purposes: right now, with the DB+filesystem approach, it's nearly impossible to get a clean restore from backups when needed. We could alleviate this a bit by using something like GlusterFS' geo-replication support with replicas offsite, then verifying the tail of the data on disk, but it would be nice to have a solution that just works with PostgreSQL's built-in replication (the way FILESTREAM does with MSSQL).
Edit: I should add that PostgreSQL's existing LOB support already works great with binary replication. The biggest issue with it is simply the ~4 billion object limit that comes from using 32-bit OIDs. It'd be nice if it were hidden away automatically behind a special BYTEA variant, the way FILESTREAM does it in MSSQL, but even without that it would be "good enough" were it not for that limitation.
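For context, what the existing interface looks like (table name is just an example - the oid column is the 32-bit handle in question):

    CREATE TABLE documents (
        id   bigserial PRIMARY KEY,
        name text,
        blob oid    -- handle into pg_largeobject, replicated like any other row
    );

    -- server-side import reads a path on the DB server; lo_from_bytea() works from a client
    INSERT INTO documents (name, blob)
    VALUES ('report.pdf', lo_import('/tmp/report.pdf'));

    -- read it back as bytea
    SELECT lo_get(blob) FROM documents WHERE name = 'report.pdf';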
Would the "native table partitioning" feature introduced in this release be a solution (or, at least, more optimal) for this?
Alternatively, if you want to go off-heap, what about using Foreign Data Wrappers (https://wiki.postgresql.org/wiki/Foreign_data_wrappers#File_...)? These two both sound like they might solve your problem (there's a quick file_fdw sketch after the links showing the general mechanics):
• https://github.com/ZhengYang/dc_fdw/wiki
• http://multicorn.org/
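If you haven't used FDWs before, the mechanics all look roughly like the built-in file_fdw (contrived example; the CSV path is made up):

    CREATE EXTENSION file_fdw;

    CREATE SERVER files_srv FOREIGN DATA WRAPPER file_fdw;

    -- exposes a file on the DB server as a relation you can query/join
    CREATE FOREIGN TABLE file_index (
        path text,
        size bigint
    ) SERVER files_srv
      OPTIONS (filename '/var/lib/pgsql/file_index.csv', format 'csv');

    SELECT * FROM file_index WHERE size > 1048576;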