Hacker News new | past | comments | ask | show | jobs | submit login
PostgreSQL reconsiders its process-based model (lwn.net)
825 points by todsacerdoti on June 19, 2023 | hide | past | favorite | 372 comments



    For the record, I think this will be a disaster.  There is far too much
    code that will get broken, largely silently, and much of it is not
    under our control.
        regards, tom lane
(via https://lwn.net/ml/pgsql-hackers/4178104.1685978307@sss.pgh....)

If Tom Lane says it will be a disaster, I believe it will be a disaster.


Reminds me of PHP 6...

For those who don't follow PHP closely - that version was an attempted refactor of the string implementation which essentially shut down nearly all work on PHP for a decade, stagnating the language until it became pretty terrible compared to other options. They finally gave up and started work on PHP 7 which uses the (perfectly good) PHP 5 strings.

Ten years of wasted time by the best internal PHP developers crippled the project - I'm amazed it survived at all.


On the other hand, there's also the case of the Lunar Module guidance software that was hard-coded to run exactly every two seconds. If the previous subroutine call was still running when the next one was due, the previous one was harshly terminated (with weird side effects).

One of the main programmers suggested making it so that the next guidance routine wouldn't run until the previous one was done. This would make the code less sensitive to race conditions and allow more useful functionality for the pilots (who were the actual users and did seem to want it). However everyone assumed the two-second constant was implicitly embedded everywhere.

It wasn't -- only in a few places -- and with that fixed the code got more general and the proof of concept ran better than ever in about every simulator available. The amount of control it gave pilots was years ahead of the curve. But it never got a chance to fly on a real mission because what was there was "good enough" and nobody bothered to try.

In our combined comments there's a lesson about growing experiments and figuring out how to achieve failure quickly.


Things You Should Never Do

https://www.joelonsoftware.com/2000/04/06/things-you-should-...

An oldie but a goodie


I think this is a great article that takes a maximalist point and that’s its flaw.

You should rewrite code only when the cost of adding a new feature (one that is actually necessary) to the old codebase becomes comparable to designing your entire system from scratch to allow for that feature to be added easily. That is to say that the cost of the rewrite should become comparable to the cost of continuing development. I have been a part of a couple of rewrites like that, one of them quite complex, and yes they were warranted and yes they worked.

But having said that you should absolutely be conservative with rewriting code. It’s a bad habit to always jump to a rewrite.


I think it’s very dependent on how you use words like “rewrite” or “refactor”. The point the author makes about the two page function, and all the bug-fixes (lessons learned) makes sense only if you “rewrite” from scratch without looking at the history. You can absolutely “rewrite” the function in a manner that is “refactoring”, but will often get called “rewrite” in the real world. This may be because “refactor” is sort of this English CS term that doesn’t have a real translation or usage in many languages and “rewrite” is sort of universal for changing text, but in CS is sort of “rebuilding” things.

I don’t think you necessarily need to be conservative about rewriting things. We do it all the time in fact. We build something to get it out there and see the usage, and then we build it better and then we do it again. Which often involves a lot of “rewriting” but thanks to principles like SOLID’s single responsibility makes this rather easy to both do and maintain (we write a lot of semi-functional code and try to avoid using OOP unless necessary, so we don’t really use all the parts of SOLID religiously).

I do agree that it’s never a good idea to get into things with the mind-set of “we can do this better if we start from scratch” because you can’t.


There's currently a trend towards shitting on microservices-everything, imo largely justified. But missing from that is that identifying a logical feature and moving it to a microservice is one of the safer ways to begin a gradual rewrite of a critical system. And usually possible to get cross-dept buyin for various not always wholesome reasons. It may not always be the best technical solution but it's often the best political one when a rewrite is necessary.


>identifying a logical feature and moving it to a microservice is one of the safer ways to begin a gradual rewrite of a critical system

Why not identify that same logical feature and move it into a library. How does a Microservice add value here?

Identifying, extracting and adding tests to logical features has been the sane way to rewrite software for ages. Michael feathers even wrote a book about it [1]. This ship of theseus approach works because it's incremental and allows for a complete rewrite without ever having non-functional software in between.

Adding a REST/networking/orchestration boundary to make it a Microservice just for the sake of extracting it adds a lot of complexity for no gain.

Microservice can be the right architecture, but not if all you want is to extract a library.

[1] https://understandlegacycode.com/blog/key-points-of-working-...


The issue here isn't "rewriting" per se but "stopping development".

You shouldn't stop development on your important products.

Letting a couple of your talented programmers loose on a greenfield reimplementation is a perfectly sane strategic move.

Stopping development on important products because you are 100% certain that the reimplementation will be successful by $DEADLINE is a foolish gamble.


The big problem there is that the people you are letting loose on the alternative, are lost from the original, so O loses steam that A gains. You still have to produce bug fixes and features to _both_ O and A to keep them in sync. So you essentially have a doubled required production rate to be delivered using the same staff.

So in order for there to be a net gain, the gang working on the alternative have to be able to find such big wins as to being neigh impossible.

This is a very very hard problem in our domain. 99% of the time, we have to simply resist the urge to _just rewrite the sucker_. No! Don't do it! (And this is incredibly hard because we all want to.)


Getting some Mythical Man Month vibes here. Productivity isn't a zero-sum game.


I'm saying that the productivity gain would have to be incredibly large since it has to encompass a doubled output of features and bug fixes for a net zero change.

Say you have product A with features P, Q, R, and S; writing product B has to reproduce P, Q, R, and S, plus X and Y that is currently produced by the A team. On top of that, it has to fix (conceptual) bugs within P, Q, R, and S. All this is to be done by the new, crack, team that aims to make it so that: cost-of-development(B) < cost-of-development(A).

But the point is that the difference in magnitude of cost-of-development(B) and cost-of-development(A) has to be rather large considering the amount of work needed to have a return on that investment at all.


Isn’t that how you end up with Python 2 and 3 though?


Yes, that was a rough migration process, but the long-term result is we have an improved language and growing community instead of Python going the way of PHP and Perl.

Between the 3 P's, Python's strategic decisions in 2000s were clearly the most successful.

And it wasn't a total rewrite.


Python3 changes were many "little" things; some more fundamental than other (unicode str). So I guess they were able to split the work in tiny pieces and, ultimately, were able to manage the project...


The problem is if it snowballs


Any project, whether it's a "rewrite" or not, can succumb to scope creep and poor project management.


> Well, yes. They did. They did it by making the single worst strategic mistake that any software company can make:

> They decided to rewrite the code from scratch.

Absolutely not what's being proposed for Postgres.


Process isolation affects so many things in C. The strategy change is going to require changes to so many modules that it will either be a re-write or buggy.

In practical terms, if every line needs to be audited and updated, it is a re-write


What makes you think that it will require that many changes? There will be some widespread mechanical changes (which can be verified to be complete with a bit of low level work, like a script using objdump/nm to look for non-TLS mutable variables) and some areas changing more heavily (e.g. connection establishment, crash detection, signal handling, minor details of the locking code). But large portions of the code won't need to change. Note that we/postgres already shares a lot of state across processes.


I'm not the person you asked and I don't have any particular knowledge of postgres internals.

Experience with other systems has taught me that in a system that's been in active use and development for decades, entanglement will be deep, subtle, and pervasive. If this isn't true of postgres then it's an absolute freak anomaly of a codebase. It is that in other ways, so it's possible.

But the article mentions there being thousands of global variables. And Tom Lane himself says he considers it untenable for exactly this reason. That's a very good reason to think that it will require that many changes imo.


> that many changes

The 'that many changes' in question is a complete rewrite. Many changes across many files, yes, but nothing even approaching a rewrite.

> I don't have any particular knowledge of postgres

Judging by their bio, the person you're replying to does.


> either be a re-write or buggy

A large refactor at best. It will touch lots of parts of the code base, but the vast majority of the source code would remain intact. Otherwise they could just Rewrite it in Rust™ while they’re at it

> if every line needs to be audited and updated, it is a re-write

I’m not sure why you believe every line needs to be updated. Most code is thread agnostic.


The lesson that "cruft is problems someone solved before you" is unfortunately entirely lost on most devs today


Love this article. Completely changed the way I think about certain projects.


I've used PHP in the past (PHP 4 and 5), as well as some simple templated projects in PHP 7. I try to keep up on news with what is happening in the PHP world, and it's difficult because of the hate for the language. Is the solution to Unicode strings still to just use the "mb_*" functions?

I got my real professional start using PHP, and have built even financial systems in the language (since ported to .NET 6 for my ease of maintenance, and better number handling). I'm still very interested in the language itself, in case I ever have the need to freelance or provide a solution to a client that can't afford what I can build in .NET (although to be honest, at this point I'm roughly able to code at the same speed in .NET as in PHP, but with the added type-safety, although I know PHP has really stepped up in providing this).


I believe so - most (all?) string functions have an mb_ equivalent, for working on multibyte strings.

Regular PHP strings are actually pretty great, since you can treat them like byte arrays. Fun fact: PHPs streaming API has an “in-memory” option and it’s… just a string under the hood.

Just don’t forget to use multibyte functions when you’re handling things like user input.


I have the "Professional PHP6" book which I feel like should be a collectors item or something.

Weird book IMO, because it has a lot of content that's just about general software development, rather than anything to do with PHP specifically, or the theoretical PHP6 APIs in particular.


PHP used to be the first computer language learned by people wanting to create a scripted web page. This was more true in the 90s but maybe it stuck. So it would be OK to add some general guidance about writing software and organizing projects.


Implying it had ever not been terrible compared to other options


What was the specifics about the string refactor implementation? I can’t find anything about it online


    I don't expect you or others to buy into any particular code change at 
    this point, or to contribute time into it. Just to accept that it's a 
    worthwhile goal. If the implementation turns out to be a disaster, then 
    it won't be accepted, of course. But I'm optimistic.
The reply is much more reasonable than this blanket assertion of a disaster.


As an outsider it doesn't sound like something a few people could spin off in a branch in a couple months and see how code review goes. They're talking about doing it over multiple (yearly?) releases. It seems like it'll take a lot of expert attention, which won't be available for other work and the changes themselves will impact all other ongoing work.

I'm not trying to naysay it per se, bc again I don't have technical knowledge of this codebase. But that's exactly the sort of scenario that can cause a large project to splinter or stall for years. Talking about "the implementation" absent the context that would be necessary to create that implementation seems naively optimistic, or at worst irresponsible.


You are talking about implementation, the OP was talking about raising the concept with interested parties and seeing whether it is worth even starting to think about it.

They could fork, they could add threading to some sub systems and roll it out over several versions.

I don't know enough about the code but, of course, it is a hard problem but the solution might be to build it from the ground up as a threaded system, using the skills learned over 30 years and taking the hit on the rebuild instead of reworking what is there.

I am most interested because I didn't realise there was a performance problem in the first place.


Am I going crazy, or has the obvious implementation of such a change been missed on people? If they were proposing taking a multi-threaded app and splitting it into a multi-process one, I would predict they would find a hell of a lot of unexpected or unknown implicit communication between threads, which would be a nightmare to untangle.

Going the other way, there is an extremely well understood interface between all the processes which run in isolation: shared memory. Nearly by definition this must be well coordinated between the processes.

So the first step in moving to a multi-threaded implementation would be to change nearly nothing about each process, and then just run each process in its own pthread, keeping all the shared memory ‘n all.

You would expect performance to be about the same, maybe a little better with the reduces TLB churn, but the architecture is basically unchanged. At that point, you can start to look at what are more appropriate communication/synchronisation mechanisms now you’re working in the same address space.

I just don’t understand why so many people seem to think this requires an enormous rewrite - having developed as a multi-process system means you’ve had to make so much of the problematic things explicit and control for them, and none of these threads would know anything at all about each other’s internals.


This should be considered a research effort, assuming it will be a complete rewrite. In light of that, you should not draw down resources from the established code base to work on it.

Ignoring the above, first state the explicit requirements driving this change and let people weigh in on those. This sounds like a geeky dev itch.


Maybe a better option would be finding a team to create nugres, aka a fork for this and other experiments. So that mainline remains stable.


There are several forks of PostgreSQL, in various levels of license, additional features and activity. However, maintaining a fork in addition to a main project is inherently more expensive than maintaining just a single project, so adding features to new major releases of the main project is generally preferred over forking every release into its own, newly named, project. After all, that is what we have major (feature) releases and stabalization windows (beta releases) for.


This won't work well for a multiyear project.. Either you have to stall the release process, divide it into smaller parts or fork.


Yeah.

Without being familiar with the Postgres source, this seems to be what I call a "somersault problem": hard to break down into sub-goals. I have heard that the Postgres codebase is solid which makes it easier but it's still mature and highly complex. It doesn't sound feasible to me.

https://kristiandupont.medium.com/somersault-problems-69c478...


The original post does describe several sub-problems. The group could first chip away at global state, signals, libraries. They can do this before changing the process model in any way.


Good point.


That's an awful message with the only sensible reply.


Heikki Linnakangas has a good understanding of Postgres as well however. We all want Postgres to be competitive with numbers of connections, don't we?


Feel like the PostgreSQL Core Team should just build a new database from scratch using what they have learned from experience instead of attempting such a fundamental architectural migration. It would give them more freedom to change things also. Call it "postgendb" and provide a data migrator.


That's a great idea. I've been considering whether or not to use Cockroach Db at work, and I love the fact that it's distributed from the get go.

Why not work on something like that instead of changing something that works? Especially since they the process model really only runs into trouble on large systems.


He is right. Such rewrites cause a lot of problems if your compiler doesn't help you with avoiding data races.

But there is another way.


> But there is another way.

Ok?


The person probably implied that Postgres should switch to another toolchain that guarantees more things at compile time, so probably Rust.


If the existing code is old-school enough to use thousands of global variables in a thread-unsafe way, seems like changing it enough to compile as safe Rust code would push the "non-trivial" envelope pretty far.


You can take a chunk of code and just rewrite it in Rust. You'll learn a lot quickly by this.


The boundaries within database code are not clear. There are too many interlocking parts to take a nontrivial chunk and rewrite it Rust.


It’s sort of like the inverse of the Matrix when Neo learns kung fu. You realize that you actually don’t know how to program :)


Microsoft SQL Server has SQLOS which is another way [0].

[0] https://www.thegeekdiary.com/what-is-sql-server-operating-sy...


I think it's meant to imply the solution given in their username ("idiomatic Rust").


> I think it's meant to imply the solution given in their username ("idiomatic Rust").

I think "Idiom: a tic (Rust)" can also fit if I squint hard enough and decide it looks like a definition from an online dictionary :-)


Don’t mind the gimmick gallery (username).


Indeed, Zig is a nice language for this


Sorry if I offend anybody, but this sounds like such a bad idea. I have been running various versions of postgres in production for 15 years with thousands of processes on super beefy machines, and I can tell you without a doubt that sometimes those processes crash - specially if you are running any of the extensions. Nevertheless, Postgres has 99% of the time proven to be resilient. The idea that a bad client can bring the whole cluster down because it hit a bug sounds scary. Every try creating a spatial index on thousands/millions of records that have nasty overly complex or badly digitized geometries? Sadly, crashes are part of that workflow, and changing this from process to threading would mean all the other clients also crashing and cutting connections. This as a potential problem because I want to avoid context switching overhead or cache misses, no thanks.


However, it's already the case that if a postgres process crashes, the whole cluster gets restarted. I've occasionally seen this message:

    WARNING: terminating connection because of crash of another server process
    DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
    HINT: In a moment you should be able to reconnect to the database and repeat your command.
    LOG: all server processes terminated; reinitializing


> However, it's already the case that if a postgres process crashes, the whole cluster gets restarted. I've occasionally seen this message:

Sure, but the blast radius of corruption is limited to that shared memory, not all the memory of all the processes. You can at least use the fact that a process has crashed to ensure that the corruption doesn't spread.

(This is why it restarts: there is no guarantee that the shared memory is valid, so the other processes are stopped before they attempt to use that potentially invalid memory)

With threads, all memory is shared memory. A single thread that crashes can make other threads data invalid before the detection of the crash.


yes, but postmaster is still running to roll back the transaction. If you crash a single multi-threaded process, you may lose postmaster as well and then sadness would ensue


The threaded design wouldn't necessarily be single-process, it would just not have 1 process for every connection. Things like crash detection could still be handled in a separate process. The reason to use threading in most cases is to reduce communication and switching overhead, but for low-traffic backends like a crash handler the overhead of it being a process is quite limited - when it gets triggered context switching overhead is the least of your problems.


Seconded. For instance, Firefox' crash reporter has always been a separate process, even at the time Firefox was mostly single-process, single-threaded. Last time I checked, this was still the case.


If you read the thread you’d see the discussion includes still having e.g. postmaster as a separate process.


PostgreSQL can recover from abruptly aborted transactions (think "pulled the power cord") by replaying the journal. This is not going to change anyway.


Transaction roll back is a part of the WAL. Databases write to the disk an intent to change things, what should be changed, and a "commit" of the change when finished so that all changes happen as a unit. If the DB process is interrupted during that log write then all changes associated with that transaction are rolled back.

Threaded vs process won't affect that.


Running the whole DBMS as a bunch of threads in single process changes how fast is the recovery from some kind of temporary inconsistency. In the ideal world, this should not happen, but in reality it does and you do not want to bring the whole thing down because of some superficial data corruption.

On the other hand, all cases of fixable corrupted data in PostgreSQL I have seen were result of somebody doing something totally dumb (rsyncing live cluster, even between architectures), while on InnoDB it seems to happen somewhat randomly without any obvious reason of somebody doing stupid things.


We would still have a separate process doing that part of postmaster's work.


You can still have a master control process separate from the client connections.


Restart on crash doesn't sound that difficult to do.


Reading your comment makes me think it is not only a good idea, it is a necessity.

Relying on crashing as a bug recovery system is a good idea? Crashing is just part of the workflow? That's insane, and a good argument against PostgreSQL in any production system.

It is possible PostgreSQL doesn't migrate to a thread based model, and I am not arguing they should.

But debug and patch the causes of these crashes? Absolutely yes, and the sooner, the better.


A database has to handle situations outside its control, e.g. someone cutting the power to the server. That should not result in a corrupted database, and with Postgres it doesn't.

The fundamental problem is that when you're sharing memory, you cannot safely just stop a single process when encountering an unexpected error. You do not know the current state of your shared data, and if it could lead to further corruption. So restarting everything is the only safe choice in this case.


Cars are designed with airbags?!

Like, they are supposed to crash?!?


> Relying on crashing as a bug recovery system is a good idea? Crashing is just part of the workflow? That's insane

Erlang users don't seem to agree with you


We do fix crashes etc, even if the postgres manages to restart.

I think the post upthread references an out-of-core extension we don't control, which in turn depends on many external libraries it doesn't control either...


It's all about trade off.

Building a database which is never gonna crash might be possible but at what cost? Can you name any single real world system archived that? Also, there can be a regression. More tests? Sure but again, at what cost?

While we are trying to get there, having a crash proof architecture is also a very practical approach.


We don't want stuff to crash. But we also want data integrity to be maintained. We also want things to work. In a world with extensions written in C to support a lot of cool things with Postgres, you want to walk and chew bubblegum on this front.

Though to your point, a C extension can totally destroy your data in other ways, and there are likely ways to add more barriers. And hey, we should fix bugs!


They are still debugging and patching the causes. The crash detection is just to try and prevent a single bug from bringing down the whole system.


Is the actual number you got 99%? Seems low to me but I don’t really know about Postgres. That’s 3 and a half days of downtime per year, or an hour and a half per week.


Well, hour and half per week is the amount of downtime that you need for modestly sized database (units of TB) accessed by legacy clients that have ridiculously long running transactions that interfere with autovacuum.


Also, reducing context switching overhead (or any other CPU overhead) is probably not gonna fix the garbage I/O performance.


I'm honestly surprised it took them so long to reach this conclusion.

> That idea quickly loses its appeal, though, when one considers trying to create and maintain a 2,000-member structure, so the project is unlikely to go this way.

As repulsive as this might sound at first, I've seen structures of hundreds of fields work fine if the hierarchy inside them is well organized and they're not just flat. Still, I have no real knowledge of the complexity of the code and wish the Postgres devs all the luck in the world to get this working smoothly.


> I'm honestly surprised it took them so long to reach this conclusion.

I'm not. You can get a long way with conventional IPC, and OS processes provide a lot of value. For most PostgreSQL instances the TLB flush penalty is at least 3rd or 4th on the list of performance concerns, far below prevailing storage and network bottlenecks.

I share the concerns cited in this LWN story. Reworking this massive code base around multithreading carries a large amount of risk. PostgreSQL developers will have to level up substantially to pull it off.

A PostgreSQL endorsed "second-system" with the (likely impossible, but close enough that it wouldn't matter) goal of 100% client compatibility could be a better approach. Adopting a memory safe language would make this both tractable and attractive (to both developers and users.) The home truth is that any "new process model" effort would actually play out exactly this way, so why not be deliberate about it?


From what I gather postgres isn't doing conventional IPC but instead it uses shared memory, which means the same mechanism threads use but with way higher complexity


As does Oracle, and others. I'm aware.

IPC, to me, includes the conventional shared memory resources (memory segments, locks, semaphores, condition variable, etc.) used by these systems: resources acquired by processes for the purpose of communication with other processes.

I get it though. The most general concept of shared memory is not coupled to an OS "process." You made me question whether my concept of term IPC was valid, however. So what does one do when a question appears? Stop thinking immediately and consult a language model!

Q: Is shared memory considered a form of interprocess communication?

GPT-4: Yes, shared memory is indeed considered a form of interprocess communication (IPC). It's one of the several mechanisms provided by an operating system to allow processes to share and exchange data.

...

Why does citing ChatGPT make me feel so ugly inside?


I always understood IPC, "interprocess communication", in general sense, as anything and everything that can be used by processes to communicate with each other - of course with a narrowing provision that common use of the term refers to those means that are typically used for that purpose, are relatively efficient, and the process in question run on the same machine.

In that view, I always saw shared memory as IPC, in that it is a tool commonly used to exchange data between processes, but of course it is not strictly tied to any process in particular. This is similar to files, which if you squint are a form of IPC too, and are also not tied to any specific process.

> Why does citing ChatGPT make me feel so ugly inside?

That's probably because, in cases like this, it's not much different to stating it yourself, but is more noisy.


Without a credible source to reconfirm what ChatGPT said, one can’t really assume what ChatGPT says is correct.


> Why does citing ChatGPT make me feel so ugly inside?

Its the modern let me Google that for you. Just like people don't care what the #1 result on Google is, they also don't care what ChatGPT has to say about it. If they did, they'd ask it themselves.


Not necessarily. Man 3 shmem if you want a journey back to some bad ideas.


What do you think IPC is?


Would this basically be a new front end? Like the part that handles sockets and input?

Or more if a rewrite of subsystems? Like the query planner or storage engine etc?


Both, I'd imagine.

With regard to client compatibility there are related precedents for this already; the PostgreSQL wire protocol has emerged as a de facto standard. Cockroachdb and ClickHouse are two examples that come to mind.


Would something like opt-in sharing of pages between processes that oracle has been trying to get into kernel be the correct option: https://lwn.net/ml/linux-kernel/cover.1682453344.git.khalid....

Postmaster would just share the already shared memory between processes (containing also the locks). That explicit part of memory would opt-in to thread -like sharing and thus get faster/less tlb switching and lower memory usage. While all the rest of the state would still be per-process and safe.

tl;dr super share the existing shared memory area with kernel patch

All operating systems not supporting it would keep working as is.


Yes, it would mitigate the TLB problem. Interesting that Oracle is also looking to solve this problem, but not by multithreading the Oracle RDBMS.


Yeah. I think as a straightforward, easily correct transition from 2000 globals, a giant structure isn't an awful idea. It's not like the globals were organized before! You're just making the ambient state (awful as it is) explicit.


We did this with a project I worked on. I came on after the code was mature.

While we didn't have 2000 globals, we did have a non-trivial amount, spread over about 300kLOC of C++.

We started by just stuffing them into a "context" struct, and every function that accessed a global thus needed to take a context instance as a new parameter. This was tedious but easy.

However the upside was that this highlighted poor architecture. Over time we refactored those bits and the main context struct shrunk significantly.

The result was better and more modular code, and overall well worth the effort in our case, in my opinion.


> I think as a straightforward, easily correct transition from 2000 globals, a giant structure isn't an awful idea.

Agree.

> It's not like the globals were organized before!

Using a struct with 2000 fields loses some encapsulation.

When a global is defined in a ".c" file (and not exported via a ".h" file), it can only be accessed in that one ".c" file, sort of like a "private" field in a class.

Switching to a single struct would mean that all globals can be accessed by all code.

There's probably a way to define things that allows you to regain some encapsulation, though. For example, some spin on the opaque type pattern: https://stackoverflow.com/a/29121847/163832


No that is what a static in a .c file is for.

A plain global can be accessed from other compiled units - agreed with no .h entry it is my=uch more error prone e.g. you don't know the type but the variables name is exposed to other objects


Wouldn't those statics also be slated for removal with this change?


At most they'd be determined to be read only constants that are inlined during constant folding. This includes most integral sized / typed scalar values that fit into registers for the most part, and nothing you've taken the address of either - those remain as static data.


I think there might be a terminology mix-up here. In C, a global variable with the `static` keyword is is still mutable. So it typically can't be constant-folded/inlined.

The `static` modifier in that context just means that the symbol is not exported, so other ".c" files can't access it.


A static variable in C is mutable in the same sense that a local variable is, but since it's not visible outside the current compilation unit the optimizer is allowed to observe that it's never actually modified or published and constant fold it away.

Check out the generated assembly for this simple program, notice that kBase is folded even though it's not marked const: https://godbolt.org/z/h45vYo5x5


It is also possible for a link-time optimizer to observe that a non-static global variable is never modified and optimize that away too.

But the Postgres mailing list is talking about 2000 global variables being a hurdle to multi-threading. I doubt they just didn't realize that most of them can be optimized into constants.


Yea. Just about none of them could be optimized to constants because, uh, they're not constant. We're not perfect, but we do add const etc to TU level statics/globals that are actually read only. And if they are actually read only, we don't care about them in the context of threading anyway, since they wouldn't need any different behaviour anyway.


Exactly, if you're now forced to put everything in one place you're forced to acknowledge and understand the complexity of your state, and might have incentives to simplify it.


Here's MySQL's all-session-globals-in-one-place-class: https://github.com/mysql/mysql-server/blob/8.0/sql/sql_class...

I believe I can safely say that nobody acknowledges and understands the complexity of all state within that class, and that whatever incentives there may be to simplify it are not enough for that to actually happen.

(It ends on line 4692)


Right but that would still be true if they were globals instead. Putting all the globals in a class doesn't make any difference to how much state you have.


> Putting all the globals in a class doesn't make any difference to how much state you have.

I didn't make any claims about the _amount_ of state. My claim was that “you're forced to acknowledge and understand the complexity of your state” (i.e., moving it all together in one place helps understanding the state) is plain-out wrong.


It's not wrong. Obviously putting it all in one place makes you consider just how much of it you have, rather than having it hidden away all over your code.


Yes, it’s the most pragmatic and it’s only “awful” because it makes the actual problem visible. And would likely encourage slowly refactoring code to handle its state in a more sane way, until you’re only left with the really gnarly stuff, which shouldn’t be too much anymore and you can put them in individual thread local storages.

It’s an easy transition path.


I think my bigger fear is around security. A process per connection keeps things pretty secure for that connection regardless of what the global variables are doing (somewhat hard to mess that up with no concurrency going on in a process).

Merge all that into one process with many threads and it becomes a nightmare problem to ensure some random addon didn't decide to change a global var mid processing (which causes wrong data to be read).


All postgres processes run under the same system user and all the access checking happens completely in userspace.


Access checking, yes, but the scope of memory corruption does increase unavoidably, given the main thing the pgsql-hackers investigating threads want: one virtual memory context when toggling between concurrent work.

Of course, there's a huge amount of shared space already, so a willful corruption can already do virtually anything. But, more is more.


I've never really been limited by CPU when running postgres (few TB instances). The bottleneck is always IO. Do others have different experience? Plus there's elegance and a feeling of being in control when you know query is associated with specific process which you can deal with and monitor just like any other process.

But I'm very much clueless about internals, so this is a question rather than an opinion.


I see postgres become CPU bound regularly: Lots of hash joins, copy from or to CSV, index or materialized view rebuild. Postgis eats CPU. Tds_fdw tends to spend a lot of time doing charset conversion, more than actually networking to mssql.

I was surprised when starting with postgres. Then again, I have smaller databases (A few TB) and the cache hit ratio tends to be about 95%. Combine that with SSDs, and it becomes understandable.

Even so, I am wary of this change. Postgres is very reliable, and I have no problem throwing some extra hardware to it in return. But these people have proven they know what they are doing, so I'll go with their opinion.


I've also definitely seen a lot of CPU bounding on postgres.


It's not just CPU - memory usage is also higher. In particular, idle connections still consume signficant memory, and this is why PostgreSQL has so much lower connection limits than eg. MySQL. Pooling can help in some cases, but pooling also breaks some important PostgreSQL features (like prepared statements...) since poolers generally can't preserve session state. Other features (eg. notify) are just incompatible with pooling. And pooling cannot help with connections that are idle but inside a transaction.

That said, many of these things are solvable without a full switch to a threaded model (eg. by having pooling built-in and session-state-aware).


> solvable without a full switch to a threaded model (eg. by having pooling built-in and session-state-aware).

Yeeeeesssss, but solving that is solving the hardest part of switching to a threaded model. It requires the team to come terms with the global state and encapsulating session state in a non-global struct.


> That said, many of these things are solvable without a full switch to a threaded model (eg. by having pooling built-in and session-state-aware).

The thing is that that's a lot easier with threads. Much of the session state lives in process private memory (prepared statements etc), and it can't be statically sized ahead of time. If you move all that state into dynamically allocated shared memory, you've basically paid all the price for threading already, except you can't use any tooling for threads.


I've generally had buffer-cache hit rates in the 99.9% range, which ends up being minimal read I/O. (This is on AWS Aurora, where these bo disk cache and so shared_buffers is the primary cache, but an equivalent measure for vanilla postgres exists.)

In those scenarios,there's very little read I/O. CPU is the primary bottleneck. That's why we run up as many as 10 Aurora readers (autoscaled with traffic).


>I've never really been limited by CPU when running postgres (few TB instances). The bottleneck is always IO.

Throw a few NVMe drives at it and it might.


Throw a ridiculous amount of RAM at it is more correct assessment. NVMe reads are still an “I/O” and that is slow. And for at least 10 years buying enough RAM to have all off the interesting parts of OLTP psql database either in shared_buffers or in the OS-level buffer cache is completely feasible.


> NVMe reads are still an “I/O” and that is slow

It's orders of magnitude faster than SAS/SATA SSDs and you can throw 10 of them into 1U server. It's nowhere near "slow" and still easy enough to be CPU bottlenecked before you get IO bottlenecked.

But yes, pair of 1TB RAM servers gotta cost you less than half year's worth of developer salary


an array of modern SSDs can get to a similar bandwidth to RAM, albeit with significantly worse latency still. It's not that hard to push the bottleneck elsewhere in a lot of workloads. High performance fileservers, for example, need pretty beefy CPUs to keep up.


Depends on your queries.

If you push a lot of work into the database including JSON and have a lot of buffer memory...CPU can easily be limiting.


With modern SSDs that can push 1M IOPs+, you can get into a situation where I/O latency starts to become a problem, but in my experience, they far outpace what the CPU can do. Even the I/O stack can be optimized further in some of these cases, but often it comes with the trade off of shifting more work into the CPU.


Postgres uses lots of cpu and memory if you have many connections and especially clients that come and go frequently. Pooling and bouncers help with that. That experience should better come out of the box, not by bolting on tools around it.


> I'm honestly surprised it took them so long to reach this conclusion.

On the contrary, it's been discussed for ages. But it's a huge change, with only modest advantages.

I'm skeptical of the ROI to be honest. Not that is doesn't have value, but that it has more value than the effort.


> it's a huge change, with only modest advantages

+significant and unknown set of new problems, including new bugs.

This reminds me of the time they lifted entire streets in Chicago by 14 feet to address new urban requirements. Chicago, we can safely assume, did not have the option of just starting a brand new city a few miles away.

The interesting question here is should a system design that works quite well upto a certain scale be abandoned in order to extend its market reach.


Yeah, and you will run headlong into other unforseen real world issues. You may never reach the performance goals.


Also, even if a 2k-member structure is obnoxious, consider the alternative - having to think about and manage 2k global variables is probably even worse!


Each set of globals is in a module it relates to, not in some central file where everything has to be in one struct.

If anything, it's probably easier to understand.


I think this is a situation where a message-passing Actor-based model would do well. Maybe pass variable updates to a single writer process/thread through channels or a queue.

Years ago I wrote an algorithmic trader in Python (and Cython for the hotspots) using Multiprocessing and I was able to get away with a lot using that approach. I had one process receiving websocket updates from the exchange, another process writing them to an order book that used a custom data structure, and multiple other processes reading from that data structure. Ran well enough that trade decisions could be made in a few thousand nanoseconds on an average EC2 instance. Not sure what their latency requirements are, though I imagine they may need to be faster.

Obviously mutexes are the bottleneck for them at this point, and while my idea might be a bit slower than a low-load situation, perhaps it would be faster when you start getting to higher load.


That would most likely be several times slower than current model


I think the Actor model is fine if you start there, but I can't imagine incrementally adopting it in a large, preexisting code base.


> I'm honestly surprised it took them so long to reach this conclusion.

Oracle also uses a process model on Linux. At some point (I think starting with 12.x), it can now be configured on Linux to use a threaded model, but the default is still a process-per-connection model.

Why does everybody think it's a bad thing in Postgres, but nobody thinks it's a bad thing in Oracle.


Well for one Postgress is open source and widely used. So anyone can pick it up and look at its internals, that's not the case for Oracle DB .


This is how I made my fork of libtcc lock-free.

Mainline has a lock so that all backends can use global variables, but only one instance can do codegen at a time.

It was a giant refactoring Especially fun was when multiple compilation units used the same static variable name, but it all worked in the end.


Out of curiosity, where is this fork? Sounds very interesting.


https://github.com/rsaxvc/tinycc-multithreaded

This is the multi-threaded compiler: https://github.com/rsaxvc/tcc-swarm

With the multi-threaded tcc above it scales about as well as multiprocess. With mainline it doesn't scale well at all.

So far I haven't gotten around to reusing anything across libtcc handles/instances, but would eventually like to share mmap()'d headers across instances, as well as cache include paths, and take invocation arguments through stdin one compilation unit per line.


I don't see the problem. All variables are either set in config or at runtime and then for every new query they are read and used by PostgreSQL (at least this is my understanding).

Regarding the threading issue, I think you can do the connections part multithreaded instead of one process per connection and still use IPC between this and postmaster. Because of the way PostgreSQL currently works, seems feasible to move parts one by one into a threaded model and instead of tens/hundreds of processes you can have just a few and a lot of threads.

Honestly, they should prototype it and see how it looks like and then decide on the way forward.


I don't get it. How is a 2000-member structure any different from having 2000 global variables? How is maintaining the struct possibly harder than maintaining the globals? Refactoring globals to struct members is semantically nearly identical, it may as well just be a mechanical, cosmetic change, while also giving the possibility to move to a threaded architecture.


Because global variables can be confined to individual cpp files, exclusively visible in that compilation unit. It makes them far easier to reason with than hoisting them to the "global and globally visible" option if you just use a gargantuan struct. Which is why a more invasive refactor might be required.


What if the global variable has a greater scope than just a single TU? For simple variables of limited scope this approach would work but for more complex variables that are impacting multiple "modules" in the code it would introduce yet another code design problem to solve.


Just use thread local variables.

I abuse them for ridiculous things.


Yeah, I was really into that before there was even a cross-compiler/cross-platform syntax for declaring TLS values in C++ but have since “upgraded” to avoiding TLS altogether where possible. The quality of the implementations vary greatly from compiler and platform to compiler and platform, you run into weird issues with thread_at exit if they’re not primitive types, they run afoul of any fibers/coroutines/etc that have since become extremely prevalent, and a few other things.


Thread locals are both blessing and a curse - the problem with them is that you have no lifetime control over such variables.


That is the plan for PostgreSQL.


> if the hierarchy inside them is well organized

is this another way to say "in a 2000 member structure, only 10 have significant voting power"?


This statement is not about people, it is about a C struct.


I recently looked through the source code of postgresql and every source files starts with a (really good) description of what the file is supposed to do, which made it really easy to get in to the code compared to other open source projects I've seen. So thanks for that.


I have no idea why that isn't standard practice in every codebase. I should be able to figure out your code without having to ask, or dig through issues or commit messages. Just tell me what it's for!


Because it takes a lot of time and because the comments can get outdated. I also want this for all my code bases. But do I always do this myself? No, especially on green field projects. I will sometimes go back and annotate them later.


They can get outdated but they usually don't. It's a good litmus test for if a file is too big / small if it's purpose is hard to nail down.


Trying to understand what I previously wrote and why I wrote it takes more time than I ever care to spend. I'd much rather have the comments, plus at this point, by making them a "first class" part of my code, I find them much easier to write and I find the narrative style I use incredibly useful in laying out a new structure but also in refactoring old ones.


Even outdated comments can tell you the original purpose of the code, which helps if you're looking for a bug. Especially if you're looking for a bug.

If someone didn't take the time to update the comments and the reviewers didn't point it out, then you've probably found the bug because someone was cowboying some shitty code.


I have the opposite experience.

Outdated comments are often way worse than no comments, because they can give you wrong ideas that aren't true anymore, and send you off in the wrong direction before you finally figure out the comment was wrong.


Indeed. I recently found this piece of code:

    if (X) assert(false); // we never do X, ever, anywhere.
Then I look over to the other pane, where I have a different, but related file open:

    if (exact same X) { do_useful_stuff(); }
It got a chuckle out of me.


Did you update the comment? :-)


// there are two kinds of mutually exclusive commentors

enum kinds { writers; readers; updaters; }


the average programmer thinks they are writting significantly above average clean code, so no need to document it :-)


It kind of is in rust now, with module-level documentation given its own specific AST representation instead of just being a comment at the top of the file (a file is a module).


Uncle Bob hates this.


Having been using and administering a lot of PostgreSQL servers, I hope they don't lose any stability over this.

I've seen (and reported) bugs that caused panics/segfaults in specific psql processes. Not just connections, also processes related to wal writing or replication. The way it's built right now, a child process can be just forced to quit and it does not affect other processes. Hopefully switching into thread won't force whole PostgreSQL to panic and shut down.


Because of shared memory most panics and seg faults in a worker process take down the entire server already (this wasn’t always the case, but not doing so was a bug).


Most likely, the postmaster will maintain a separate process, much like today with pg, or similar to Firefox or Chrome's control process that can catch the panic'd process, cleanup and restart them. The WAL can be recovered as well if there were broken transactions in flight.


100%. Same here. There's a lot of baby in the processes, not just bathwater.

As a longstanding PG dev/DBA who doesn't know much about its internals, I would say that they should just move connection pooling into the main product.

Essentially, pgbouncer should be part of PG and should be able to manage connections with knowledge of what each connections is doing. That, plus, some sort of dynamic max connection setting based on what's actually going on.

That'll remove almost all the dev/DBA pain from separate processes.


Of course it will. That's better than continue working with damaged memory structures and unpredictable consequences. For database it's more important than ever. Imagine writing corrupted data because other thread went crazy.


You're implying that only an OS can provide memory separation between units of execution - at least in .NET AppDomains give you the same protection within a single process, so why couldn't postgres have its own such mechanism? I'd also think with a database engine shared state is not just in-memory - i.e. one process can potentially corrupt the behaviour of another by what it writes to disk, so moving to a single-process model doesn't necessarily introduce problems that could never have existed previously (but, yes, would arguably make them more likely)


No AppDomains are not as good as processes, I have tried to go that route before, you cannot stop unruly code reliably in an app domain (you must use thread.abort() which is not good) and memory can still leak in any native code used there.

The only reliable way to stop bad code like say an infinite loop is to run in another process even in .Net.

They also removed Appdomain in later versions of .Net because they had little benefit and weak protections compared to a a full process.


Not claiming they're as good, just noting that there are alternative ways to provide memory barriers, though obviously if it's not enforced at the language/runtime level, it requires either super strong developer disciple or the use of some other tool to do so. I can't find anything suggesting AppDomains have been removed completely though, just they're not fully supported on non-Windows platforms, which is interesting, I wonder if that means they do have OS-level support.


https://learn.microsoft.com/en-us/dotnet/api/system.appdomai...

"On .NET Core, the AppDomain implementation is limited by design and does not provide isolation, unloading, or security boundaries. For .NET Core, there is exactly one AppDomain. Isolation and unloading are provided through AssemblyLoadContext. Security boundaries should be provided by process boundaries and appropriate remoting techniques."

AppDomains pretty much only allowed you to load unload assemblies and provided little else. If you wanted to stop bad code you still used Thread.Abort which left your runtime in a potentially bad state due to no isolation between threads.

The only way to do something like an AppDomain to replace process isolation would be to re-write the whole OS in a memory safe language similar to https://en.wikipedia.org/wiki/Midori_(operating_system) / https://en.wikipedia.org/wiki/Singularity_(operating_system)


Is that saying global variables are shared between AppDomains on .NET core then? Scary if so, we have a bunch of .NET framework code we're looking at porting to .NET core in the near future, and I know it relies on AppDomain separation currently. It's not the first framework->Core conversation I've done, but I don't remember changes in AppDomain behaviour causing any issues the first time.

As it happens I already know there are bits of code currently not working "as expected" exactly because of AppDomain separation - i.e. attempting to use a shared-memory cache to improve performance and in one or two cases in an attempt to share state, and I got the impression whoever wrote that code didn't understand that there even were two AppDomains involved, and used various ugly hacks to "fall back" to alternative means of state-sharing, but in fact the fall-back is the only thing that actually ever works.


> Is that saying global variables are shared between AppDomains on .NET core then?

No, you can't create a second AppDomain at all. AppDomains are dead and buried; you would need to remove all of that from your code in order to migrate to current .NET. The class only remains to serve a couple ancillary functions that don't involve actually creating additional AppDomains.


We're not creating them ourselves, they're created by IIS.


I don't know .NET enough to comment here, but I'm pretty sure that if you would manage to run bare metal C inside your .NET app (should be possible), it'll destroy all your domains easily. RAM is RAM. The only memory protection that we have is across process boundary (even that protection is not perfect with shared memory, but at least it allows to protect private memory).

At least I'm not aware of any way to protect private thread memory from other threads.

Postgres is C and that's not going to change ever.


I certainly wasn't suggesting it would make sense to rewrite Postgres to run on .NET (using any language, even managed C++, assuming anyone still uses that). Yes, it's inherent in the C/C++ language that it's able to randomly access any memory that a process has access to, and obviously on that basis OS-provided process-separation is the "best" protection you can get, just pointing out that it's not the only possibility.


.NET is a managed-language with a VM. In such language, a memory error in managed-code will often trigger a jump back to the VM, where they can attempt to recover from there.

For native code, there's no such safety net. Likewise, even for managed language, an error in the interpreter code will still crash the VM, since there's nothing to fallback to anymore.


True, if you're talking unrestricted native code, I'd essentially agree with the OP's implication that only the OS (and the CPU itself) is capable of providing that sort of memory protection. I guess I was just wondering what something like AppDomains in C might even look like (e.g. all global variables are implicitly "thread_local"), and how much could be done at compile-time using tools to prevent potentially "dangerous" memory accesses. I've never looked at the postgres source in any detail so I'm likely underestimating the difficulty of it.


For a decades old codebase probably only the OS can.

Point is it getting worse if this is changed.


This reminds me of this poster: "You must be this tall..."

https://bholley.net/blog/2015/must-be-this-tall-to-write-mul...

Back about a decade ago I was "auditing" someone else's threaded code. And couldn't figure it out. But he was the company's "golden child" so by default it must be working code because he wrote it.

And then it started causing deadlocks in prod.

"What do you want me to do about it? It's the golden child's code. He's not even gonna show up til 2pm today."


The thing is... multi-process with a bespoke shared memory system isn't better than multithreading; it's much worse.


I'm not sure if I'd judge it as harshly, but you have a good point: A lot of debugging / validation tooling understands threads, but not memory shared between processes.


In Linux, multi process with shared memory regions is basically just threads. The kernel doesn’t know anything about threads, it knows about processes and it lets you share memory regions between those processes if you so desire.


By bespoke you mean using standard interfaces to create shared memory pools?

They do roll some of their own locking primitives, but that's not particularly unusual in a large portable program (and quite likely what they wanted is/was not available in glibc or other standard libraries, at least when first written).


The difference is between everything shared (threads) and some parts are shared explicitly (processes with shared memory). I'm not sure 2nd is worse.


It kinda is though. The process barrier is better at enforcing careful deliberate interactions


Oracle has similar problems.

On UNIX systems, Oracle uses a multi-process model, and you can see these:

  $ ps -ef | grep smon

  USER      PID  PPID  STARTED   TIME %CPU %MEM COMMAND
  oracle  22131     1   Mar 28   3:09  0.0  4.0 ora_smon_yourdb
Windows forks processes about 100x slower than Linux, so Oracle runs threaded on that platform in one great big PID.

Sybase was the first major database that fully adopted threads from an architectural perspective, and Microsoft SQL Server has certainly retained and improved on that model.


> Windows forks processes about 100x slower than Linux...

I work with a Windows-based COTS webapp that uses Postgres w/o any connection pooling. It's nearly excruciating to use because it spins-up new Postgres processes for each page load. If not for the fact that the Postgres install is "turnkey" with the app I'd just move Postgres over to a Linux machine.


If you run postgres under WSLv1 (now available on Server Edition as well), the WSL subsystem handles processes and virtual memory in a way that has been specifically designed to optimize process initialization as compared to the traditional Win32 approach.


It would not be difficult to simply "pg_dump" all the data to Postgres on a Linux machine, then quietly set the clients to use the new server.


Use pgbouncer


Was curious about this as an architectural solution as well.

We're really talking about X-per-client as the primary reason to move away from processes, right?

So if you can get most of the benefit via pooling... why inherit the pain of porting?

Presumably latency jitter would be a difficult problem with pools, but it seems easier (and safer) than porting processes -> threads.

Disclaimer: High performance / low latency DB code is pretty far outside my wheelhouse.


The reasons are explained in article. Read the article


I appear to have missed them, then.

Could you point out, aside from the large numbers of clients I mentioned (and the development overhead of implementing multi-process memory management code), what the article mentions is a primary drawback of using processes over threads?


> The overhead of cross-process context switches is inherently higher than switching between threads in the same process - and my suspicion is that that overhead will continue to increase. Once you have a significant number of connections we end up spending a lot of time in TLB misses, and that's inherent to the process model, because you can't share the TLB across processes.


Yes, that's per-client performance scaling ("significant number of connections"), which indicates a pooled connection model might mitigate most of the performance impact while allowing some core code to remain process-oriented (and thus, not rewritten).


> We're really talking about X-per-client as the primary reason to move away from processes, right?

Many other things too. Like better sharing of caches. Lower overhead of thread instead of process. Etc. (read the thread)


pgbouncer is not transparent, you loose features, particularly when using the pooling mode actually allowing a larger number of active concurrent connections. Solving those issues is a lot easier with threads than with processes.


That helps a lot but it's not a replacement for large number of persistent connections. If you had that you could simplify things in the application layer and do interesting things with the DB.


Didn't Oracle switch to threaded model in 12c - at least on Linux I remember there being a parameter to do that - it dropped the number of processes significantly.


No, I ran that on v19.

  $ ps -ef | grep smon
  UID        PID  PPID  C STIME TTY          TIME CMD
  oracle   22131     1  0 Mar28 ?        00:03:09 ora_smon_yourdb

  $ $ORACLE_HOME/bin/sqlplus -silent '/ as sysdba'
  select version_full from v$instance;

  VERSION_FULL
  -----------------
  19.18.0.0.0


https://oracle-base.com/articles/12c/multithreaded-model-usi...

Probably still requires the parameter to be set.


Contrast this to Microsoft SQL Server:

  $ systemctl status mssql-server
  ● mssql-server.service - Microsoft SQL Server Database Engine
     Loaded: loaded (/usr/lib/systemd/system/mssql-server.service; disabled; vendor preset: disabled)
     Active: active (running) since Mon 2023-06-19 15:48:05 CDT; 1min 18s ago
       Docs: https://docs.microsoft.com/en-us/sql/linux
   Main PID: 2125 (sqlservr)
      Tasks: 123
     CGroup: /system.slice/mssql-server.service
             ├─2125 /opt/mssql/bin/sqlservr
             └─2156 /opt/mssql/bin/sqlservr


Yeah multiprocess isn't Microsoft's style given how expensive creating processes is on Windows.

Oracle - never had a scalability issue on very big Linux, Solaris and HPUX systems though - they do it well in my experience.


I'm not sure what I wonder on more - seeing its not enabled on boot or seeing mssql under systemd


> Didn't Oracle switch to threaded model in 12c

It's optional, and the default is still a process model on Linux.


It's always amazed me with databases why they don't go the other way.

Create an operating system specifically for the database and make it so you boot the database.

Databases seem to spend most of their time working around the operating system abstractions. So why not look at the OS, and streamline it for database use - dropping all the stuff a database will never need.

That then is a completely separate project which is far easier to get started rather than shoehorning the database into an operating system thread model that is already a hack of the process model.


I'm not sure what you mean by OS. If you mean a whole new kernel, it will take decades. They can support only small number of HW. If you mean a specialized linux distro, many companies does that already.

I don't know how that can make it easier the process based / thread based problem.


This project could borrow a lot from unikernels. If they mandate running it as a VM, there is no HW to support.


That was/is part of the promise of the whole unikernel thing, no?

https://mirage.io/ or similar could then let you boot your database. That said, it's not really taken off from what I can tell, so I'm guessing there's more to it than that.


Imo unikernels are a complicated solution in search of a problem, which turns out to not exist.

There certainly are times OSs get in the way. But it's hard enough to write a good database, we don't need to maintain a third of an OS in addition.


Yeah indeed, that was my feeling on it as well. As much as Linux et al might get in ones way at times, what we get for free by relying on them is too useful to ignore for most tasks I think.

That said, perhaps at AWS or Google scale that would be different? I wonder if they've looked at this stuff internally.


You mean like Microsoft SQL Server, which basically runs a small OS on top of Windows or Linux?

This is actually part of the reason why Microsoft was able to port SQL Server to Linux fairly easily, if I recall correctly.


You can get most of these speedups by using advanced APIs like IO_uring and friends, while still benefiting of using an OS, which is taking care of the messy and thankless task of hardware support.


> Create an operating system specifically for the database and make it so you boot the database.

(Others downthread have pointed out unikernels and I agree with the criticisms)

This proposal is an excellent Phd project for someone like me :-)

It ticks all of the things I like to work on the most[1]:

Will involve writing low-level OS code

Get to hyper-focus on performance

Writing a language parser and executor

Implement scheduler, threads, processes, etc.

Implement the listening protocol in the kernel.

I have to say, though, it might be easier to start off with a rump kernel (netBSD), then add in a specific RAW disk access that bypasses the OS (no, or fewer, syscalls to use it), create a kernel module for accepting a limited type of task and executing that task in-kernel (avoiding a context-swtich on every syscall)[2].

Programs in userspace must have the lowest priority (using starvation-prevention mechanisms to ensure that user input would eventually get processed).

I'd expect a non-insignificant speedup by doing all the work in the kernel.

The way it is now,

userspace requests read() on a socket (context-switch to kernel),

gets data (context-switch to userspace),

parses a query,

requests read on disk (multiple context-switches to kernel for open, stat, etc, multiple switches back to userspace after each call is complete). This latency is probably fairly well mitigated with mmap, though.

logs diagnostic (multiple context-switches to and from kernel)

requests write on client socket (context switch to kernel back and forth until all data is written).

The goal of the DBOS would be to remove almost all the context-switching between userspace and kernel.

[1] My side projects include a bootable (but unfinished) x86 OS, various programming languages, performant (or otherwise) C libraries.

[2] Similar to the way RealTime Linux calls work (caller shares a memory buffer with rt kernel module, populates the buffer and issues a call, kernel only returns when that task is complete). The BPF mechanism works the same. It's the only way to reduce latency to the absolute physical minimum.


> Create an operating system specifically for the database and make it so you boot the database.

I have the impression that this is similar to the adhoc filesystem idea; this seems in principle very advantageous (why employing two layers that do approximately the same thing on top of each other?), but in reality, when implemented (by Oracle), it lead to only a minor improvement (a few % points, AFAIR).


Sounds like IBM OS/400.


It sounds like the specific concerns here are actually around buffer pool management performance in and around the TLB: "Once you have a significant number of connections we end up spending a *lot* of time in TLB misses, and that's inherent to the process model, because you can't share the TLB across processes. "

Many of the comments here seem to be missing this and talking about CPU-boundedness generally and thread-per-request vs process etc models, but this seems orthogonal to that, and is actually quite specific about the VM subsystem and seems like a legitimate bottleneck with the approach Postgres has to take for buffer/page mgmt with the process model it has now.

I'm no Postgres hacker (or a Linux kernel hacker), and I only did a 6 month stint doing DB internals, but it feels to me like perhaps the right answer here is that instead of Postgres getting deep down in the weeds refactoring and rewriting to a thread based model -- with all the risks in that that people have pointed out -- some assistance could be reached for by working on specific targeted patches in the Linux kernel?

The addition of e.g. userfaultfd shows that there is room for innovation and acceptance of changes in and around kernel re: page management. Some new flags for mmap, shm_open, etc. to handle some specific targeted use cases to help Postgres out?

Also wouldn't be the first time that people have done custom kernel patches or tuning parameters to crank performance out of a database.


Exactly my thinking. If the problem is TLB evictions, why not improve these? PG isn't the only software that is hit by those.

And if you carefully craft your CPU scheduling and address spaces mapping, you can reduce them by a lot.

Yet rewrites are always easier and more sexy. At first.


The TLB issue are more a hardware issue than a software/OS one. To my knowledge neither x86 nor arm provide a way to partially share TLB contents between processes/entities. Tlb entries can be tagged with a process context identifier, but that's an all or nothing thing. Either the entire address space is shared, or it's not.

> Yet rewrites are always easier and more sexy. At first.

Moving to threads would not at all be a rewrite.


Pretty sure Tom Lane said this will be a disaster in that same pgsql-hackers thread. Not entirely sure what benefits the multi-threaded model will have when you can easily saturate the entire CPU with just 128 connections and a pooler. So I doubt there is consensus or even strong desire from the community to undertake this boil the ocean project.

On the other hand, having the ability to shut down and cleanup the entire memory space of a single connection by just disconnecting is really nice, especially if you have extensions that do interesting things.


From the article:

> Tom Lane said: "I think this will be a disaster. There is far too much code that will get broken". He added later that the cost of this change would be "enormous", it would create "more than one security-grade bug", and that the benefits would not justify the cost.


You can think of this as an opportunity to rewrite in Rust.


AfterPostgres


May I humbly suggest PostPostgres, or Post²gres?


2Post2Furigres


Postgr3s: Tokyo Thread


>Not entirely sure what benefits the multi-threaded model will have when you can easily saturate the entire CPU with just 128 connections and a pooler.

That the all of those would work faster because of performance benefits, as mentioned in article


"the benefits would not justify the cost". PostgreSQL, like any software, at some point in it's life need to be refactored. Why not refactor with a thread model. Of course there will be bugs. Of course it will be difficult. But I think it is a worthwhile endeavor. Doesn't sound like this will happen but a new project would be cool.


> like any software, at some point in it's life need to be refactored.

This is simply not true for most software. Software has a product life cycle like everything else and major refactors/rewrites should be weighed carefully against cost/risk of the refactor. Many traditional engineering fields do much better at this analysis.

Although, because I run a contracting shop, I have personally profited greatly by clients thinking this is true and being unable to convince them otherwise.


"Difficult" doesn't even begin to do it justice. Making a code which has 2k global variables and probably order of magnitude as many underlying assumptions (the code should know that now every time you touch X you may be influenced or influence all other threads that may touch X) is a gargantuan task, and will absolutely for sure involve many iterations which any sane person would never let anywhere near valuable data (and how long would it take until you'd consider it safe enough?). And making this all performant - given that shared-state code requires completely different approach to thinking about workload distribution, something that performs when running in isolated processes may very well get bogged down in locking or cache races hell when sharing the state - would be even harder. I am not doubting Postgres has some very smart people - much, much smarter than me, in any case - but I'd say it could be more practical to write new core from scratch than trying to "refactor" the core that organically grew for decades with assumptions of share-nothing model.


a better option would just create an experimental fork that has a different name and is obviously a different product but based on the original source. That way pg gets updates and remains stable and if they fail, they fail and it doesn't hurt all the pg in production.


What you're talking about is a rewrite, not a refactor.


I'm rather surprised that their focus is on improving vertical scalability, rather than on adding more features for scaling Postgres horizontally.


If you're more interested in horizontal scaling, you may want to look into CockroachDB, which has a Postgres compatible protocol, but still quite different. There are a lot more limitations with CDB over Pg though.

With the changes suggested, I'm not sure it's the best idea from where Postgres is... if might be an opportunity to rewrite bits in Rust, but even then, there is a LOT that can go wrong. The use of shared memory is apparently already in place, and the separate process and inter-process communication isn't the most dangerous part... it's the presumption, variables and other contextual bits that are currently process globals that wouldn't be in the "after" version.

The overall surface is just massive... That doesn't even get into plugin compatibility.


Isn't this the nature of relational databases (as per CAP)?


This sounds like a problem that would border on the complexity of replacing the GIL in Ruby or Python. The performance benefits are obvious but it seems like the correctness problems would be myriad and a constant source of (unpleasant) surprises.


This is different because there isn’t a whole ecosystem of packages that depend on access to a thread unsafe C API. Getting the GIL out of core Python isn’t too challenging. Getting all of the packages that depend on Python’s C API working is.


An other component of the Gil story is that removing the Gil require adding fine grained locks, which (aside from making VM development more complicated) significantly increases lock traffic and thus runtime costs, which noticeably impacts single-threaded performance, which is of major import.

Postgres starts from a share-nothing architecture, it’s quite a bit easier to evaluate the addition of sharing.


Postgres already shares a lot of state between processes via shared memory. There's not a whole lot that would initially change from a concurrency perspective.


> which (aside from making VM development more complicated) significantly increases lock traffic and thus runtime costs, which noticeably impacts single-threaded performance, which is of major import.

I don't think that's a fair characterization of the trade offs. Acquiring uncontended mutexes is basically free (and fairly side-effect free) so single-threaded performance will not be noticeably impacted.

Every large C project I'm aware of (read: kernels) that has publicly switched from coarse locks to fine-grained locks has considered it to be a huge win with little to no impact on single-threaded performance. You can even gain performance if you chop up objects or allocations into finer-grained blobs to fit your finer-grained locking strategy because it can play nicer with cache friendliness (accessing one bit of code doesn't kick the other bits of code out of the cache).


> which noticeably impacts single-threaded performance, which is of major import.

1) I don't buy this a priori. Almost everybody who removed a gigantic lock suddenly realizes that there was more contention than they thought and that atomizing it made performance improve.

2) Had Python bitten the bullet and removed the GIL back at Python 3.0, the performance would likely already be back to normal or better. You can't optimize hypothetically. Optimization on something like Python is an accumulation of lots of small wins.


> I don't buy this a priori.

You don’t have to buy anything, that’s been the result of every attempt so far and a big reason for their rejection. The latest effort only gained some traction because the backers also did optimisation work which compensated (and then was merged separately).

> Almost everybody who removed a gigantic lock

See that’s the issue with your response, you’re not actually reading the comment you’re replying to.

And the “almost” is a big tell.

> suddenly realizes that there was more contention than they thought and that atomizing it made performance improve.

There is no contention on the gil in single threaded workloads.

> Had Python bitten the bullet and removed the GIL back at Python 3.0

It would have taken several more years and been completely DOA.


> there isn’t a whole ecosystem of packages that depend on access to a thread unsafe C API

They mentioned a similar issue for Postgres extensions, no?

> Haas, though, is not convinced that it would ever be possible to remove support for the process-based mode. Threads might not perform better for all use cases, or some important extensions may never gain support for running in threads.


I question how important an extension is if there’s not enough incentive to port it to the new paradigm, at least eventually.


Well. The thing with that is just that there are a lot of extensions. Like, a lot!


The correctness problem should be handled by a suite of automated tests which PostgreSQL has. If all tests pass, the application must work correctly. The project is too big, and has too many developers to make much progress without full test coverage. Where else would up-to-date documentation regarding the correct behavior of PostgreSQL exist? In some developers head? SQLite is pretty famous for there extreme approach to testing including out of memory conditions, and other rare circumstances: https://www.sqlite.org/testing.html


Parallelism is often incredibly hard to write automated tests for, and this will most likely create parallelism issues that were not dreamed of by the authors of the test suite.


> If all tests pass, the application must work correctly.

These are "famous last words" in many contexts, but when talking about difficult-to-reproduce parallelism issues, I just don't think it's a particularly applicable viewpoint at all. No disrespect. :)


Even the performance benefits are not big enough compare to the GIL.

Biggest problem of the process model might be the cost of having too many DB connections. Each client need a dedicated server process. Memory usage and the context switching overhead. Or if there is no connection pool, connection time overhead is very high.

This problem has been well addressed with a connection pool. Or having a middle ware instead of exposing the DB directly. That works very well so far.

Oracle has been supporting the thread based model and it's been usable for decades. I remember I tried the thread based configuration option (MTS or shared server) in 1990s. But no one likes that at least within my Oracle DBA network.

It would be a great research project but it would be a big problem if the community pushs this too early.


Does GIL stand for Global Interpreter Lock?


yes


It would be interesting to have something between threads and processes. I'll call them heavy-threads for sake of discussion.

Like light-threads, heavy-threads would share the same process-security-boundary and therefore switching between them would be cheap. No need to flush TLB, I$, D$.

Like processes, heavy-threads would have mostly-separate address spaces by default. Similar to forking a process, they could share read-only mappings for shared libraries, code, COW global variables, and explicitly defined shared writable memory regions.

Like processes, heavy-threads would isolate failure states. A C++ exception, UNIX signal, segfault, etc. would kill only the heavy-thread responsible.


> No need to flush TLB

TLB isn't "flushed" so much as it is useless across different memory address spaces. Switching processes means switching address spaces, which means you have to switch the contents of the TLB to the new process' TLB entries, which eventually indeed flushes the TLB, but that is only over time, not necessarily the moment you switch processes.

> Like processes, heavy-threads would have mostly-separate address spaces by default.

This thus conflicts with the need to not flush TLBs. You can't not change TLB contents across address spaces.


There are some problems.

1. Mostly separate address spaces requires changing the TLB on context switch (modern hw lets it be partial). You could use MPKs to share a single address space with fast protection switches.

2. Threads share the global heap, but your heavy threads would require explicitly defined shared writeable memory regions, so presumably each one has its own heap. That's a fair bit of overhead.

3. Failure isolation is more complicated than deciding what to kill.

The expand on the last point, Postgres doesn't isolate failures to a single process because they do share memory and might corrupt those shared memory regions. But even if you don't have shared memory failure recovery isn't always easy. Software has to be written specifically to plan for it. You can kill processes because everything in the OS is written around allowing for that possibility, for example, shells know what to do if a sub-process is killed unexpectedly. Killing a heavy thread (=process) is no good if the parent process is going to wait for a reply from it forever because it wasn't written to handle the process going away.


So what would be different between those and forked processes?


I've been pondering / ruminating with this too; I've been somewhat surprised that few operating systems have played with reserving per-thread address space as thread-local storage, or requiring something akin to a 'far' pointer to access commonly-addressed shared memory.


> Like light-threads, heavy-threads would share the same process-security-boundary and therefore switching between them would be cheap. No need to flush TLB, I$, D$.

> Like processes, heavy-threads would have mostly-separate address spaces by default. Similar to forking a process, they could share read-only mappings for shared libraries, code, COW global variables, and explicitly defined shared writable memory regions.

I don't think you realistically can have separate address spaces and not have TLB etc impact. If they're separate address spaces, you need separate TLB entries => lower TLB hit ratio.


You cannot COW and share the TLB state. The caches aren't flushed in process changes either: it's that the data is different so evictions happen.


A close to impossible task, if anyone can do it's probably Heikki though.

Unfortunately I expect this to go the way of zheap et al. Fundamental design changes like this have just had such a rough time of succeeding thus far.

I think for such a change to work it probably needs not just the support of Neon but also of say Microsoft (current stewards of Citus) that have larger engineering resources to throw at the problem and grind out all the bugs.


I know that at least people from EDB (Robert) and Microsoft (Thomas, me) are quite interested in eventually making this transition, it's not just Heikki. Personally I won't have a lot of cycles for the next release or two, but after that...


That gives me some faith, I hope everyone is able to come together to make it happen.


Hey I'm fairly new to the who's who in the PostgreSQL world, would you mind telling why Heikki might be able to pull this off?


Not who you asked, but: He is a longtime contributor who has written/resigned important parts of postgres (WAL format, concurrent WAL insertion, 2PC support, parts of SSI support, much more). And he is just a nice person to work with.


Cool! He seems like a powerhouse in this space - thank you for the answer


So compromise. Take the current process model, add threading and shared memory, with feature flags to limit number of processes and number of threads.

Want to run an extension that isn't threadsafe? Run with 10 processes, 1 threads. Want to run high-performance? Run with 1 process, 10 threads. Afraid of "stability issues"? Run with 1 process, 1 thread.

Will it be hard to do? Sure. Impossible? Not at all. Plan for it, give a very long runway, throw all your new features into the next major version branch, and tell people everything else is off the table for the next few years. If you're really sure threading is going to be increasingly necessary, better to start now than to wait until it's too late. But this idea of "oh it's hard", "oh it's dangerous", "too complicated", etc is bullshit. We've built fucking spaceships that visit other planets. We can make a database with threads that doesn't break. Otherwise we admit that basic software development using practices from the past 30 years is too much for us to figure out.


Worked on a codebase which was separate processes, each of which has a shedload of global variables. It was a nightmare working out what was going on, not helped by the fact that there was no naming convention for the globals, plus they were not declared in a single place. I believe their use was a performance move, ie having the linker pin a var to a specific memory location rather than copying it to the stack as a variable and referencing it by offset the whole time. Premature optimisation? Optimisation at all? Who knows, but there's a good reason coding standards typically militate against globals.


Per discussion on this very page, in the headlined article, and in the mailing list discussion it references, PostgreSQL is not in that category. It has lots of static storage duration variables, which do not necessarily have external linkage.

Robert Haas pointed out in one message that an implementation pattern was to use things like file-scope static storage duration variables to provide session-local state for individual components. This is why they've been arguing against a single giant structure declared in "session.h" as an approach, as it requires every future addition to session state to touch the central core of the entire program.

They want to keep the advantage of the fact that these variables are in fact not global. They are local; and the problem is rather that they have static storage duration and are not per-thread, and thus are not per-session in a thread-per-session model.


There's something to be said for globals whose access is well-managed, though.

IMO: if the variable is _truly_ global, i.e. code all over the codebase cares about it, then it should just be global instead of pretending like it's not with some fancy architecture.

The tricky part is reacting to changes to a global variable. Writing a bunch of "on update" logic leads to madness. The ideal solution is for there to be some sort of one-directional flow for updates, like when a React component tree is re-rendered... but that's very hard to build in an application that doesn't start out using a library like React in the first place.


There are 2000 globals here, so more like a couple of shedloads. While this is something you'd sort of expect for a product that's been around 30+ years, it really seems like there's a lot of optimization that could happen and still stick with the process model.


I wish they would do some kind of easy shared storage instead, or in addition too. This sounds like an odd solution, however I’ve scaled pgsql since 9 on very, very large machines and doing 1 pgsql cluster per physical socket ended up doing near-linear scaling even on 100+ total core machines with TB+ of memory.

The challenge with this setup is that you need to do 1 writer and multiple reader clusters so you end up doing localhost replication which is super weird. If that requirement was somehow removed that’d be awesome for scaling really huge clusters.


Some interesting discussion on this here also: https://news.ycombinator.com/item?id=36284487


I hope they are conservative about this, because even the smartest and best programmers in the world cannot create bug free multithreaded code.


I mentally snarked to myself that "obviously they should rewrite it in Rust first".

Then, after more thought, I'm not entirely sure that would be a bad approach. I say this not to advocate for actually rewriting it in Rust, but as a way of describing how difficult this is. I'm not actually sure rewriting the relevant bits of the system in Rust wouldn't be easier in the end, and obviously, that's really, really hard.

This is really hard transition.

I don't think multithread code quality should be measured in absolutes. There are things that are so difficult as to be effectively impossible, which is the lock-based approach that was dominant in the 90s, and convinced developers that it's just impossible difficult, but it's not multithreaded code that's impossibly difficult, it's lock-based multithreading. Other approaches range from doable to even not that hard once you learn the relevant techniques (Haskell's full immutability & Rust's borrow checker are both very solid), but of course even "not that hard" becomes a lot of bugs when scaled up to something like Postgres. But it's not like the current model is immune to that either.


Concurrency isn’t a “nice layer over pthreads” - the most important thing is isolation - anything that mucks up isolation is a mistake.

— Joe Armstrong

Threads are evil. https://www.sqlite.org/faq.html#q6 https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-...

Nginx uses an asynchronous event-driven approach, rather than threads, to handle requests. https://aosabook.org/en/v2/nginx.html http://www.kegel.com/c10k.html


The code already is multithreaded. They have shared state just across multiple processes instead of threads within a process.

They might even reduce complexity that way.


It's not the same at all for global variables, of which pgsql apparently has around a couple thousand.

If every process is single threaded, you don't have to consider the possibility of race conditions when accessing any of those ~2000 global variables. And you can pretty much guarantee that little if any of the existing code was written with that possibility in mind.


Those global variables would be converted to thread locals and most of the code would be oblivious of the change. This is not the hard part of the change.


The fact that they are planning on doing this across multiple releases gives me hope that they'll be cautious with this.


Postgres is already concurrent today. There's a lot of shared state between the processes (via shared memory).


Nonsense, multithreaded code can be written as bug free as regular code. No need to fear.


I'm assuming you're referring to formally proven programs. If that's the case, do you have any pointers?

Aside from the trivial while(!transactionSucceeded){retry()} loop, I have trouble proving the correctness of my programs when the number of threads is not small and finite.


This is true. However, the blast radius may be smaller with a process model. Also recovering from a fatal error in one session could possibly be easier. I say this as a 30-year threading proponent.


I think the point is that some mistakes in process based code are not realized as the bugs that they will be in threaded code?


It can be. Anything can be. It is far more treacherous, though.


In theory, yes. In practice, no.


It is just harder.


Seems like a bad idea. Processes are more elegant and scalable than threads as they discourage the use of shared memory. Shared memory is often a bad idea. You end up with different threads competing and queuing up to access or write the same data (e.g. waiting on each other to acquire a lock with mutexes - This immediately disqualifies the system from becoming embarrassingly parallel) and it becomes the OS's problem to figure out when to allow which thread to access what memory... This is bad because the OS doesn't care about optimizing memory access for your specific use case. It will treat your 'high performance' database in the same way as it treats a run-of-the-mill Gimp desktop application....

With the process model, it encourages using separate memory for each process; this forces developers to think about things like memory consistency and availability and gives them more flexibility in terms of scalability across multiple CPU cores or even hosts. Processes are far better abstractions than threads for modeling concurrent systems since their logic is fundamentally the same regardless of whether they run across different CPU cores or different hosts.

> The overhead of cross-process context switches is inherently higher than switching between threads in the same process

I remember researching this a while back. It depends on the specific OS and hardware. It's not so straight forward and this is something which tends to change over time and the differences are usually insignificant anyway.

Also, it's important not to conflate performance with scalability - These two characteristics are orthogonal at best and oftentimes conflicting.

Oftentimes, to scale horizontally, a system needs to incur a performance penalty as additional work is required to route and coordinate actions across multiple CPUs or hosts. A scalable system can service a much larger number (or even sometimes theoretically unlimited) number of requests but it will typically perform worse than a non-scalable system if you judge it on a requests-per-CPU-core basis.


Why should TLB flush performance ever be a problem on big machines? You can have one process per core with 128 or more cores, never flush any TLB if you pin those processes. And as it is a database, shoveling data from/to disk/SSD is your main concern anyways.


PostgreSQL uses synchronous IO, so you won't saturate the CPU with one process (or thread) per core.

That said, I think there have been efforts to use io_uring on Linux. I'm not sure how that would work with the process per connection model. Haven't been following it...


> That said, I think there have been efforts to use io_uring on Linux. I'm not sure how that would work with the process per connection model. Haven't been following it...

There's some minor details that are easier with threads in that context, but on the whole it doesn't make much of a difference.


I don't understand how it works with thread per connection either. io_uring is designed for systems that have a thread and ring per core, for you to give it a bunch of IO to do at once (batches and chains), and your threads to do other work in the meantime. The syscall cost is amortized or even (through IORING_SETUP_SQPOLL) eliminated. If your code is instead designed to be synchronous and thus can only do one IO at a time and needs a syscall to block on it, I don't think there's much if any benefit in using io_uring.

Possibly they'd have a ring per connection and just get an advantage when there's parallel IO going on for a single query? or these per-connection processes wouldn't directly do IO but send it via IPC to some IO-handling thread/process? Not sure either of those models are actually an improvement over the status quo, but who knows.


> io_uring is designed for systems that have a thread and ring per core

That's not needed to benefit from io_uring

> for you to give it a bunch of IO to do at once (batches and chains), and your threads to do other work in the meantime.

You can see substantial gains even if you just submit multiple IOs at once, and then block waiting for any of them to complete. The cost of blocking on IO is amortized to some degree over multiple IOs. Of course it's even better to not block at all...

> If your code is instead designed to be synchronous and thus can only do one IO at a time and needs a syscall to block on it, I don't think there's much if any benefit in using io_uring.

We/I have done the work to issue multiple IOs at a time as part of the patchset introducing AIO support (with among others, an io_uring backend). There's definitely more to do, particularly around index scans, but ...


Oh, I hadn't realized until now I was talking with someone actually doing this work. Thanks for popping into this discussion!

> > io_uring is designed for systems that have a thread and ring per core

> That's not needed to benefit from io_uring

90% sure I read Axboe saying that's what he designed io_uring for. If it helps in other scenarios, though, great.

> Of course it's even better to not block at all...

Out of curiosity, is that something you ever want/hope to achieve in PostgreSQL? Many high-performance systems use this model, but switching a synchronous system in plain C to it sounds uncomfortably exciting, both in terms of the transition itself and the additional complexity of maintaining the result. To me it seems like a much riskier change than the process->thread one discussed here that Tom Lane already stated will be a disaster.

> We/I have done the work to issue multiple IOs at a time as part of the patchset introducing AIO support (with among others, an io_uring backend). There's definitely more to do, particularly around index scans, but ...

Nice.

Is the benefit you're getting simply from adding IO parallelism where there was none, or is there also a CPU reduction?

Is having a large number of rings (as when supporting a large number of incoming connections) practical? I'm thinking of each ring being a significant reserved block of RAM, but maybe in this scenario that's not really true. A smallish ring for a smallish number of IOs for the query is enough.

Speaking of large number of incoming connections, would/could the process->thread change be a step toward having a thread per active query rather than per (potentially idle) connection? To me it seems like it could be: all the idle ones could just be watched over by one thread and queries dispatched. That'd be a nice operational improvement if it meant folks no longer needed a pooler [1] to get decent performance. All else being equal, fewer moving parts is more pleasant...

[1] or even if they only needed one layer of pooler instead of two, as I read some people have!


> > Of course it's even better to not block at all...

> Out of curiosity, is that something you ever want/hope to achieve in PostgreSQL? Many high-performance systems use this model, but switching a synchronous system in plain C to it sounds uncomfortably exciting, both in terms of the delta and the additional complexity of maintaining the result. To me it seems like a much riskier change than the process->thread one discussed here that Tom Lane already stated will be a disaster.

Depends on how you define it. In a lot of scenarios you can avoid blocking by scheduling IO in a smart way - and I think we can quite far towards that for a lot of workloads and the wins are substantial. But that obviously cannot alone guarantee that you never block.

I think we can get quite far avoiding blocking, but I don't think we're going to a complete asynchronous model in the foreseeable future. But it seems more feasible to incrementally make common blocking locations support asynchronicity. E.g. when a query scans multiple partitions, switch to processing a different partition while waiting for IO.

> Is having a large number of rings (as when supporting a large number of incoming connections) practical? I'm thinking of each ring being a significant reserved block of RAM, but maybe in this scenario that's not really true. A smallish ring for a smallish number of IOs for the query is enough.

It depends on the kernel version etc. The amount of memory isn't huge but initially it was affected by RLIMIT_MEMLOCK... That's one reason why the AIO patchset has a smaller number of io_uring "instances" than the allowed connections. The other reason is that we need to be able to complete IOs that other backends started (otherwise there would be deadlocks), which in turn requires having the file descriptor for each ring available in all processes... Which wouldn't be fun with a high max_connections.

> Speaking of large number of incoming connections, would/could the process->thread be a step toward having a thread per active query rather than per (potentially idle) connection?

Yes. Moving to threads really mainly would be to make subsequent improvements more realistic...

> That'd be a nice operational improvement if it meant folks no longer needed a pooler [1] to get decent performance. All else being equal, fewer moving parts is more pleasant...

You'd likely often still want a pooler on the "application server" side, to avoid TCP / SSL connection establishment overhead. But that can be a quite simple implementation.


Problem with all kinds of asynchronous I/O is that your processes then need internal multiplexing, akin to what certain lightweight userspace thread models are doing. In the end, it might be harder to introduce than just using OS threads.


I hope they don't do it.

I've had a similar situation with PHP, where we had written quite a large engine (https://github.com/Qbix/Platform) with many features (https://qbix.com/features.pdf) . It took advantage of the fact that PHP isolated each script and gave it its own global variables, etc. In fact, much of the request handling did stuff like this:

  Q_Request::requireFields(['a', 'b', 'c']);
  $uri = Q_Dispatcher::uri();
instead of stuff like this:

  $this->getContext()->request()->requireFields(['a', 'b', 'c']);
  $this->getContext()->dispatcher()->uri();
Over the last few years, I have run across many compelling things:

  amp
  reactPHP
  Swoole (native extension)
  Fibers (inside PHP itself)
It seemed so cool! PHP could behave like Node! It would have an event loop and everything. Fibers were basically PHP's version of Swoole's coroutines, etc. etc.

Then I realized... we would have to go through the entire code and redo how it all works. We'd also no longer benefit from PHP's process isolation. If one process crapped out or had a memory leak, it could take down everything else.

There's a reason PHP still runs 80% of all web servers in the world (https://kinsta.com/blog/is-php-dead/) ... and one of the biggest is that commodity servers can host terrible PHP code and it's mostly isolated in little processes that finish "quickly" before they can wreak havoc on other processes or on long-running stuff.

So now back to postgres. It's been praised for its rock-solid reliability and security. It's got so many features and the MVCC is very flexible. It seems to use a lot of global variables. They can spend their time on many other things, like making it byzantine-fault-tolerant, or something.

The clincher for me was when I learned that php-fpm (which spins up processes which sleep when waiting for I/O) is only 50% slower than all those fancy things above. Sure, PHP with Swoole can outperform even Node.js, and can handle twice as many requests. But we'd rather focus on soo many other things we need to do :)


I've been using PHP for decades and have found its isolated process model to be about the best around, certainly for any mainstream language. Also Symfony's Process component encapsulates most of the errata around process management in a cross-platform way:

https://symfony.com/doc/current/components/process.html

Going from a working process implementation to async/threads with shared memory is pretty much always a mistake IMHO, especially if it's only done for performance reasons. Any speed gains will be eclipsed by endless whack-a-mole bug fixes, until the code devolves into something unrecognizable. Especially when there are other approaches similar to map-reduce and scatter-gather arrays where data is processed in a distributed fashion and then joined into a final representation through mechanisms like copy-on-write, which are supported by very few languages outside of PHP and the functional programming world.

The real problem here is the process spawning and context-switching overhead of all versions of Windows. I'd vote to scrap their process code in its entirety and write a new version based on atomic operations/lists/queues/buffers/rings with no locks and present an interface which emulates the previous poor behavior, then run it through something like a SAT solver to ensure that any errata that existing software depends on is still present. Then apps could opt to use the direct unix-style interface and skip the cruft, or refactor their code to use the new interface.

Apple did something similar to this when OS X was released, built on a mostly POSIX Darwin, NextSTEP, Mach and BSD Unix. I have no idea how many times Microsoft has rewritten their process model or if they've succeeded in getting performance on par with their competitors (unlikely).

Edit: I realized that the PHP philosophy may not make a lot of sense to people today. In the 90s, OS code was universally terrible, so for example the graphics libraries of Mac and Windows ran roughly 100 times slower than they should for various reasons, and developers wrote blitters to make it possible for games to run in real time. That was how I was introduced to programming. PHP encapsulated the lackluster OS calls in a cross-platform way, using existing keywords from popular languages to reduce the learning curve to maybe a day (unlike Perl/Ruby, which are weird in a way that can be fun but impractical to grok later). So it's best to think of PHP more like something like Unity, where the nonsense is abstracted and developers can get down to business. Even though it looks like Javascript with dollar signs on the variables. It's also more like the shell, where it tries to be as close as possible to bare-metal performance, even while restricted to the 100x interpreter slowdown of languages like Python. I find that PHP easily saturates the processor when doing things in a data-driven way by piping bytes around.


This has Python 3 vibes.


At the end of the day, this doesn't solve any problems. Small setups use postgres directly just fine, and large setups use pgbouncer, and having process isolation with extensions is a good thing and probably simplifies things a lot.

My $0.02


A big advantage of the process-based model is its resilience against many classes of errors.

If a bug in PostgreSQL (or in an extension) causes the server to crash, then only that process will crash. Postmaster will detect the child process termination, and send an error message to the client. The connection will be lost, but other connections will be unaffected.

It's not foolproof (there are ways to bring the whole server down), but it does protect against many error conditions.

It is possible to trap on some exceptions in a threaded environment, but cleaning up after eg. an attempted NULL pointer dereference is going to be very difficult or impossible.


We would still have a separate supervisor process of we moved connections to threads.


I'm curious if they can take advantage of vfork / CLONE_VM, to get the benefits of sharing memory and lower overhead context switches, with the trade of still getting benefits from the scheduler, and sysadmin-friendliness.

The other thing that might be interesting is FUTEX_SWAP / UMCG. Although it doesn't remove the overhead induced by context switches entirely (specifically, you would still deal with TLB misses), you can avoid dealing with things like speculative execution exploit mitigations.


Per the article, Postgres has many, many global variables, many of which track per-session state; much session state is “freed” via process exit rather than being explicitly cleaned up. Switching to CLONE_VM requires these problems to all be solved.


what about support for Windows?


I think an interesting point of comparison is the latest incarnation of SQL Server. You can't even point at 1 specific machine anymore with their hyperscale architecture.

https://learn.microsoft.com/en-us/azure/azure-sql/database/h...


I know I’m probably being naive about this, but is it stupid to ask if there’s a way to make multi process work better on Linux - rather than “fixing” PG?

I feel like the thread vs process thing is one of those pendulums/fads that comes and goes. I’d hate to see PG go down a rabbit hole only to discover the OS could be modified to make things go better.

(I understand not all PG instances run on Linux, just using it as an example)


That'll likely be an even bigger task, and harder to get into mainline kernel.

Linux multi-process is already pretty efficient compared to Windows. However, multi-process is inherently less efficient than multi-thread due to more safety predicates / isolation guaranteed by the kernel, I feel lowering it might lead to more security issues, similar to how Hyper Threading triggered a bunch of issues with Intel Processors.


Right - yeah I was really just wondering if some of the safety predicates could be reduced when there is a relationship between processes, such as the mitigations against cache attacks. I think the cache misses caused by multi-process were one of the reasons given that it's slower than threading. But I don't understand why this is necessarily the case given that the shared memory and executable text ultimately refer to the same data. But I suppose this would need to work with processor affinity and other elements to prevent the cache being knocked around by non-PG processes, and I guess this is one place where it starts getting complicated.

That said, please understand that I'm just being curious - I really don't know what I'm talking about, I haven't built a Linux kernel or dabbled in Unix internals in like 20 years, but thanks for replying :) Postgresql is my favourite open source project and I'm spooked by the threading naysayers.


The TLB is basically keyed by (address space, virtual address % granularity), or needs to be flushed entirely when switching between different views of the address space (e.g. switching between processes). Unless your address space is exactly the same, you're at least going to duplicate TLB contents. Leading to a lower hit rate.

This isn't really an OS issue, more a hardware one, although potential hardware improvements would likely have to be explicitly utilized by operating systems.

Note that the TLB issue is different from the data / instruction cache situation.


> I feel like the thread vs process thing is one of those pendulums/fads that comes and goes.

In this context threads can be understood as processes that share the same address space and vice-versa processes as threads with separate address space.

One gives you isolation, the other convenience and performance. Either can be desirable.

What would you change about this?


This would be one those places where a language like Rust would be helpful. In C/C++ with undefined behavior and crashes, process isolation makes a lot of sense to limit the blast radius. Rust borrow checker gives you at compile time a lot of the safety that you would rely on process isolation for.


Yes, but note that the blast radius of a PostgreSQL process crash is already "the whole system reboots", so there are not a lot of differences between process- and thread-based PostgreSQL written in C.

Rewriting in Rust would be interesting, but it would also probably be too invasive to make it worthwile at all - all code in PostgreSQL is C, while not all code in PostgreSQL interacts with the intrinsics of processes vs threads. Any rewrite to Rust would likely take several times more effort than a port to threads.


PostgreSQL process crash may also just be, one query fails.


It sounds to me like migrating to a fully multi-threaded architecture may not be worth the effort. Simply reducing the number of processes from thousands to hundreds would be a huge win and likely much more feasible than a complete re-architecture.


They could host those currently subprocess inside Wasm environments. This would be largely a mechanical transformation. Even the current shm based architecture would stay intact.


Not sure what's going on here, but one connection per process seems... ancient.

Using threaded model is difficult, how about pre-fork? Some connections per process is a good improvement.


The issue is costlier (runtime, complexity, memory) resource sharing, not the cost of the fork itself. Pre-forking isn't going to help with any of that.


Isn’t this why pgbouncer is so effective? Maybe it’s not the forking itself, but there is something about creating connections that is expensive, warranting such external connection poolers.


So when we as a whole decided that multiprocessing is a much better approach from security and application stability point of view, they decide to go with threads?


Horses for courses I guess - purely threaded vs purely MP both have different set of tradeoffs and shoehorning one over the other always fails some use cases. The article says they are also considering the possibility of having to keep both process and thread models indefinitely for this and other reasons.

I know nothing of PG internals but I can see why process per connection model doesn't work for large machines and/or high number of connections. One way to do it would be to keep connection handling per thread and still keep multiprocess approach where it makes sense for security and doesn't add linear overheads.


Changing the entire architecture of PG to suit the 0.1% of edge cases seems like a poor trade off.

Are there a much larger percentage of users that really need this?


I wonder if it would be easier to create a C virtual machine that emulates all the OS interaction, then recompile Postgres and the extensions to run on this. Perhaps TruffleC would work?

https://dl.acm.org/doi/10.1145/2647508.2647528


Hard to believe that would provide any benefit without also causing massive slowdowns.


Why change postgres? Just fork it if you want to change something this fundamental.


Because it makes it a lot easier to address some of postgres' weaknesses? This is a proposal by long time contributor to postgres, that a number of other long time contributors agree with (and others disagree with!). Why shouldn't have Heikki brought this up for discussion?


This is interesting because Google just created AlloyDb[0] which is decidedly multiprocess for performance and switches out the storage layer from a read/write model to write+replicate + read-only model.

The deep dive[1] has some details; the tl;dr: is that the main process only has to output Write Ahead Logs to the durable storage layer which minimizes transaction latency. The log processing service materializes postgres-compatible on-disk blocks that read-only replicas can read from, with a caching layer for block reads which sends cache invalidations from the LPS to read replicas.

I'm not sure if similar benefits could be seen within a single machine; using network DMA or even rDMA to transfer bytes to and from remote machines also avoids TLB invalidation. There are some mentions in the mailing list of waiting for Linux to support shared page mappings between processes as a solution.

I'm not exactly sure I understand the reasoning behind process separation as crash recovery; as far as I understand each connection is responsible for correctness and so if a process crashes there seems to be an assumption that the database can recover and keep working by killing that process but that seems like it risks silent data corruption; perhaps it's equivalently mitigated by separating materialization of blocks from the sync'd WAL in a separate process from the multithreaded connection process producing WAL entries?

[0] https://cloud.google.com/alloydb [1] https://cloud.google.com/blog/products/databases/alloydb-for...


Mysql guys: Smart move, Postgres!


i wonder if gpt-11 will be able to do this kind of project on its own...


Changing something so fundamental seems like it should be a rewrite.


ngmi


surely this is "Some guy reconsiders the process-based model of PostgreSQL"


Uh. Heikki is definitely not just "some guy". Dude is one of the top contributors to Postgres.


How does that make him immune to having dumb ideas? See, I'm judging the idea on merit.

You're just defending your hero who has gone rogue.


Heikki is far from the only "senior" postgres contributor thinking that this is the right direction in the long term.

> You're just defending your hero who has gone rogue.

That doesn't sound like judging an idea on its merit, it sounds like judging a person for something you haven't analyzed yourself.


> Heikki is far from the only "senior" postgres contributor thinking that this is the right direction in the long term.

sounds like groupthink


I must have missed all the nuanced judgment in your original post. Maybe you can quote some for me.


not rising to this sarcastic bait.


this will be the beginning of the end of postgres


"no objections" <> "consent"


Have you ever tried to move a large organization forward in a certain direction? It’s really hard. At some point you have to make a decision.


I have. What I've observed more is outside attackers with their own agenda use the "nobody objected because they were unprepared and unable to respond in the 2 minutes I gave them to object" as proof their agenda is supported.


Not in something like Postgres, I hope


I am going to go ahead and trust Tom Lane on this one, over someone who is working on "serverless Postgres". Godspeed to the forthcoming fork.


Heikki Linnakangas is one of the top Postgres contributors of all time, he isn't just "someone." The fact he's working for a startup on a fork (that already exists, which you can run right now on your local machine) doesn't warrant any snide dismissal. Robert Haas admitted that it would be a huge amount of work and that it would only be achievable by a small few people anyway, Heikki being among them.

Anyway, I think there are definitely limits that are starting to appear with Postgres in some spots. This is probably one of the most difficult possible solutions to some of those problems, but even if they don't switch to a fully threaded model, being more CPU efficient, better connection handling, etc will all go a substantial way. Doing some of the really hard work is better than none of it, probably.


I wonder what AWS’s PostgreSQL-compatible Aurora looks like under the hood. Does it use threading, processes, both?


Calm down guys! Threading is tricky but they can rewrite it all in Rust so it'll be completely ok........

;-)


Unpopular idea: age limit on votes/contributions.

We are in a unique time period and generation that has strong opinions based on a history that only exists within itself.

At some point it needs to rewrite.


This feels like developers are bored and want a challenge.


It's a multi-decade ask from many PG users and a serious pain point for many deployments.


Finally! This and a good multi master story and I'll finally start to love Postgres


Is there any reason at all people use intrinsically bug-prone and broken multithreading mode instead of fork() and IPC apart from WinAPI having no proper fork?


Yes, and some of those reasons are even listed in the article.


TLB misses? They are just a detail of particular CPU implementation, and the architectures change. Also, aren't they per core and not per process? What would that solve then to switch to MT?


> TLB misses? They are just a detail of particular CPU implementation, and the architectures change.

TLBs are "just a detail" of roughly 100% of server, desktop, and mobile CPUs.

> Also, aren't they per core and not per process? What would that solve then to switch to MT?

TLB entries are per address space. Threads share an address space, processes do not.


That sounds like really hard programming. I’m glad I write react and get paid possibly much more.


We're glad you're writing React too :)


I feel this sort of undertaking could only be done by those programmers who truly value domain knowledge above all else (money, etc). I'm more of the entrepreneureal mind so I generally only learn as much as needed to do some task (even if it's very difficult), but just seeking information as a means to an end doesn't feel fulfilling to me. Of course many people DO find that, and its upon those people's shoulders that heroic things like this rest, and I'm very thankful to them.


Please don't use mutable global state in your work. Global variables are universally bad and don't provide much of a benefit. The number of desirable architectural refactoring that I've witnessed turning into a muddy mess because of them is daunting. This is one more example of this.


Thank you for sharing your ideological views, but this is not the appropriate venue for that. If you want to have a software _engineering_ discussion about the trade offs involved in sharing global mutable state, this is a good venue for that. All engineering is trade offs. As soon as you make blanket statements that X is always bad, you’ve transitioned into the realm of ideology. Now presumably you mean to say it’s almost always bad. But that really depends on the context. It may well be almost always bad in average software projects, but PostgreSQL is not your average software project. Databases are a different realm.


Discrediting my argument by labeling it as ideology and by implying that "blanket statements are always bad" is a logical fallacy that does not touch the merits of what is discussed and I would argue that your argument instead of mine is the one that does not belong here.

If you want to contribute to the discussion, I'd be happy to be given an example of successful usage of global variables that made a project a long term success under changing requirements compared to the alternatives.


Global mutable state being a poor choice in software architecture isn’t an ideology. There is no ideology that argues it is awesome.

If you want to have a software _engineering_ discussion about the trade offs involved in sharing global mutable state, this is a good venue for that.

All engineering is trade offs. As soon as you start telling people they’re making blanket statements that X is always bad, you’ve transitioned into the realm of nitpicking.


It's awesome where performance considerations are paramount. It's awesome in databases. It's awesome in embedded software. It's awesome in operating system kernels.

The fact is sometimes it's good. Saying it's universally bad is going beyond the realm of logic and evidence and into the realm of ideology.


Can you explain how having a global variable is more performant than passing a pointer to an object as a function argument in practice?


Using globals is simpler, it's also pretty natural in event driven architectures. Passing everything via function arguments is welcome for library code, but there's little point to using it in application code. It just complicates things.


The problems it causes for Postgres are outlined in the article on LWN.


> Globals work well enough when each server process has its own set...

PostgreSQL uses a process model. So the article just states that globals work fine for PostgreSQL

> Knizhnik has already done a threads port of PostgreSQL. The global-variable problem, he said, was not that difficult.

I see no big problem based on information from person who did some porting already.


Knizhnik made these variables thread local, which is fine if you have a fixed association of threads to data. This looses some flexibility if your runtime needs to incorporate multiple sessions on one thread (for example to hide IO latency) in the future. In the end, the best solution is to associate the data that belongs to a session with the session itself, making it independent on which thread it's running on. This is described by Knizhnik as "cumbersome", which is exactly why people should have not started with global variables in the first place. (No blame, Postgres is from 1986 and times were very different back then).


You know what a database is, do you? It is the place where you store your mutable global state. You can't kick the can down the road forever, someone has to tackle the complexity of managing state.


Databases are great, especially those who do not use global variables in their implementation.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: