Odyssey: Scalable PostgreSQL Connection Pooler (github.com/yandex)
151 points by kermatt on Dec 11, 2019 | 53 comments



A quick meta note, BTW: if you want to keep up with Postgres stuff, we have a newsletter (and covered this Odyssey release earlier today): https://postgresweekly.com/issues/335 (there is RSS, etc.)


thanks petey


Thank you!


If we are able to control/modify the protocol at both ends, not just in the middle, would this still be the optimal solution?

I've been thinking for a while that connection-oriented DBMS protocols seem like a good candidate for being carried as flows over a single-socket multiplexed protocol, like HTTP/2-over-TCP is today. (I say "over TCP" because connections between app servers and databases are long-lived and stable, so there isn't really the same advantage in replacing TCP with something like QUIC there.)

Also, such a protocol, by removing the direct "connection = session" association, would remove the ability of DBMSes to key their session state by socket ID, and force them to refactor to more explicit session IDs. This would mean that there'd be an opportunity to introduce another intermediate abstraction layer during the refactoring, keeping "session" separate from "flow", such that the DBMS could have multiple concurrent MVCC transactions/worldstates as individual flows, all referencing the same ancestor MVCC worldstate (with the same config vars set, temporary objects created + populated, etc.) as an explicit stateful session that the flow would be "opened against." Which would be pretty awesome, and would also benefit non-socket-multiplexed flows (e.g. pgbouncer's per-transaction routing mode would no longer have non-SQL-conformant semantics.)


I think there's a circular argument here: connections are expensive, queries are supposed to be cheap, so anything you can pay for once at connection creation time should be done then, which then guarantees that creating connections stays expensive.

How much of that work has to be done for each session sharing one connection?

I'm a bit spoiled. The code I work on at work and the hobby project I'm designing share a common quality: they are both read-mostly. It is not going to be overly difficult for me to segregate read and write traffic and use separate connection pools, possibly separate servers. Just making that change is going to save me a considerable amount of overhead for resource management, including connections.


You should look at R2DBC then: https://r2dbc.io/

"R2DBC (Reactive Relational Database Connectivity) is an endeavor to bring a reactive programming API to SQL databases."

It already has client and driver implementations for several major DBs. Whether the databases themselves pick it up remains to be seen and will probably take years.


HTTP2 could be great, but how would you implement this with Postgres, which uses one process per connection? You'd have to do all the multiplexing in a single server process, but you'd still have to have one process per underlying session, I think, and so you get the same problem with per-connection overhead that a pool tries to solve.


One of the advantages of a multiplexed protocol is that it would reduce the need for each client to keep around a pool of N open+idle connections, just in case the client needs to abruptly make a low-latency query on a new session in parallel with its existing ones.

Part of the reason abrupt scheduling of low-latency parallel queries requires clients to keep a pool of open+idle connections is that the PG backend takes a moment to fork(2) a process to serve your connection, and nothing about multiplexing eliminates that cost; but another part is that a TCP socket takes a moment to establish, and the TCP window takes even longer to scale. (And, intriguingly, open+idle separate-TCP-socket flows scale their TCP windows down while idle, so current connection-pooling strategies aren't even getting 100% of the possible benefit.)

Decreasing the costs of flow establishment would decrease the number of open+idle flows clients keep around, which means the backend host would be free to do other things with its freed-up resources.

In combination with the fact that sessions would be "grouped" to a connection, such that long-term metrics could be captured for the behavior of a client (= one connection) as a whole, you could design an alternative "elastic scheduler" for session backends, without needing to change the forking model. E.g. keep a number of prefork idle backend processes around, associated with each connection, where the size of the prefork pool varies dynamically per connection with the observed velocity of the client's opening of new sessions on that connection. (So, for a client that does everything under one session, you'd have zero observed session-open events per unit time, and thus have zero prefork processes waiting around for that client.)
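
To make that concrete, here's a hypothetical sketch of that sizing policy in Python. Nothing like this exists in Postgres today; every name and constant is illustrative:

    import math, time
    from collections import defaultdict, deque

    FORK_LATENCY_S = 0.02   # assumed time to fork+initialize one backend
    WINDOW_S = 60.0         # observe session-open velocity over the last minute

    _opens = defaultdict(deque)   # connection id -> recent session-open timestamps

    def record_session_open(conn_id):
        now = time.monotonic()
        q = _opens[conn_id]
        q.append(now)
        while q and now - q[0] > WINDOW_S:   # forget events outside the window
            q.popleft()

    def target_prefork_count(conn_id):
        rate = len(_opens[conn_id]) / WINDOW_S   # session opens per second
        # By Little's law, about rate * FORK_LATENCY_S forks are in flight at
        # any moment; keep that many warm backends so a new session never
        # waits on fork(2). A client that opens no new sessions gets zero.
        return math.ceil(rate * FORK_LATENCY_S)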


Scalable in which sense?

Assuming that this is something like pgbouncer or pgpool that sits between the client and the database, and that you have a limit on connections to the database as well as on the number of client connections you can keep, what value does this add compared with those other (more mature, battle-tested, included in major distros) projects?


Pgbouncer is single-threaded. If you have an 80-core box, it is a huge waste to terminate your SSL in a single-threaded proxy/pooler.


Starting from version 1.12.0 you can actually run multiple pgbouncer processes listening on the same port, which eliminates some of the issues.
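
If I remember right, that's the so_reuseport setting added in 1.12.0; roughly this in each instance's pgbouncer.ini (values illustrative, and each instance still wants its own pidfile and unix socket dir):

    [pgbouncer]
    listen_addr = 127.0.0.1
    listen_port = 6432
    so_reuseport = 1     ; lets several pgbouncer processes bind the same port
    ; then start N copies: pgbouncer -d /etc/pgbouncer/instanceN.ini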


Yes, we use a cascade of PgBouncers. But it's hard to maintain, actually. Also, PgBouncer was not actively developed for years. I consider the new PgBouncer features (port reuse, SCRAM-SHA-256) an accomplishment of Odyssey to some (small) extent :)


Not sure what you mean that it wasn't developed for years.

They've had releases at least once a year (2018 was the only year with just one release).


This year they had 3, and the year is not over yet :)


Curious why this isn't a priority for the Postgres team. This isn't my area of expertise, and even I know about pgbouncer. Now there's another tool, so it seems like there's demand for it.


The "straightforward" way to deal with this is a multi-year project. So, the reason is likely expense.

The guts of the matter is that Postgres, and its extensions, rely on a number of global variables. These global variables are more akin to thread-local variables in multi-threaded programs, since there is one copy per forked backend. State that is genuinely global is carefully stored in shared memory.

The ramifications of changing this, e.g. to allow an execution state that can be cleanly suspended and resumed when scheduling, are rather large. Also, consider extensions, which to date could rely on their own global variables and on connection termination exiting the entire process.

As-is, all the poolers have to live with obscure but important abnormalities where they cannot quite properly reset the process state, and that can be quite confusing. Eliminating this behavior would be very expensive.


There is definitely demand for it; there are really three problems with connections in Postgres:

1. The initial time to establish a connection is higher than would be ideal

2. Each connection, even an idle one, has an overhead of roughly 10 MB. For idle connections this is pretty much wasted space and most application frameworks grab a pool of connections accumulating quite a few idle connections.

3. The max number of connections you can push the system to is somewhere in the hundreds to low thousands. Yes, it is possible to push PG further, but it is not trivial.

But the sad reality is that it isn't a trivial quick fix. There are a lot of foundational issues with connections that have to be fixed. There are some folks and efforts in the community working towards this, and in my opinion it is the single biggest improvement that could happen in Postgres over the next 2-3 years, so hopefully it will improve directly in core over time.


The reason is that PostgreSQL wants to concentrate on correctness and do one thing well. They will add a feature if there is a good reason for it and they find an elegant solution.

Having said that, a pooler is needed for all databases, not just PostgreSQL. Evidence for this is the recently released MySQL proxy service in AWS. Here's[1] a database-independent pool sizing guide with a demo video using Oracle.

You are aware of pgbouncer mainly because of its transaction pooling feature, where each transaction is treated independently. It allows you to have thousands of connections open, at the cost of not being able to use any session-specific features of PostgreSQL. This mode is really good for web applications and can work well even if your application-side pool is misconfigured.
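
For reference, that mode is a one-line setting in pgbouncer.ini; a minimal sketch (database names and sizes are illustrative):

    [databases]
    mydb = host=127.0.0.1 port=5432 dbname=mydb

    [pgbouncer]
    listen_port = 6432
    pool_mode = transaction    ; server conn returns to the pool after each xact
    default_pool_size = 20     ; real server connections per user+database pair
    max_client_conn = 5000     ; thousands of clients multiplex onto those 20

The trade-off is as described: SET, prepared statements, advisory locks and other session state don't survive across transactions.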

[1] https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-...


> Having said that, a pooler is needed for all databases, not just PostgreSQL. Evidence for this is the recently released MySQL proxy service in AWS.

That's a bit apples and oranges though. The main use-case for that AWS MySQL proxy is high-concurrency Lambda/serverless applications, which conceptually have no other possible means of persistent connections or application-side connection pooling.

MySQL's thread-per-conn model tends to scale well up to several thousand concurrent connections, which means a proxy is only literally essential for more niche situations -- for example, large-scale monolithic environments that need an insane connection count; large-scale sharded environments where it's impractical for each app server to maintain a conn pool to each shard; cross-region SSL connections where a reconnect-each-request model introduces too much latency.

That said, a really solid multi-purpose proxy such as ProxySQL provides other powerful benefits, such as query routing, query rewriting, stale replica read avoidance (read-after-write consistency), etc.


PostgreSQL can also handle thousands of connections (if you have enough RAM); the problem is that it won't be as performant, due to how concurrency works.

You get increased performance up to a certain point (typically 2 * the number of physical cores, plus more if the database waits on disk); after that, performance starts degrading due to overhead. That's how time slicing works. Thread-per-conn can make the overhead smaller, but it is still there. Aurora PostgreSQL in AWS has a threaded connection pool, but you still get better performance if you place pgbouncer in front of it and reduce the number of connections.
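
As a rough worked example of that rule of thumb (the constants are assumptions; benchmark for real numbers):

    # HikariCP's guideline: pool_size = (2 * physical_cores) + effective_spindles
    cores = 16      # physical cores, not hyperthreads (assumed)
    spindles = 1    # extra headroom for time the DB spends waiting on disk
    pool_size = 2 * cores + spindles
    print(pool_size)   # 33 -- far below "thousands of connections"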

HikariCP[1] talks about it and is database independent.

[1] https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-...

Edit: looks like the video in the article is no longer there, but found one on youtube: https://www.youtube.com/watch?v=xNDnVOCdvQ0


> if you have enough RAM

That's the key difference here, combined with the observation that for user-facing web applications, a decent portion of direct DB connections (i.e. without a multiplexing proxy or conn pool) tend to be idle at any given time.

Assuming good configuration of session-level buffer sizes, MySQL can happily have several thousand idle connections at any given time, with no detrimental performance penalty and without requiring a massive amount of RAM. It's been a while since I've examined Postgres under similar conditions, but previously this certainly was not the case there.


There’s Postgres Pro Enterprise, a commercial fork that has a built-in connection pooler.

Back in May it seemed to outperform Odyssey, according to a benchmark by one of the Postgres Pro folks: https://github.com/postgrespro/postgresql.builtin_pool/wiki/...


Aurora PostgreSQL also has a connection pooler, but it will still lose performance when you open too many connections. No matter what is used, you still need to limit the number of open connections to the database, so a pooler like pgbouncer in transaction mode, which can drastically reduce them, might still be needed.


Mostly because it's a lot of work due to Postgres' process-per-connection architecture, and since these projects already exist, dev time is better spent elsewhere for now.

PG generally depends on 3rd-party tools to complete and polish the experience for those that need it.


> Odyssey has sophisticated asynchronous multi-threaded architecture which is driven by custom made coroutine engine: machinarium

Let me find my space helmet. The voice in my head is crying 'architectural astronaut'.

We are forgetting again to ask the important question of "what else did you try, and why didn't it work so you had to write your own?" I think I get the 'why' for pooling. I know a guy at work who struggles a bit with query rate (query speed is great... until it hits the pool limits). I don't know the 'why' for the technology it's built on top of. Lacking that, I see someone wanting to write machinarium and then building something on top of it.

Which doesn't sound like a bad thing at all, until you find a feature the 'thing on top of it' gets wrong, in a way that is consistent with what the primary authors are most invested in. You can't have that feature because machinarium doesn't want you to have it. Close ticket.

Form always triumphs over function when the architect is invested in their solution and not the user. Making a library that embodies that solution, at least in my experience, vastly amplifies that investment. That also telegraphs the investment, which if you know to look for can tell you 'buyer beware'. Buyer beware.

Machinarium sounds a lot like libuv: https://github.com/libuv/libuv/wiki/Projects-that-use-libuv

Have I got that wrong?


Disclaimer: I worked at Yandex in 2006-2007.

Yandex is a very BIG company in terms of users, requests, data stored, etc. It's surely bigger than Twitter; I believe it is bigger than Netflix. It is also an algorithms company (as opposed to a content company). So when Yandex builds something, it's mostly because all other options failed under their load.

Yandex also has very extensive expertise in C/C++. There is ClickHouse, there is CatBoost. I would trust them in their domain.


I respect Yandex a lot. They are fantastic at doing 80% of what Google, Uber, Amazon (combined) do with like 10% of resources - people and otherwise. So their software tends to be super efficient.


They are fantastic at losing their best talent to their western counterparts. Nothing fantastic at all for a big company in other respects.


It's not Yandex's fault; it is the political situation in Russia. IT skills are the best ticket out for honest people today. Can't blame them.


Your claim is wrong in the "best talent" part. Though Yandex does, from time to time, lose new grads. Sad but true.


Oh well, please elaborate on their scale. Bigger than Twitter, bigger than Netflix. Just give us some proof of that.


We were using libevent and had to remove an extra abstraction layer. Machinarium is there for two reasons: 1. To express the network protocol state machine as normal control flow, just to make the code simpler. 2. To reduce contention in multithreaded operations. In Machinarium you have coroutines which yield context not to the OS scheduler but to other coroutines. You can organize local data around a worker (a thread that owns coroutines) and avoid false cache line sharing.
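
As a loose analogy in Python (Machinarium itself is C; this only illustrates "coroutines yield within one worker, and data stays worker-local"):

    import asyncio, threading

    def worker(name, jobs):
        # Each worker thread owns its own event loop and its own local state,
        # so its coroutines never contend with other threads for that data.
        local_stats = {"handled": 0}     # worker-local: no locks needed

        async def handle(job):
            # Awaiting yields control to another coroutine on this same
            # worker, not to the OS scheduler.
            await asyncio.sleep(0)       # stand-in for network I/O
            local_stats["handled"] += 1

        async def main():
            await asyncio.gather(*(handle(j) for j in jobs))
            print(name, local_stats)

        asyncio.run(main())

    threads = [threading.Thread(target=worker, args=(f"w{i}", range(100)))
               for i in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()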


For scaling (LB) reads + HA, HAProxy + built-in replication to hot standbys is much simpler. You can use keepalived to promote a new master automatically if it dies. HAProxy can also give you two different ports on the vIP, one for read-only and one for read-write, that automatically find which server is the master and which servers are replicas, using periodic banner testing. It's free, doesn't depend on someone coding a Postgres statement proxy and getting every corner case just right, and has lower latency. I suggest using LVS too so the vIP can move around if a box dies.
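
A rough sketch of the two-port haproxy.cfg shape (addresses and the check user are made up; note that pgsql-check only verifies the server answers the Postgres protocol, so real master detection usually needs an external check script on top of this skeleton):

    listen postgres_write
        bind 10.0.0.100:5432
        mode tcp
        option pgsql-check user haproxy_check
        server pg1 10.0.0.1:5432 check
        server pg2 10.0.0.2:5432 check backup

    listen postgres_read
        bind 10.0.0.100:5433
        mode tcp
        balance roundrobin
        option pgsql-check user haproxy_check
        server pg1 10.0.0.1:5432 check
        server pg2 10.0.0.2:5432 check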

For scaling (LB) read&writes + HA, consider postgres-xl instead. It's FOSS w/ optional commercial support.

If you have crazy OLAP or metrics data, consider Greenplum. Freemium FOSS.


Is this an alternative to PGBouncer, and if so, how do they compare?


This would be an alternative to pgbouncer; this GitHub issue has a rundown of some of the differences: https://github.com/yandex/odyssey/issues/3


Are there benchmarks comparing this to Hikari? What is the benefit to using this over Hikari?


This is an external pool, rather than an internal one. That is to say that it runs as a separate process that re-implements the Postgres network protocol to accept incoming connections. The use-case is that it allows you to share one pool across multiple application processes. Depending on your architecture, it can actually make sense to use both an internal pool to reduce connection startup latency and an external pool to reduce your total number of connections.
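
For example, assuming the external pooler listens on 6432, the internal pool in a Python app might look like this (a sketch; host, names and credentials are made up):

    import psycopg2.pool

    # Internal pool: avoids per-query connection setup inside this process.
    # It connects to the external pooler on 6432, not to Postgres on 5432,
    # so all app processes still share the pooler's server-side connections.
    pool = psycopg2.pool.SimpleConnectionPool(
        minconn=2, maxconn=10,
        host="127.0.0.1", port=6432,
        dbname="mydb", user="app", password="secret",
    )

    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            print(cur.fetchone())
    finally:
        pool.putconn(conn)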


Hikari is a Java JDBC connection pooling library, meant to be used inside your apps. Most other languages/frameworks also have client drivers that implement connection pooling today, but that only affects that specific program.

Odyssey / PGBouncer / PGpool are all external programs that implement connection pooling so multiple programs and servers can pool connections to a single database.


Looks cool. Maybe I'm missing something, but it's not really clear how you'd set this up. Is there a quick start or some example configs for different situations?


Yeah there’s detailed info on the contents of the config file, but not where to put said file, or how to tell the binary where to find and reference it. This looks like an exciting project that needs more externally-facing docs. A lot of knowledge is assumed at the moment.


I think it's a connection pooling proxy for postgres, and you talk to it using the regular postgres client protocol?

https://github.com/yandex/odyssey/blob/master/test/odyssey/t...
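
Going by the examples in the repo, the config shape is roughly this, and you pass the config path as the first argument to the binary (treat the exact directives as illustrative and check the repo's sample odyssey.conf):

    listen {
        host "*"
        port 6432
    }

    storage "postgres_server" {
        type "remote"
        host "127.0.0.1"
        port 5432
    }

    database "mydb" {
        user "app" {
            authentication "none"
            storage "postgres_server"
            pool "transaction"
        }
    }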


How does this compare to pgBouncer?


This is definitely my top question coming in here. I know what pgbouncer can do for me, and it's pretty good, is there a good reason to look at this instead?


It's a multithreaded PgBouncer.


pgbouncer is event-based for a reason; they recently added an option to run multiple processes listening on the same socket, which allows it to utilize all available cores.


This is cool. And we can expect them to add more and more features if Odyssey gains traction. It's about competition. Finally, it should create enough pressure to make a built-in pooler happen.


I also would like to know this.


This looks very cool!


Just curious if anyone knows how it compares to using Apache Drill as an external scalable process in front of a Postgres JDBC store?


Can it do load balancing and failover?


No fail-over supported so far.



