Fixing Database Connections in Django

zzzeek · on March 8, 2013

Django-PostgresPool? What, I don't even...

https://github.com/kennethreitz/django-postgrespool/blob/mas...

    from sqlalchemy import event
    from sqlalchemy.pool import manage, QueuePool

Oh, yes now that makes sense ! :)

craigkerstiens · on March 8, 2013

If it works why not just reuse it :)

ldng · on March 8, 2013

The third solution, django-db-pool, doesn't look like it's using SA (at quick glance). Even has a postgis driver.

jeremyw · on March 8, 2013

Can anyone recommend a Django connection pool for MySQL that -- and this is the important part -- closes local pool connections to the database on a routine basis.

Reason: load balancers and machine maintenance. In per-connection mode (i.e. no connection pooling), operations folk can take machines out for service quickly: put the box in MAINT (ala HAProxy) or equivalent, wait a few moments, see the connections drain away to zero.

But I haven't been able to find connection pooling logic that recycles backend sockets on a short basis. For instance, 60s is long enough to amortize the cost of connection setup, but short enough to make your operations people very happy.

zzzeek · on March 8, 2013

seems like I'll have to reply to every post on this thread but sqlalchemy's pool does this, either optimistically or pessimistically, see http://docs.sqlalchemy.org/en/rel_0_8/core/pooling.html#deal... (in particular pool_recycle).

jeremyw · on March 8, 2013

Yes, SQLAlchemy is excellent in this respect, but without getting into a Django ORM vs SQLAlchemy debate, it is the less-preferred option among Django developers.

zzzeek · on March 8, 2013

the pool can be used in isolation of the rest of SQLAlchemy, which is what Kenneth's Django-Pool library is doing.

est · on March 9, 2013

I think Django db connection are doing it plain wrong. Connection are established and gone with a request, but in fact we can simply use a global variable for any SQL/NoSQL connection pool in views.py, which is basically a WSGI global variable.

In this case, connections are established when a gunicorn/uWSGI worker process starts, and shutdown when the worker is respawned.

jeremyw · on March 9, 2013

Right, the problem is all the failure modes. You want a battle-tested library that hands you a valid connection, that will work to the 99th percentile. Lots of edge cases.

est · on March 9, 2013

Django's connection is a battle-tested failure.

jeremyw · on March 9, 2013

Connection-per-request is battle-tested. Other scenarios, what we're discussing, are not.

rosser · on March 8, 2013

If you're using a connection pooler with Django — or really, anything SQL database related in Python — please be mindful that the Python DB API is implicitly transactional. That is, if you say:

  SELECT COUNT(*) FROM users;

DBAPI says:

  BEGIN;
  SELECT COUNT(*) FROM users;

...and leaves the COMMIT or ROLLBACK of that transaction up to you.

Aside from the severe risk of resource leaks, that has implications for which pooling modes you can use. Particularly, you can't use statement-level pooling, and unless you're scrupulous about managing those implicitly created transactions, transaction-level pooling is pretty much the same thing as connection-level pooling.

Adjust your expectations accordingly, or turn on auto-commit.

(EDIT: clarification.)

zzzeek · on March 8, 2013

the SQLAlchemy connection pool (which is what this is using) calls connection.rollback() when the connection is checked back in. as far as "leaking" transactional state it's not different from "autocommit", which is what Postgresql connections normally do (and what psycopg2 will do if you set `connection.autocommit=True`).

rosser · on March 8, 2013

How is a connection "checked back in"?

zzzeek · on March 8, 2013

the pool is transparent around acquisition and release:

    conn = pool.connect()
    conn.close()

the above "close()" is a release/checkin.

daGrevis · on March 9, 2013

What are the odds. Friday my task was to implement this for our project that uses Django. I learned it all the hard way, even the South trick. The only difference was that we are using MySQL and _django-mysqlpool_.

If almost all solution are using SQLAlchemy components and SQLAlchemy is NOT DBMS-specific, it should be possible to create one package that works for SQLite3, MySQL and Postgres at the same time. In theory, of course. :)

bjourne · on March 9, 2013

This old thread about it is interesting: https://groups.google.com/forum/?fromgroups=#!topic/django-u...

I really hope Django gets support for persistent connections and/or connection pooling. pgbouncer isn't always an option on shared hosting (and is a royal PITA to setup on Windows) and there are other databases than postgres.

craigkerstiens · on March 9, 2013

As per the bottom of the post persistent connections have already been committed for Django 1.6 - https://github.com/django/django/commit/2ee21d

jaytaylor · on March 9, 2013

Here @ SendHub we've been using django-db-pool[0] since Q3 of 2012 in production and it's been pretty smooth sailing.

NB: I highly recommend verifying via postgres logs that the module is doing what you expect, because it may not be immediately clear whether the driver has bee activated or not.

[0] https://github.com/gmcguire/django-db-pool

hcarvalhoalves · on March 8, 2013

The diagnostic is right, but the remedy given in the article seems worse than the disease.

Using backends to make Django pool the connections is sub-optimal. You'll have a different pool per worker process, and they won't persist when you restart your workers - right when you need low latency the most. Use pgbouncer/pgpool instead. On top of that, these poolers have some nice features.

zzzeek · on March 8, 2013

if all worker processes have an evenly distributed workload, how is pool-per-worker process a disadvantage? it's still a fixed-size pool, just one that's an aggregate of several smaller pools.

hcarvalhoalves · on March 8, 2013

Appended to my comment:

> You'll have a different pool per worker process, and they won't persist when you restart your workers - right when you need low latency the most

zzzeek · on March 8, 2013

So, pooling that has TCP/DB initialize connect overhead during a restart only is "worse than" having TCP/DB initialize connect overhead on restart as well as every subsequent request ?

There's pros/cons to in-app pooling vs. middleware pooling, but calling in-app pooling "worse" than no pooling at all is kind of misleading.

If I were using PGBouncer, I'd probably leave SQLA's pooling on anyway, so at least you have no TCP latency at all on connect. unless you're dealing with apps that go unevenly idle/active.

hcarvalhoalves · on March 8, 2013

> but calling in-app pooling "worse" than no pooling

You don't need to get all worked up. Nowhere I affirm it's worse than no polling.

I'm talking about not reinventing the wheel, specially since pooling inside the worker is inferior than just setting up pgpool.

zzzeek · on March 8, 2013

there's a lot of cases where just running an app and having it talk to the database in a halfway-efficient way is a lot more convenient than having to install and start a whole separate service.

As well as, in-app pooling is the easiest way to limit the max simultaneous connection checkouts per application. If PGBouncer were to handle this, I'd assume you'd need to either send distinct usernames or some other instance-specific token over so that it could identify sources of connections.

jtchang · on March 8, 2013

I love sqlalchemy!

ltbarcly3 · on March 9, 2013

In application connection pools are a BadThing™.

Why is pooling connections in your application a bad thing you ask?

1. It keeps open N connections to the database for each one of your M server processes. This is especially bad in Postgresql, as that means you have N * M idle processes on the database server.

2. It is very easy to improperly reset a database connection before giving it back out of the pool. This leads to two problems.

problem 1: The first is that if code using a db connection has an error and doesn't issue a rollback, the connection will still be in a bad state when it is handed back out of the pool again. On a webserver, this means that you will see strange exceptions where a query failed, and is very hard to debug, because the actual error was on the previous request to use that connection.

problem 2: It is inevitable that somewhere in your code you will not issue a rollback on a connection after starting a transaction. In fact, if you are using Python the default dp-api behavior is to "transactional mode" where a BEGIN is sent immediately before any query on a new connection. It will automatically start the transaction, but it will not COMMIT or ROLLBACK automatically. The connection will be returned to the pool with a transaction started, and the transaction will be kept open, potentially for days or weeks. This can cause an incredible number of headaches, performance problems, and cause your database's on disk size to grow much more quickly.

3. If you use a server like gunicorn, or a library like multiprocessing, open file handles (including sockets) are shared upon fork(). This means that if you aren't very careful to completely remove and recreate all the connections in your connection pool after a fork, you will have multiple processes sharing the same sockets. The processes will then step all over each other and corrupt each others use of the database.

I have seen all of these problems multiple times in production code.

So, what should you do?

If you are using postgres, install PGBouncer on every server. Configure your application to make a new connection to PGBouncer whenever it needs a new connection. The overhead of this is tiny, since you connect to PGBouncer through a filesystem unix socket. PGBouncer keeps open as many connections as it needs to the database server, maintains those connections, and puts them in a good state before giving them to you. PGBouncer will also gradually reduce it's connections to the database server if they are idle and unused long enough.

I'm sure other databases have some similar database proxy.

Taking this advice will give you better performance, far fewer bugs, and hold many less connections to the database server. Additionally, all your database uses can benefit from connection pooling in this way, even shell scripts which call psql.

In application pools of database connections is something that will bite you over and over, just say no.

alexpopescu · on March 9, 2013

"I'm not a Django person"

1. Using PGBouncer seems to be the ideal solution, but I'm not sure other databases actually provide a proxy. I haven't heard of something like this for MySQL (a) I might be wrong, please provide a link, b) I'm know some will say "just don't use MySQL")

This aside, I would expect a connection pool implementation to support at least the following:

1. min: minimum number of connections

2. max: maximum number of connections

3. idleTime: time interval after which idle connections are automatically closed

4. connection cleanup (most often done when returned to the pool). Basically making sure that the connection object is pristine

If it doesn't do all these, then it's probably not something you'd want to use.

zzzeek · on March 9, 2013

Sorry, the points you make here are all very much like the FUD I read about ORMs - all based on experiences with inferior software, not an approach.

> 1. It keeps open N connections to the database for each one of your M server processes. This is especially bad in Postgresql, as that means you have N * M idle processes on the database server.

Suppose you use PGBouncer, and you configure it to allow 50 connections. Or, you configure your web servers to allow 10 concurrent processes each with 5 connections. Total number of connections is...the same !

> 2. It is very easy to improperly reset a database connection before giving it back out of the pool. This leads to two problems.

It is actually not possible at all when using a pool like that of SQLAlchemy's, as any connection that gets returned to the pool is properly reset by the pool itself.

> problem 1: The first is that if code using a db connection has an error and doesn't issue a rollback, the connection will still be in a bad state when it is handed back out of the pool again.

The pool emits the ROLLBACK whether or not the application did so. Problem solved.

> problem 2: It is inevitable that somewhere in your code you will not issue a rollback on a connection after starting a transaction.

It would be crazy to leave connections idle in the pool with transactional state on them. No connection idle in the pool has transactional state, as they do not get returned without a rollback. If a catastrophic error occurred on the connection, it's tossed out of the pool. This is all detected at a layer of abstraction below what the application sees, so the programmer has no chance of messing it up.

> In fact, if you are using Python the default dp-api behavior is to "transactional mode" where a BEGIN is sent immediately before any query on a new connection. It will automatically start the transaction, but it will not COMMIT or ROLLBACK automatically.

this is all the same non-issue.

> If you use a server like gunicorn, or a library like multiprocessing, open file handles (including sockets) are shared upon fork(). This means that if you aren't very careful to completely remove and recreate all the connections in your connection pool after a fork, you will have multiple processes sharing the same sockets. The processes will then step all over each other and corrupt each others use of the database.

right, that's why multiprocessing always provides hooks for an init function that runs right as the new child process starts. In this function, you basically create a new pool (or reset the one you have). I'd disagree with the characterization that one must be "very careful to completely remove and recreate all the connections", this makes it sound like some dramatic programming effort; it's not. It's called pool.dispose(), it's designed for this case, and it does it in one step.

SQLAlchemy's pool is designed such that it doesn't pre-allocate connections, so if your subprocess only uses one connection, the pool will only do so as well.

It's true that in this scenario, you're losing the usual benefits of a pool if your subprocess lives for only the length of a single transaction, but it doesn't make things worse, as long as you remember to start up the new pool across process boundaries.

> In application pools of database connections is something that will bite you over and over

its obvious you've used some very poorly written pooling solutions, but the good news is that these problems have been solved. PGBouncer is great but isn't strictly necessary, and also can be combined with in-app pools. Only an in-app pool gives you the means to limit the connections used by an individual process (as I mentioned elsewhere, if PGBouncer allows this, it would have to be through a more complex and inconvenient system of configuring individual servers to connect differently so that they can be differentiated). SQLAlchemy's pool is deployed in thousands of production environments including some big ones like Reddit, Yelp and Dropbox. There is no issue.

ltbarcly3 · on March 9, 2013

EDIT: I had written a really long and specific reply, full of bombast and insults.

EDIT2: Hey zzzeek, sorry you saw that before I edited it. I realized my mistake about the 2012 thing, and then when I re-read what I had written I realized it was over the top.

PGBouncer has all the features that any in application connection pool has, and it doesn't have any of the pitfalls. It's easier to set up, and once set up gives you lots of benefits that an in-application pool cannot give you.

Your example stating that the total number of connections is the same is just incorrect. PGBouncer shares connections between processes, and will release connections beyond it's minimum gradually if they aren't used within a few minutes. SQLAlchemy's connection pool doesn't release connections below it's minimum, ever. If there is a load spike, SQLAlchemy's pool will either become a bottleneck, or will create more connections as needed, which are immediately destroyed and recreated if the traffic spike isn't completely uniform (say if it goes up and down even a little).

If I said to you, you have two choices. They both take about the same amount of effort to set up, except one performs far better than the other, consumes less resources, and gives you lots of other benefits, why would you choose the lesser option?

zzzeek · on March 9, 2013

Edit: apparently as I finished typing this, the parent post edited itself out...though there's some decent info about how to reset sessions here. I'll take the more heated quotes/responses out.

> However, lets say you use the SQLAlchemy pool's overflow setting. In that case, new connections will be made as needed, except they will be discarded as they go back to the pool, if the pool has 10 back in it already. Since connections aren't shared between processes, this means many, many new connections will be made and discarded, adding a sizable amount of extra load to a database server already heavily taxed by the high level of traffic.

That's the key, that the app is going to be using N number of connections constantly, and the "max overflow" is used for spikes. Nothing is stopping you from combining the in app pools with PGBouncer if you're optimizing for this particular case. There's nothing wrong with PGBouncer. But there's also nothing wrong with in-app pools.

> Nothing, the PGBouncer solution is superior in every respect.

the two approaches are not mutually exclusive.

>> The pool emits the ROLLBACK whether or not the application did so. Problem solved.

> 2. Really? When did you add this feature?

in 2007. Were you using SQLAlchemy in version 0.2 or 0.3 or something? That would be pretty unfortunate, it was more or less alpha software at that time.

> I see it, it is called reset_on_return, and was added to SQLAlchemy in version 0.7.6, which was released in 2012.

You are incorrect. reset_on_return is a setting that allows the behavior to be configurable - before that, it was on in all cases. Users of MySQL with MyISAM wanted to turn it off as they weren't using transactions anyway and it was taking up time they didn't want to spend.

> Also, ROLLBACK is completely insufficient here, as things like temporary tables, query planner variable changes, and many other very common things persist on the SESSION level.

Those are unusual and exotic cases, which if one were using (haven't seen too many web apps that create temporary tables) would have to be planned for. If someone really wants to make session level changes, we have an event system by which those changes can be reset when the connection's state is finished. I've gone and checked PGBouncer's documentation - they deal with it in exactly the same way, which is that you need to set up your own query using "server_reset_query".

> Additionally, SQLAlchemy doesn't check that the database connection is valid before letting it out of the pool, just when it is returned.

by default this would waste a lot of time. But if you want to do that, here is the documentation to enable that feature: http://docs.sqlalchemy.org/en/rel_0_8/core/pooling.html#disc...

> If the database is restarted, SQLAlchemy will gladly hand you database 'connections' where the TCP connection has been closed.

if you're not using pessimistic handling (which most people don't), you will get at most one of these. the entire pool is flushed when this condition is detected.

> Right. Or you could just use something like PGBouncer, and not have to do this. Also, PGBouncer will have it's connections pooled and ready to go across application restarts, forks, whatever.

so use PGBouncer as well. Nothing is wrong with PGBouncer, it's simply not universally necessary.

> That is something that has never come up for me, and honestly I think it is YAGNI.

honestly that's what I think about your entire first paragraph, the vast majority of apps don't really need that level of detailed tuning. The app servers all hold onto a handful of connections, make some new ones as needed, and it works just fine. But I won't say that your scenario doesn't have merit, partially because it's too late at night for me to parse all that and it's obvious that very high performance scenarios can benefit from having more automation at more levels, nobody is saying PGBouncer is a bad product. I've just never had to bother installing/starting/configuring a whole new daemon just to have a medium size web application run adequately.

> Well, there are a bunch of issues: worse performance, many more database connections that just sit idle (doesn't matter with mysql, matters a lot with postgres),

If the applications are idle, then you'd have idle connections, but I haven't experienced that being an issue - the database is idle too ! If you have some servers spiking and other ones idling, then its an issue, so use PGBouncer for that case. But that's not typical in my experience.

ltbarcly3 · on March 9, 2013

PGBouncer uses "DISCARD ALL" as it's default method to recycle connections, which has all the effects of creating a new connection. Of course it is configurable. I would strongly advise you to do this as well.

As for idle connections, Postgresql uses a process-per-connection model, and therefore idle connections use a fair amount of resources, and having too many connections can have other bad consequences for performance. (Too many is generally considered to be more than 100.)

I don't see any reason to use an in-application connection pool in addition to pgbouncer. However, if I were configuring SQLAlchemy, I would just use the NullPool.

EDIT: HN won't let me reply to your reply to this comment, but here is how I would make DISCARD ALL the default behavior.

Add a method reset_connection() to the connection proxy base class, with the default implementation just being self.rollback()

Then, on the dialect specific postgresql one, make the implementation do something better, like DISCARD ALL (which isn't supported on postgres older than 8.2 btw).

zzzeek · on March 9, 2013

> PGBouncer uses "DISCARD ALL" as it's default method to recycle connections, which has all the effects of creating a new connection. Of course it is configurable. I would strongly advise you to do this as well.

yes I'm glad we had this discussion so that I've learned PG has this handy command. We can't do it by default because it's a PG-only behavior and I don't have an analogue for all the other databases, but we can add it to the Postgresql dialect documentation as a recommended addition for apps that are modifying session state.

> As for idle connections, Postgresql uses a process-per-connection model, and therefore idle connections use a fair amount of resources, and having too many connections can have other bad consequences for performance. (Too many is generally considered to be more than 100.)

If I'm going over 100 connections for a single PG database, which is not something that occurs with the typical medium-size web applications I work with (they are more like in the 20 connection range), then I'd use PGBouncer. But in-app pooling gives me something that works immediately in any situation and on any database equally well.

> I don't see any reason to use an in-application connection pool in addition to pgbouncer. However, if I were configuring SQLAlchemy, I would just use the NullPool.

Probably not, though I'd want to see what kind of overhead psycopg2 has starting up connections and such, I think it's not much. But this is not the case for other DBAPIs and databases that don't have a product like PGBouncer (like cx_oracle, which is dog slow on connect, ODBC over unix, etc.)

pjenvey · on March 8, 2013

Use SQLAlchemy.

zzzeek · on March 8, 2013

read the source, it is !