Ban 1+N in Django (suor.github.io)
175 points by Suor on March 26, 2023 | 148 comments



There is a case where having N+1 queries is beneficial.

In Rails terms, it's when you perform Russian doll caching, but you can do this in any framework. The idea is that you can cache a specific X thing which might make a query to an associated Y thing. A textbook N+1 query case (i.e. a list of posts (X) that each fetch the author's name (Y)).

If you render the view without any cache with 10 things then you'd perform 20 queries but after the cache is warm you'd perform 0 queries. If item 5's Y gets updated then you only need to bust the cache for item 5 and query only item 5's Y association. Performing a preloaded query to get all X 10 things with their Y associated things could be an expensive query.


You'd ideally want to do something like dataloader, where you look up your N Xs in a single cache query, and then do a single database lookup for the (N-C) Xs that weren't in cache. You can then either eagerly load the Ys with the Xs like you said, or do a secondary cache lookup for every Y, and potentially another single database query for the Ys not in cache.

Unfortunately this pattern gets really hairy if you're not using promises and an event loop.

https://www.npmjs.com/package/dataloader
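A rough sketch of that batched flow (the `cache` and `db` dicts are stand-ins for a real cache and database):

```python
# Dataloader-style batching: one multi-get against the cache, then one
# batched DB query for the (N - C) misses. All names are stand-ins.
cache = {}
db = {i: f"thing-{i}" for i in range(100)}  # pretend table keyed by id
db_queries = []  # records each batched "SELECT ... WHERE id IN (...)"

def load_many(ids):
    # 1) single cache multi-get for all N ids
    found = {i: cache[i] for i in ids if i in cache}
    missing = [i for i in ids if i not in found]
    if missing:
        # 2) single batched DB query for the cache misses
        db_queries.append(missing)
        rows = {i: db[i] for i in missing}
        cache.update(rows)
        found.update(rows)
    return [found[i] for i in ids]
```

Overlapping calls only hit the database for ids not already cached.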


If you have multiple ways to 'see' the same X from multiple Y objects, then all of this gets complicated quickly.

Once you're there a microservice has some advantages. Wrap a cache with a service, implement multi-get, anything not in the cache calls through to the database.


+1. The JS event loop auto-monad-izing Promises into Haxl [1]-esque trees of implicitly-batched loads has been a big win for us building on JavaScript/TypeScript.

If I had to move to another language, I'd really want to find a "powered by the event loop / dataloader" framework, i.e. Vert.x for Java.

Also, per dataloader, a shameless plug for our ORM that has dataloader de-N+1-ing built natively into all object graph traversals:

https://joist-orm.io/docs/goals/avoiding-n-plus-1s

[1]: https://github.com/facebook/Haxl


> The JS event loop auto-monad-izing Promises into Haxl [1]-esque trees of implicitly-batched loads

can you explain what this means?


So, in other words, you want GraphQL.


A dataloader sort of pattern tends to be on the implementing end of GraphQL.


Indeed, the parent was describing built-in features of the Apollo GraphQL Client, such as entity-level caching in the frontend.

However, a dataloader pattern can also be useful when implementing custom resolvers for a GraphQL API. For this purpose, Apollo provides data sources which handle caching at request level.


Now you have like three problems instead of one - N+1 queries in the cold-cache case are slow, cache invalidation when something changes, and much more overall complexity...


> cache invalidation when something changes, and much more overall complexity

With Rails this style of caching is built into the framework, so it's not too bad. It can be applied to other frameworks too, because most of the "magic" is how the cache key is composed. Lots of frameworks have the bits and pieces to pull it off.

Here's a snippet from the Rails docs on caching[0]:

    <% @products.each do |product| %>
      <% cache product do %>
        <%= render product %>
      <% end %>
    <% end %>
That will automatically produce cache keys that look like this:

    views/products/index:bea67108094918eeba42cd4a6e786901/products/1
The key is scoped to the view responsible for rendering the snippet of HTML and each product. If any attribute of that product changes that would result in different HTML being produced then the cache will be busted automatically by Rails because that hash in the middle of the cache key will be different. You can also "help" Rails when it comes to associations by adding a `touch: true` argument to your relationship once at the model level so Rails knows to bust the cache if the association's attributes change.

Like everything with caching, it's not 100% bulletproof and perfect, but if you know the rules of how the framework works you can solve 95% of your caching needs without much headache. For the other 5% you have options, but you can keep them on a need-to-know basis and look them up when you encounter issues (you may never need them in the end).

[0]: https://guides.rubyonrails.org/caching_with_rails.html#fragm...


cacheops can do the same, but it never came to my mind to actually go this way. I would rather write some special code to fetch with `id__in=[...]` and then cache individually.


Depending on cache timing you can get N/2 copies of old data and N/2 copies of new data. And then any recursive calls that also need cached values grab the new data, but now half of your request is using all new data and half is using half-old data, leading to confusing results.


These are just the problems with any cache. And still, caches are quite useful.


Also, off-by-one errors.


GraphQL can handle all of this in an elegant manner.


Could you be more specific? GraphQL itself as a query language does not handle much. For a given query, a GraphQL back-end normally needs to do some non-trivial work with the database. Doing that efficiently is hard. It is easy to encounter all the typical problems here. I don't see how GraphQL particularly helps.


That's great if you can fit a lot of your database in your server's memory, but seems like a terrible headache once you get a decent number of users.

Personally, I'd much rather have sane queries in the first place, but rails isn't really my cup of tea either, so take my opinion with a large pinch of salt if you do.


> That's great if you can fit a lot of your database in your server's memory, but seems like a terrible headache once you get a decent number of users.

Once you have a few million users, then you can think about a better solution.

You can fit a lot into server memory, and spilling out of RAM to NVMe isn't that bad either.


> That's great if you can fit a lot of your database in your server's memory, but seems like a terrible headache once you get a decent number of users.

You'd surely care about getting a significant chunk of your usage in server memory rather than what percentage of total data that is, no?

To take the site we're on as an example, I'd be willing to bet the 30 things on the front page have one or two orders of magnitude more traffic than anything else (and probably a few more orders of magnitude more than the median post).


That seems much more specialized than what I'd imagined based on the prior description.

In this example, would rails only cache models that fit certain query parameters? Or is it a configurable LRU? How does the in-memory cache work when you have multiple puma workers? Or does this mechanism rely on something more esoteric? Given that this technique is part of solving the performance problems of N+1, I'm assuming things like votes and comments are included, and the high degree of write volume would imply that all of the caches need to stay up to date- at least with a fairly high degree of consistency.


The downside here is a potential thundering herd issue if you’re forced to clear the cache.


a cache with a grace period ("serve the old record while the new one is updating") is a good solution here


That doesn’t help on process/container/VM restarts.


Right. Or just let the DB do the caching. If you're not making a million-user app or have hundreds of gigabytes to query, it will be faster than anything done in Ruby code anyway.

> If you render the view without any cache with 10 things then you'd perform 20 queries but after the cache is warm you'd perform 0 queries. If item 5's Y gets updated then you only need to bust the cache for item 5 and query only item 5's Y association. Performing a preloaded query to get all X 10 things with their Y associated things could be an expensive query.

So instead of "do one query to the database, zero if cache is hot, one query again if you invalidate it", you do 20 queries, zero if cache is hot (but 20 cache queries), then 1 if you invalidate it.

That's a nice method when a single-entry query is very expensive, but in most cases it will be living in the DB cache already.

It is only a win if the DB is the bottleneck, which is by far the rarer case nowadays than RTT. Also you could just query all 20 records at once but write them to the cache as separate records, then have write-through logic for updating records. Best of both at the cost of some complexity.

You can also just... cache the resulting page instead of data-level caching, and not engage app code in most requests in the first place. Partial caching with ESI is also possible although pretty involved.


> zero if cache is hot (but 20 cache queries)

Rails lets you fetch_multi which means it would be 1 cache query if you used this feature.

> It is only a win if DB is the bottleneck

View rendering is pretty slow in Rails; rendering a view ends up being a real bottleneck. That's partly why caching at the view layer is beneficial.

If Rails were able to cache templates as well as other frameworks then I would agree that in a lot of cases you wouldn't need to worry about it and could preload most things.


> If you render the view without any cache with 10 things then you'd perform 20 queries but after the cache is warm you'd perform 0 queries. If item 5's Y gets updated then you only need to bust the cache for item 5 and query only item 5's Y association. Performing a preloaded query to get all X 10 things with their Y associated things could be an expensive query.

The technique you describe makes sense, but how often is that preload query actually that expensive? Usually if I see an expensive query in Rails, it's because either 1) the indexes are missing for the joins or 2) you're instantiating extra full-blown objects when only some tiny amount of data of a primitive type is required.

This kind of stuff is a good opportunity to just write the query directly that gets the data you really do need, rather than relying on the ORM and its overhead.

If the problem is "maybe the user will want to see this, maybe they won't" then in many cases the easiest win is to lazy load the more detailed view with a new query that is only fired on some user interaction.


Writing things directly tightly couples code, sometimes quite distant portions of it, i.e. database access and presentation logic. With an ORM and some smarter lazy techniques or introspection you can at least untie it; with hand-written SQL there is no way. And the reason is that strings are poorly composable. Unless you use some query generator, but then we are back to something ORM-like.


Of course, it's not a one-size fits all solution. I agree that ORMs are incredibly helpful and using them is almost always the right place to start for this sort of thing because they are very flexible and take away the cognitive load.

But we're talking about performance here... if/when you start to have issues, it is time to take a look and see where you can do better.

Rails' ActiveRecord gives you tools to help build queries whether or not you materialize them as full blown objects. If you need to do some specific data extraction but want to avoid a bunch of needless instantiation, you can still leverage the ORM to help build the queries (and avoid writing raw SQL as strings in your codebase) which is a very helpful halfway point.


If only there were some way to retrieve information from disparate tables in a single request. Almost like some way to "join" the tables together... you could design a simple declarative language specific for querying in such a way. A "query language" if you will. It would have a simple structure, a structured query language, that enables you to get this data in a performant way without making N requests!

I might go and make this.


Whilst shallowly funny, the sarcasm actually shows a lack of understanding of the problem.

The problem can be described as this: given an arbitrary point in a program, how can you infer what data is required at a future point without executing code between those two points?

I’d love to see you go make a solution to this.


The point is that there is no "generic" solution. The abstraction of the ORM has created the illusion that you don't need to worry about this, the ORM will handle it. N+1 examples like iterating over all the publications from an author arise because the goal is to not have to concern yourself with the fundamentals of data fetching.


I’m not sure that position lives up to reality. It’s easy to avoid N+1 queries in Django, there are lots of ways to reduce that to 1 joined query or 2 in the case of a m2m. The issue here isn’t that “orms are bad”, the issue here is that “knowing how data flows through your system is hard”.

You’d have exactly the same issue as you’d have if you’d call “get_author_publications” O(n) times in some nested call stack. Except getting out of it wouldn’t be as easy as calling “select_related(…)” - you’d instead just end up re-inventing an ORM.


> Except getting out of it wouldn’t be as easy as calling “select_related(…)” - you’d instead just end up re-inventing an ORM.

This could be where we differ. I wouldn't propose that at all, I'd propose writing a query more appropriate for whatever part of the code contains the iteration.


> I'd propose writing a query more appropriate for whatever part of the code contains the iteration.

This is an identical solution to adding “select_related()” in the right places, and thus we loop back to the start of our thread.


> The abstraction of the ORM has created the illusion that you don't need to worry about this

It's not an illusion, it's an abstraction that sometimes leaks. But all abstractions are leaky.

There are plenty of projects that never reached a point where N+1 became something they needed to worry about. There are plenty of projects where the ORM allowed them to delay the need to worry about N+1, thus allowing them to focus on product.

And that means that abstraction is working fine. It's not perfect, but it's working.


Would GraphQL be a good starting point for such a language?


It was sarcasm, he was referring to SQL lol


select_related in Django does this. There's also prefetch_related in Django which is not something that is easily done in standard SQL.


Yes, just transform it to an SQL statement.


Self plug: Checkout https://github.com/har777/pellet to easily find and fix django N+1 issues.

I usually add it to existing integration tests so that they raise exceptions on N+1. If test coverage is low then I would suggest sending the N+1 metrics to something like datadog. That way your users using the product will reveal all the N+1 issues on your monitoring solution.

EDIT: I should add a screenshot to the README lol but the middleware will print each api being called with a nice table showing each query and the number of times it was called for the api.


This is great!

I also use https://pypi.org/project/django-zen-queries/, but Pellet might be better (Zen queries doesn't let you run any queries, which sometimes doesn't work).

Here's a screenshot for you:

https://imgz.org/i8XkiK2R.png


Yay! And thanks for the screenshot :D


This is very useful! I'm gonna start integrating it with my projects. However, having a way to allow/disallow N+1 queries (like a context manager) would be much better.

The thing is that there are times when N+1 isn't a big problem, and fixing it would be a form of premature optimisation. I'd prefer to be in control and decide whether I care about the N+1 query situation or not for some specific view.


I agree that in most cases this will amount to premature optimisation. Most Django sites will forever remain small, and there's a lot of benefit to a simple, intuitive (if inefficient) ORM. Namely, that it's easy for a jr developer to get in and get results. (That's why we're coding in Python, yes?)

Naturally yes, there's a certain scale where N+1 starts to really matter. And in those situations it might be helpful to be able to enable some feature to warn (or error) on N+1. But optimizing your code to run in fewer queries is often (in cases where prefetch_related doesn't cut it) non-trivial. And presumably when you get to this scale, you're also code-reviewing the jr developers... so I'm not sure how urgent it is that we all "Ban N+1 in Django" as a rule.


Should be easy enough to implement. You only need a context manager that adds 1 to some threadlocal flag on enter and subtracts on exit, then check this flag in the monkey patch. Not sure how costly that will be though.
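A sketch of that context manager, assuming the threadlocal-counter approach described above (all names invented):

```python
import threading
from contextlib import contextmanager

_state = threading.local()

@contextmanager
def allow_lazy_loads():
    # Counting (rather than toggling a boolean) makes nested
    # `with` blocks behave correctly.
    _state.depth = getattr(_state, "depth", 0) + 1
    try:
        yield
    finally:
        _state.depth -= 1

def lazy_loads_allowed():
    # The monkey-patched DeferredAttribute.__get__ would consult this
    # flag before deciding whether to raise.
    return getattr(_state, "depth", 0) > 0
```

The cost is one threadlocal attribute read per deferred access, which is cheap next to a query.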



You could probably implement it in a context manager: have the `__enter__()` method execute the `_DA_get_original, DeferredAttribute.__get__ = DeferredAttribute.__get__, _DeferredAttribute_get` code and have the `__exit__()` method undo that re-assignment.

(Or maybe the reverse is better. Ban N+1 by default and the context manager `__enter__` puts back the original assignment, `__exit__` brings back the banned version).


It's not ideal, but you can simply log `DeferredAttribute.__get__` calls instead of raising an exception.


Just keep a list of exemptions in a file


I don't disagree with the author in principle, but I find that once the data gets big enough for it to make a difference, I've already shifted to using `.values()` to avoid the overhead of model creation, and the KeyErrors it will throw if I leave the query lazy are tantamount to the solution he describes.


Same, and using raw queries too when needed. Identify hot spots and do those. Regular ORM is fine for the rest (mostly).


I usually use `.values()` and `.values_list()` only when model creation is really slow. Usually many rows fetched.


I would love to see this become something that can be toggled as a setting in core. The only time it's useful is opening up a shell to debug. The rest of the time, it's just making it really hard to find missing prefetches or `.only()` and `.defer()` calls that are too limited. It's never a good feature for a production site running at any kind of scale.


Rails has Bullet[0] to help identify and warn you against N+1

Does Django have anything active? A quick search revealed nplusone[1] but it's been dead since 2018.

[0] https://github.com/flyerhzm/bullet

[1] https://github.com/jmcarp/nplusone


The second, longer snippet from the OP article seems to do basically this. To modify it to be exactly like what bullet describes (warning you when in debug or test mode, instead of raising an exception, and have no effect in prod) you can replace line 31 with the logger.error() from line 33, and delete the else-case (lines 32 & 33)


It's not a risk I'm willing to take, nor do I believe other general Django engs should take it. It's fine if it's a pet project or if you've got deep knowledge of Django's inner workings. This requires understanding the effects of monkey patching core `__get__` on DeferredAttribute.

When reading the code vs core Django, it doesn't faithfully reproduce the normal case, missing _check_parent_chain[0]. I'm not sure if that code path is supposed to be left out or if it's simply missing? The documentation of the code snippet doesn't explicitly state either.

A code snippet that affects the project as a whole, one layer removed, is effectively library code. It needs to come with strong specs. The worst-case scenario is that you get differing behavior in production and development.

I 100% appreciate the spirit of the post, but monkey patching core Django files is not to be taken lightly in production code.

[0] https://github.com/django/django/blob/main/django/db/models/...


If I understand it correctly, that's what [django-seal](https://github.com/charettes/django-seal) does.


definitely seems like a good candidate.


There is django-zen-queries[0], mentioned in another comment.

[0] https://github.com/dabapps/django-zen-queries


Not a fan of this one, since it only works on templates and requires special code.


Never used this, but I see that it provides a context manager/decorator, which you can use to "shield" any code. And also some shortcuts for DjangoREST or something.

Regarding extra code, you can simply replace Django's `render()` with the one from this lib and you are done. Though I would probably also have made a custom template extension or something to do that for all renders automatically.


Yeah I really dislike that. It's putting a code path that will never run in production.

This feels like code that has a special switch `if env.is_test` just because you needed it for a spec. It's a really bad practice.

This library sticks that code _everywhere_. The library should be loaded only in the local/dev env. Arguably, it shouldn't even be loaded in test. But because of its style, it needs to be loaded in prod.


I wonder why ORMs still(?) work as simple wrappers and never track access patterns. If you see that the results of an `in books` generator are being accessed through a relationship, it's pretty obvious to join it in advance after a few misses and serve `book.author.full_name` from cache.

Of course that would make ORM more complex, but why would you need one otherwise. A good database interface should make good guesses, probably with some hints that could be slapped over without changing naive code patterns. But instead we get 1000 different cool ways to create a table and select a row by id.
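As a toy illustration of the idea (all names invented), a loader that falls back to one batched fetch after a few one-off misses:

```python
# Adaptive loader sketch: serve one-off lazy loads until a miss
# threshold, then batch-fetch the whole visible result set.
class AdaptiveLoader:
    THRESHOLD = 3

    def __init__(self, fetch_one, fetch_many):
        self.fetch_one = fetch_one    # id -> row (one query each)
        self.fetch_many = fetch_many  # [ids] -> {id: row} (one query)
        self.cache = {}
        self.misses = 0

    def load(self, wanted_id, visible_ids):
        # `visible_ids` is what an ORM could infer from the result set
        # the caller is currently iterating over.
        if wanted_id in self.cache:
            return self.cache[wanted_id]
        self.misses += 1
        if self.misses >= self.THRESHOLD:
            # Enough one-off misses: join/batch the rest in advance.
            self.cache.update(self.fetch_many(visible_ids))
            return self.cache[wanted_id]
        row = self.fetch_one(wanted_id)
        self.cache[wanted_id] = row
        return row
```

For 10 related rows this issues 3 queries (two singles, one batch) instead of 10, without changing the naive calling code.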


I don't want my app to change performance-sensitive behavior at runtime, that's a debugging nightmare.

I do want my framework to throw an error in development mode if I screw up the preloading.


Since this also requires the same heuristics, it could be just an option: orm.unroll("auto|throw|none"). That would suit you, me, and the commenter who wants hundreds of requests, as they wrote.


Because people want their queries to have predictable performance.


Interestingly, they never get it. Every layer of the stack up to the client's database connector is unpredictable.

But yes, people want it. And will trade a lot of performance for a false promise of predictability.


I don't know any of this, but doesn't the OP give a solution similar to this: https://github.com/Suor/django-cacheops


It’s hard to tell by skimming through the readme, but it seems there’s no mention of this technique.


cacheops is more of a general purpose ORM caching with automatic invalidation.


See also django-zen-queries https://github.com/dabapps/django-zen-queries , which can make it impossible for changes to a template to trigger queries.


I like Zen queries, but sometimes you just have to make queries in templates (e.g. when you want to check the user on the request object), or sometimes it's just convenient and there's nothing wrong with it (e.g. when you want to check the user on the request object).

Zen queries makes that use case impossible, sadly.


I don't actually use Zen queries myself, but it does have a queries_disabled template tag, which would allow for more specific control.


Came here to post just that. Really like zen queries


This is why I always advocated against ORMs. It’s so easy to fall into traps like this without even knowing it, and while you can work around it in some ORMs it is not obvious.

Writing SQL is not that hard, and mapping the results to a type isn’t that hard either. So with an ORM you might end up saving several hours of work up front for lots of pain later.


> Writing SQL is not that hard, and mapping the results to a type isn’t that hard either.

But... Then you've written an ORM.


Nitpick: no, you've written a pure object mapper that doesn't care about schema relations. This has the practical advantage of being just a data container that can be cleanly serialized/deserialized, instead of a model object with a transitive database-connection dependency.


I don't know what serialization/deserialization has to do with this. Does the object map to database rows, and is there code that moves the data back and forth? That's an object relational mapper.


Mapping database rows to object structures is an object mapper. An object relational mapper also keeps track of table dependencies (such as related fields).

If you read a row from a database and generate an object whose attributes map the fields in the database and are used to retrieve the values (a data object), that is an object mapper. This means when fetching e.g. user.type it will return 1 instead of the data object for the corresponding row in user_type.

If you read a row from a database, exactly like the above, but user.type returns a data object representing the related table row, that's an object relational mapper.

Regarding serialization, why does it matter? Because you need to serialize and de-serialize data objects or models if you're adding cache to e.g. a service layer. Also, serialization and de-serialization are quite important when interfacing with e.g. external systems - imagine having an application-wide InvoiceModel that can be transported via REST, gRPC, kafka/json or any other format, and that is database-agnostic.
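A small illustration of that distinction in plain Python (names and data invented):

```python
from dataclasses import dataclass

row = {"id": 7, "name": "alice", "type": 1}       # a fetched user row
type_rows = {1: {"id": 1, "label": "admin"}}      # the user_type table

# Plain object mapper: user.type stays the raw foreign-key value.
@dataclass
class UserData:
    id: int
    name: str
    type: int

user = UserData(**row)  # just data, trivially serializable

# ORM-style: accessing .type resolves the related row (possibly lazily,
# which is exactly where N+1 creeps in).
class UserModel:
    def __init__(self, row):
        self._row = row

    @property
    def type(self):
        return type_rows[self._row["type"]]  # stand-in for a DB lookup

orm_user = UserModel(row)
```

The first gives `1`, the second gives the related row object.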


> An object relational mapper also keeps track of table dependencies (such as related fields).

that's not what "relational" means. "relational" means, "a relational database". where we are using SQL statements to deliver data to and from such a database. an ORM that does not directly interpret objects along one-to-many collections /many-to-one attributes is still an object relational mapper.

> Regarding serialization, why does it matter? Because you need to serialize and de-serialize data objects or models if you're adding cache to eg. a service layer.

that's a separate concern from object relational mapping. Do you have the notion that ORMs produce objects that aren't compatible with serialization?


> an ORM that does not directly interpret objects along one-to-many collections /many-to-one attributes is still an object relational mapper.

This is diluting the term ORM beyond any usable definition.

A bit like when people claim that any programming language where you can associate a function with data and have the function implicitly get passed a reference to the data through data.function() call syntax is an OO language.

First off, I question your definition of "object". Is a C struct an "object"? Is a python tuple an "object"? I know that in python it literally is an object, but so is a function. But obviously your definition of an object must be separate from any given language.

To that end, does python's sqlite3 interface constitute an ORM? It converts the results of SQL queries to tuples. What about if I change the row_factory to sqlite3.Row? Is it now an ORM?

Where do you draw the line between something which is and isn't an ORM?

What is the point of such a vague definition?

What isn't an ORM in this case?


> that's not what "relational" means. "relational" means, "a relational database". where we are using SQL statements to deliver data to and from such a database. an ORM that does not directly interpret objects along one-to-many collections /many-to-one attributes is still an object relational mapper.

It doesn't. It means it works as a virtual object database that honors relationships in the underlying schema. The fact that the schema is relational is an implementation detail. Many ORMs will work with non-relational databases, but will honor underlying relations transparently if they exist, such as e.g. SQLAlchemy on MongoDB.

An object mapper isn't even a "database-related" technique. It's just a mechanism to map data into objects - e.g. JSON deserialization into object attributes is object mapping. When used in conjunction with traditional database mechanisms (such as the repository pattern and query builders), it can be used to minimize data conversions between systems.

> that's a separate concern from object relational mapping.

Well, it is, because in the Django ORM or SQLAlchemy you use different declarations of data structures for data received from requests, data passed around, and data read from and written to the database. It is actually a very leaky implementation where e.g. when designing models you feel like you're working on a business-object representation, but in fact you're still catering to implementation details of the relational model, such as primary keys, nullable fields, indexes and so on.

> Do you have the notion that ORMs produce objects that aren't compatible with serialization?

That is precisely the point. This is not a feature, it is a quite important limitation. If you use data objects, this issue (as well as complex mapping routines) goes away, because then these objects only hold data and not business logic. These concepts aren't really mainstream in python, but are bread and butter in many other languages.


So many people say this, but I have to imagine that such people haven't actually really tried to write SQL in their services and avoid the temptation to introduce all sorts of abstractions.

Because in my experience this doesn't happen. You just have to be OK with some duplication (which most engineers over-focus on to the point of serious detriment).


In my experience I did what you describe above for about ten years then I wrote SQLAlchemy, tired of writing the same code over and over again (and successive projects certainly gained more and more abstraction as the constant repetition became more obviously a huge waste of time and verbosity). So that's one example.


I wouldn't call just mapping an SQL result to a struct an ORM.

I'd draw the line for an ORM at generating queries less trivial than "get a record by field". Below that is just syntactic sugar over raw SQL.


It's extremely easy to write SQL queries that take a lot longer than they look - an unindexed full table scan looks exactly the same as an indexed join, for example, whereas in a good ORM they will look different. So as far as I can see, writing SQL manually is just extra drudgery for no real gain.


I don't know.

Hand-written SQL means a bunch of work and you have to know SQL. But it is simple and transparent.

ORMs automate a lot of simple and repetitive SQL, but to use them effectively you really have to know SQL extremely well and understand the ORM deeply as well.

So I guess it depends on what you are doing. ORMs can be useful but they require a lot more knowledge to use effectively than hand-coded SQL does.


> ORMs automate a lot of simple and repetitive SQL, but to use them effectively you really have to know SQL extremely well and understand the ORM deeply as well.

I'm not sure. You need to know the relational model pretty well, but you don't have to remember the zillions of quirks and edge cases or differences between dialects that SQL has. IME that's what takes most of the memorization effort.


> an unindexed full table scan looks exactly the same as an indexed join, for example, whereas in a good ORM they will look different

ORM queries compile down to SQL, so how would they look different?


Typescript compiles down to Javascript but an unchecked cast looks different from a known-safe assignment in Typescript even though they look the same in Javascript.


We're talking about something that exists at the query planner level, not a new feature introduced on top of that. Whether to use an index or a table scan isn't chosen by the user, it's chosen by the engine when the SQL is run. ORMs don't have any special hook to "look different" - if the database engine wants to do a full table scan, it'll do it whether the query came from manual SQL or an ORM, because they look the same to the database.


Depending on the database, the ORM may well add hints for the planner (of course it's theoretically possible to do that for manual queries as well, but it's usually verbose and database-specific). At a minimum the ORM has a model of what indices exist and so it can know which queries "should" be using an index, and may even have an internal model of how the query planner makes those decisions; of course ultimately if the query planner is perverse enough then even an automated system may not be able to get it to do the right thing.


ORMs generally give you a lot of nice things, and you usually (always??) can just write pure SQL and use the models you've defined (and all those nice things).

So, use an ORM, but write SQL if you want? Sounds like a good idea, actually.


An ORM is about more than mapping results to types; in Django's case you get a powerful DDL generator, migration management, constraint validation when saving, DB portability, ...


Automated migration management is probably the most important part of Django's ORM.


You're on the right side of the bell-curve meme my friend, but there are a lot more people in the thick-framework camp who spend their days getting lost in the complexity of ORMs and related tech...


It's more like the world is not black and white, engineering problems don't have "The perfect solution, the rest is trash", but rather "this problem has multiple solutions, depending on context, some have these tradeoffs and the others have these".

In this particular case, it might not be worth trading the development speed of not having to think about SQL for performance (today or tomorrow). Maybe you're building something that will just be used by 2-3 people, so 1+N isn't really an issue.

Or whatever, the conclusion as always is: it depends.


I agree it depends, but my hot take is that almost universally, the time people say ORMs save them, they end up paying back in spades debugging them. Learn SQL!


I really think it is more a matter of exposure and familiarity than bell-curve positioning. If any backend engineer with 1 year of ORM experience had spent that year instead becoming familiar with SQL, the speed bump would be practically nil.


In my experience, the far more pernicious way to get N+1 is serializers (Django REST)


A couple of years ago I wrote a set of utility classes (which inherit from Serializer and ViewSet) which "solve" this problem by inspecting serializers at the beginning of the request and figuring out what to pass to `select_related` and `prefetch_related` "automatically". It supports nested serializers, N-N, etc. Also lets you "help it" by saying "assume attribute X of a serializer accesses fields X,Y,Z of its object", for more "sophisticated" cases.

It's a very messy piece of code but it has survived many projects since I first wrote it 6 years ago. The day I enabled it at a previous job, we reduced a page load from 20s to 800ms or so just with it.
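A greatly simplified sketch of the idea: walk the serializer tree and collect relation paths to hand to `select_related`/`prefetch_related`. Real implementations introspect DRF serializer classes; here the tree is just nested dicts standing in for nested serializers:

```python
def relation_paths(tree, prefix=""):
    """Yield Django-style relation paths ("a__b") from a nested field tree."""
    for field, children in tree.items():
        path = f"{prefix}__{field}" if prefix else field
        yield path
        yield from relation_paths(children, path)


# A stand-in for a PostSerializer with nested author/comments serializers.
serializer_tree = {"author": {"profile": {}}, "comments": {"author": {}}}
paths = sorted(relation_paths(serializer_tree))
print(paths)  # ['author', 'author__profile', 'comments', 'comments__author']
```

In a real version you would split these into `select_related` (to-one) and `prefetch_related` (to-many) based on the field type.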


GraphQL makes it even more fun :)


Huh, ran into the same issue on a Rails project; queries kicked off inside serializers kept cropping up like weeds. We toyed with throwing errors if a query was made during serialization.

On top of that, everything about serialization is terribly slow in Rails :|


Ugh yes, nested serializers in DRF are always fun.

Almost makes you want to go noSQL.


My Chaotic Good take on this: one could implement a qs.auto_fetch_deferred() that emits model instances with weak back-references to a WeakSet of all instances emitted, and on a deferred get on ANY instance, it prefetches that attribute onto ALL of the instances... so that it doesn't just complain, but actually fixes your 1+N issue. But here lies absolute madness...
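The "fix it for everyone" idea can be toyed with in plain Python: instances share a `WeakSet` of batch peers, and the first lazy attribute access bulk-loads that attribute for every live peer. `batch_load` is an assumed bulk loader, not any Django API:

```python
import weakref


class PeerPrefetching:
    """Toy model instance that bulk-loads lazy attributes for its batch."""

    def __init__(self, key, peers, batch_load):
        self.key = key
        self._peers = peers            # WeakSet shared by the whole batch
        self._batch_load = batch_load  # assumed: (attr, keys) -> {key: value}
        self._cache = {}

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        if name not in self._cache:
            # One miss fixes the 1+N: load the attribute for every live peer.
            loaded = self._batch_load(name, [p.key for p in self._peers])
            for peer in self._peers:
                peer._cache[name] = loaded[peer.key]
        return self._cache[name]


calls = []

def fake_batch_load(attr, keys):
    calls.append(sorted(keys))
    return {k: f"{attr}-{k}" for k in keys}

peers = weakref.WeakSet()
instances = [PeerPrefetching(i, peers, fake_batch_load) for i in range(3)]
for inst in instances:
    peers.add(inst)

# First access triggers one bulk call; later peers hit the shared cache.
print(instances[0].author, instances[2].author, len(calls))
```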


Actually there is such a thing already: https://pypi.org/project/django-auto-prefetch/ It's not a monkey patch though, so it will only work on models that inherit from its base class.


I agree; Django (and many ORMs) have taken the opinion that if you load N instances of a model, and then you load a related field on 1 of them, you likely only want the 1 related field. If however they assumed that loading a related field implies you likely want all related fields, perhaps there’d be fewer instances of foot-gunning.


On the topic of accidentally doing lots of extra queries, I love using Model.objects.raw to control exactly how it’s retrieving model data. I love how it keeps the results inside the Model realm but gives you careful control.

I also like that if you access fields you didn’t ask for, it’ll go get them. But I wish it screamed louder when this happened. “Your logic works but we had to do extra queries. This might be an error!” So the featured article is incredibly valuable. This ought to be built-in.

Django Silk has been critical for discovering these cases. They’re too easy to do.

I love the middle ground of “ORM but you write the SQL.”


Laravel has had a similar feature for a while[0]. It’s been useful to enable this in development but only warn in production to prevent breaking things unexpectedly.

[0] https://laravel.com/docs/10.x/eloquent-relationships#prevent...


Usually it's the wrong pattern, but not always. If you have a very large dataset, it can be beneficial. You can make smaller transactions, smaller query results, and not fill up local memory. For offline backfills, or various reporting jobs this can be the difference between something that works, and something that doesn't.


For that, fetching in chunks is usually the way to go. I usually use something like these: https://handy.readthedocs.io/en/latest/db.html#queryset_iter...
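Setting the linked helpers aside, a minimal generic chunker looks something like this; `fetch_page` is an assumed `(offset, limit) -> list` callable, so the same shape works for querysets and RPCs alike:

```python
def in_chunks(fetch_page, chunk_size=1000):
    """Yield every row by pulling fixed-size pages; memory stays bounded."""
    offset = 0
    while True:
        page = fetch_page(offset, chunk_size)
        if not page:
            return
        yield from page
        offset += len(page)


rows = list(range(25))
fetched = list(in_chunks(lambda off, lim: rows[off:off + lim], chunk_size=10))
print(fetched == rows)  # True
```

Offset pagination is the simplest form; for large tables, keyset pagination (filtering on `pk > last_seen_pk`) avoids the cost of deep offsets, which is what queryset-iterator helpers typically do.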


Shameless plug - I ran into this while developing REST interfaces with Django and built django-auto-prefetching: https://github.com/GeeWee/django-auto-prefetching

It essentially travels your DRF serializer tree and builds an auto-prefetched query automatically without you needing to do any work.

Back when I still worked actively on it, I wanted to monkey-patch models to track whether or not n+1 was happening, and if it was, automatically do pre-fetching, so instead of an n+1 problem you'd end up with just a "3-4 queries when it could've been 1" problem - which is much more palatable. Never got around to that part though.


In our services I implemented a per request query counter that gently warns you if you exceed a max query count. Use case: identifying when it might make sense to use DataLoader. Note: doing this isn’t always worth it.


I like this idea. I would just log the query count at the end of each request without any pre-defined limits. IFF you see performance drop, then you can always investigate and see if anything is triggering greater than expected calls.

Nifty idea, and probably not too hard to implement.


Heh, I'm literally in the middle of optimizing some N+1 query endpoints in a Django application for work, made a bit more tricky because of DRF's serializer.

I think a setting for lazy queries would be a good solution, with it enabled by default to ease the transition. It would be nice to have a couple options, though – allow, warn, and error.

It would also be great to have a way to change the setting on the fly so that, e.g., the Django shell can automatically enable it for those quick debugging sessions.


On an unrelated note, Python folks should check out OP's library funcy [1]: "A collection of fancy functional tools focused on practicality. Inspired by clojure, underscore and my own abstractions."

Thanks for the library Suor! I've used it happily for many years!

[1] https://github.com/Suor/funcy


I was working on a library to do this which I never got around to fully finishing, but the idea is there: https://github.com/lime-green/django-orm-plus - I think all I had left was polishing the auto-add logic and testing with a real project.


This raises a question: are there any easy batching libraries in Python? It's like the N+1 problem, but for non-SQL calls.

I have various RPCs, some of which accept multi-action requests with certain limits, so how do I write non-batched calls in code but execute them as a whole when needed?
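I'm not aware of a standard library for it, but a tiny synchronous batcher is easy to sketch: queue individual actions and flush them through the multi-action RPC in chunks of at most `limit`. `send_batch` is an assumed callable; async dataloader-style variants flush on an event-loop tick instead of explicitly:

```python
class Batcher:
    """Collect single actions, dispatch them as bounded multi-action calls."""

    def __init__(self, send_batch, limit=100):
        self.send_batch = send_batch  # assumed: takes a list of actions
        self.limit = limit
        self.pending = []

    def add(self, action):
        self.pending.append(action)
        if len(self.pending) >= self.limit:
            self.flush()

    def flush(self):
        while self.pending:
            chunk, self.pending = (self.pending[:self.limit],
                                   self.pending[self.limit:])
            self.send_batch(chunk)


sent = []
batcher = Batcher(sent.append, limit=3)
for i in range(7):
    batcher.add(i)
batcher.flush()  # flush the leftover partial chunk
print(sent)  # [[0, 1, 2], [3, 4, 5], [6]]
```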


TIL: > With something so innocent as an attribute access making an SQL query, it’s much easier to miss it.

I like Python as much as the next person, but this is a highly irresponsible design decision.

Starting to appreciate Scala's IO effects even more.


This is not a Python problem, it's a design decision of Django's ORM specifically.


If you're going to do this, you may as well simply not use an ORM. Which is definitely the solution that I'd recommend; they are just not actually a good idea.


The ORM provides a myriad other features, like adapters for every production database under the sun, query composition that is literally impossible in plain SQL, a reasonable interface to the admin and the ecosystem of Django apps, and above all: a logical interface that maps _business objects_ to their SQL tables.

The ORM hate seems to come from people whose day to day interaction with data tables isn't mediated by seeing them as business entities where all the goodness of object-oriented programming actually shines.


... Or by people that actually understand the impedance mismatch between objects and data (quick Django example - request data and models are different and not easily interchangeable). Or people that require good caching implementations. Or people that actually design database systems schema-first. Or people that rely on advanced usage that isn't always easy to perform in ORMs. The list goes on.


Django's ORM is much better than most, to the point where Django considers "I can't model this query with the ORM" to be a bug.

There are of course some mismatches, but it's pretty hard to have a query that is not at all modelable, and Django's ORM is ... fairly predictable (I have some gripes about how obvious or not joins are but it's subjective).


There is a subtle difference between "can't be done" and "I'll spend an afternoon digging through documentation, code & examples to implement this".

Quick obvious example - use views for data retrieval (that may contain more fields than the actual model) and tables for data insertion.


In that case, for the SQL, you would be pulling data from one table and putting it into another. In the ORM I would similarly model it by having an abstract model to hold the general shape of the table, and then one read-only model and another for writing, with some helper functions to pull the data from one or the other.

Though if you are working with views and not materialized views, performance-wise "use .annotate when fetching data" will be roughly as performant as using a view. Views just end up getting inlined when the planner deals with the query (in Postgres at least).

Of course I get this cuz I spent a lot of time with the ORM. I think it's easy for people to think "there must be a magic option somewhere in the ORM to do this", but sometimes there's no magic. The counterpoint, though, is you have Python behind all of this so stuff like "share schema definitions between classes" is easy and straightforward in my experience (including on projects with like...80+ models, some with way too many fields, so I get your pain). It might not be the exact API you want though


> In that case for the SQL, you would be pulling in data from one table, putting it into another

In Django ORM lingo, that's declaring 2 models (assuming they're even in the same namespace or app), plus the request you use to fill them. If you don't find this awkward, it's on you, not me :)

> Views just end up getting swapped out when the planner deals with the query (in Postgres at least)

Huh? Do you actually know how views work? You don't even have an easy (non-SQL) way of declaring views in Django. And apparently, you don't seem to grasp the implementation details of views versus materialized views - a view is a stored server-side query, while a materialized view is a physical pre-computed table. Also, in PostgreSQL (as well as many other databases) a view can combine data from several tables, including remote tables. The whole "this is an RDBMS and we shall abide by it" went out the window the moment I could use a SQLish database to query e.g. CSV files and/or import them as local "tables".

> Of course I get this cuz I spent a lot of time with the ORM

Don't get me wrong, but it seems you're not spending enough time. Let me enlighten you using a personal anecdote: a project management solution where you log your tasks during the day, with hours, and it would keep track of the collaborators' project allocation during the project execution time. Each collaborator (of fewer than 100) would log from 1 to probably 20 tasks per day, on the projects they were allocated to. Reports were generated - per project, daily, monthly, etc. - you get the point. Obviously, those reports were computed using Python and the ORM - so at some point, getting e.g. a company-wide report with a couple of projects for a year would trigger some specific conditions that made the report take more than 10 minutes to generate. A dataset I could tabulate in Excel (a couple of hundred thousand lines). Half of the time was actually spent allocating objects in memory for stuff that had 1 field read. Of course the report routine reused the existing code functions for specific computations, which increased execution time and memory allocation 10-fold. The 20-line SQL query that replaced hundreds of Python lines executed in less than a hundred milliseconds. I could blame the ORM, but instead I blame the application design that follows ORM constraints. Someone detached from a data source would tell you "the parent API needs to provide that" - instead, you use what you have at hand. Just because it works, it doesn't mean it's good.

> so stuff like "share schema definitions between classes" is easy and straightforward

That is actually the shit-show design I always want to prevent. Classes should not share schema definitions, but data formats (data objects). In the specific context of Django, Models are a piss-poor data object representation (because they're a quite decent Model implementation, mind you), and the whole apps scaffolding is just... nuts. Apps are notionally self-contained units of functionality, but they quickly become a cesspit of cross-referencing imports. Rule of thumb in most languages: if different "modules" of the application need to share a schema definition, you're doing it wrong. But heyyy, Python.


Alright, I'm not going to engage with you on this, except to say that I've also written a report generator that does the whole "aggregate by time bucket and certain conditions across multiple tables"; it was in the Django ORM and it would end up as a single query that looked like the SQL I would have written. Lots of `query.explain()` calls locally to make sure I was getting what I wanted, but hey, I'd be doing that for SQL in perf-sensitive queries as well.

I'll concede that the ORM can easily lull you into writing quadratic-behavior stuff. You still gotta know how your DB works! But "mostly using the ORM, using SQL for some tricky things" works really well, especially given that the Django ORM lets you inject SQL fragments for a lot of behavior and have that "just work".


Extending on impedance of objects and data, and validated request data being different from models, imagine this pseudo-code:

  function action_endpoint():
      if request.is_valid():
          data = request.to_data_object(data_object_class)
          self.service.update(data)
          return success()
      return request.errors()


In this simple example, the internal data representation isn't using a full-blown object, but a "data object" (e.g. a dataclass). There are no transitive database dependencies; it behaves just like a fancy dict. When including data_object_class, I'm not including the whole database driver. When passing this to other system components, it can be serialized & de-serialized because it has no intrinsic behavior implemented. As such, when using architectural patterns like three-tier design or hexagonal design, you can pass data between layers without any "external" (from a data perspective) dependency; this allows the frontend to be completely agnostic about where & how data is stored. In fact, in this example, self.service could be an RPC proxy object to another subsystem, on a different server. The advantage of this design becomes quite apparent when you need to decouple how data is stored from how data is processed - you start designing your application in a service-oriented approach, instead of a model-oriented approach.

In fact, one could just create an endpoint processor that receives a table name, and infers the rest of the logic in the middle (the request validation, the data object, the glue service for database), that today can write to a database, and tomorrow just calls an api without rebuilding your application.
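Concretely, the pseudo-code above might look like this in Python, with all names assumed: the validated payload becomes a plain dataclass, so the service layer never sees the web framework or the ORM:

```python
from dataclasses import dataclass, fields


@dataclass
class TaskUpdate:
    id: int
    title: str
    hours: float


def to_data_object(payload: dict, cls):
    # Keep only the fields the data object declares; no model, no DB driver.
    names = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in payload.items() if k in names})


class TaskService:
    """Could write to a database today, call a remote API tomorrow."""

    def __init__(self):
        self.saved = []

    def update(self, data: TaskUpdate):
        self.saved.append(data)


payload = {"id": 1, "title": "write report", "hours": 2.5, "csrf": "ignored"}
service = TaskService()
service.update(to_data_object(payload, TaskUpdate))
print(service.saved[0].title)  # write report
```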


The list goes on and on, and yet in practice these problems are solvable and the impedance is just really not a big deal.

There is only one feature missing in the ORM, which is composite primary keys. For everything else, all those things have clear and simple solutions.

"Impedance mismatch" is just a thought-terminating cliché. High-level languages have impedance mismatch with binary code; reactive components have impedance mismatch with state, relational tables have impedance mismatch with hierarchical data. Yet we find solutions and workarounds and the severity of these problems is generally overrated outside of purely theoretical contexts.


> and the impedance is just really not a big deal

I'm quite happy you haven't come across major issues with this. If you develop clean-architecture solutions that are live products across years, this is a major problem (e.g. table X is now a separate full-blown service; table Y is an external materialized table with no inserts, as inserts now go into an external messaging system such as Kafka; etc.).

> "Impedance mismatch" is just a thought-terminating cliché.

So are the whole ORM advantages. My personal distaste for ORMs doesn't even start with the obvious technical drawbacks; it starts with the fact that a developer should have a solid grasp of the domain they are working in, which, more often than not, ORM advocates lack. If you can't model data from a storage perspective (which, btw, is often the bottleneck of your application), you sure as hell won't do a good job modelling it in a business domain.

> Yet we find solutions and workarounds and the severity of these problems is generally overrated outside of purely theoretical contexts.

Ahh yes, the typical "let's not get theoretical" argument. ORMs are usually crap, and in Python they are actual crap. If Django is a good example for you, good for you. If you ever have a look at Entity Framework you'll be amazed. Try to use a schema-first approach with any mainstream ORM and you'll quickly realize all you do is workarounds because of assumptions and limitations. Thing is, for my daily work, these problems are actual problems. So much so that we don't use Django or ORMs.


ORMs are just not performant unless you reason about all the code at the level of "what queries are going to be generated and when", which makes the ORM an unhelpful layer of obfuscation over the layer of abstraction that you're actually reasoning at.

This is quite different from high level vs assembly where you can easily go your whole life without ever learning assembly language or how a compiler works.

Or to put it another way, the difference between the two situations is that an ORM API is not a higher level language than SQL. Transpiling between two languages of comparable expressiveness (SQL is actually more expressive but no need to go there) adds an extra source of problems without gaining you much.


> ORMs are just not performant unless you reason about all the code at the level of "what queries are going to be generated and when", which makes the ORM an unhelpful layer of obfuscation over the layer of abstraction that you're actually reasoning at.

You're assuming I use the ORM to not reason about SQL or not think about performance. This isn't true; first of all because even if you write SQL, SQL performance is not immediately obvious for any but the simplest of indexed queries. In no storage system do you ever get away from reasoning about this.

Second because SQL is actually a mediocre abstraction layer over your data storage. You can't really compose SQL queries; in an ORM taking a base Query object and adding a bunch of various `filter()` statements automatically does the right thing. Basic queries are much shorter visually; ORMs deal with the abstraction of table and column renames that mean rewriting all your SQL in other systems. I feel like you're just trotting out "reasons" out of a blog post from people whose priorities aren't the ones that people like us who write CRUD systems day in and day out do.

Again, you're talking about theoretical disadvantages which I have only really encountered about a half dozen times in over a decade of using Django even in performance-sensitive areas. Rewriting one ORM query out of a hundred is not a problem, especially if I had to rewrite the SQL in the first place.


> Second because SQL is actually a mediocre abstraction layer over your data storage

Actually, SQL is not an abstraction layer for data storage. It is a query language based on set theory. Even DDL doesn't impose any limitations on you or make any assumptions regarding storage - just data representation. You're conflating SQL RDBMSs with the SQL language, just like many ORMs do. Quick example: you can use SQL with Apache Drill to query log files in CSV - SQL the language, none of the RDBMS stuff.

> You can't really compose SQL queries; in an ORM taking a base Query object and adding a bunch of various `filter()` statements automatically does the right thing. Basic queries are much shorter visually

Of course you can. But even without needing to explain to you how one of the most famous anti-patterns works (EAV - entity-attribute-value), the easiest way of doing it is by using a query builder in your own development language - you just add filters based on the specific code conditions.

Oh, and "compose" is a terrible term, especially when SQL actually gives you subqueries (SELECT * FROM (SELECT ...)), LATERAL and UNION - all of these allow you to perform proper composition, years ahead of what most ORMs give you.


> In no storage system do you ever get away from reasoning about this.

Agreed

> Second because SQL is actually a mediocre abstraction layer over your data storage.

This is fair but adding yet another layer only makes things worse

> I feel like you're just trotting out "reasons" out of a blog post

Rest assured that I haven't read a blog post or indeed anything on the topic; in fact I'm shockingly ignorant in general.


> This isn't true; first of all because even if you write SQL, SQL performance is not immediately obvious for any but the simplest of indexed queries. In no storage system do you ever get away from reasoning about this.

True, but an ORM adds yet another layer of cruft to debug and monitor, with arguably few benefits for such an advanced user. Also, in some databases, EXPLAIN is your friend; it may not give you execution time (it does, but let's assume you're right just for the sake of it), but it does give you the planned "execution cost". 0.1 is better than 0.5, and so on. Also, it will tell you if your indexes are actually being used (are you a MySQL user? that would explain a lot).

Regarding performance, you have 3 main areas that are costly: query execution, result retrieval time, and result serialization & transformation. The first two are characteristics of the database design and the specific system; the last one is purely framework-dependent. If you're a Python buff, you'll quickly realize that your application will spend a non-trivial amount of time getting around proxy class implementations for models in the deserialization phase of the data - in some cases, more than the query and transport time itself. All because you e.g. wanted to compute minutes spent on a given task for a given user for a year, counted in hours or melons or whatever.

> that people like us who write CRUD systems day in and day out do

I write a shit-ton of CRUD systems (in Python), and none of them use Django, because interoperability is often desired, and in some cases a requirement. Just because you use Django and an ORM to design some stuff doesn't make it wrong, but it also doesn't make it right. Want to design database CRUDs without code? There are plenty of tools for that, ranging from BPM tools like Bonita BPM (OSS) to OutSystems.

>Again, you're talking about theoretical disadvantages which I have only really encountered about a half dozen times in over a decade of using Django even in performance-sensitive areas.

Or - or - we're discussing problems you don't have because you work in a very narrow field where Django is actually a good fit. They exist, and nothing against it. But alarm bells go off when you tell me you've been spending "over a decade" with Django. In "over a decade", I've written a shit-ton of libraries in at least 4 languages on this specific topic (database middlewares, query builders, object mappers, etc.), and the driver for all of them was to solve problems you describe as "one in a hundred".

> Rewriting one ORM query out of a hundred is not a problem

Actually, it is. Because it's not *one* query, it's one endpoint - it may be using a dozen models, in a dozen child subroutines, to compute a specific value; it may be using a function in a different app for a specific computation; it may even happen that the local computation has certain characteristics the replacement query doesn't, because it runs server-side - sums of times, sums of decimals, sums of floats. And now every time someone edits that model assuming "that's it", they either have a crappy abstraction model and realize they also need to edit some method's SQL queries by hand (oh, and patch the migrations, if necessary), or they're just playing whack-a-mole with the assorted array of custom functions breaking in unit testing. In the end - assuming all things are equal - I would very much prefer if architecture and concern delegation weren't affected by performance-related refactorings.


In iommi we scream at you in the console if you have N+1 issues. It's not as harsh as just banning it.


Elixir/Phoenix does this by default. It is a good default.


Lazy loading really has no place in any framework.


sentry.io is pretty good at catching N+1 queries.


Sentry has a lot of GDPR problems, it's not easy to set it up so you don't leak PII in e.g. traces sent to it.


You can self-host the open-source version, it's dockerized and pretty easy to set up...


Correct. Sentry on the web and Firebase in mobile apps often store troves of personal data captured by developers.

Neither is meant for PII/personal data processing, and both are huge compliance risks.


There is a problem with the URL. I think this is the correct one https://suor.github.io/blog/2023/03/26/ban-1-plus-n-in-djang...


HN keeps automatically replacing the URL. It used to be a redirect before, but not anymore


HN has a feature that uses the canonical address in the webpage. Is your page configured correctly?

Looking at the source:

  <link rel="canonical" href="http://hackflow.com/blog/2023/03/26/ban-1-plus-n-in-django">

Don't repost it again (for now). Try fixing the canonical link, and send an email to the mods hn@ycombinator.com with a short explanation and a link to this post https://news.ycombinator.com/item?id=35313565 to save them a few minutes searching. They may fix it using admin magic or ask you to repost again once the problem is fixed.


As gus_massa pointed out, HN's software uses canonical URLs when it finds them. The canonical URL on the page you submitted was http://hackflow.com/blog/2023/03/26/ban-1-plus-n-in-django, so our software used that. I've fixed it above now.


Thanks



