> Why even have a SUM function, when it is not accelerated by an index, that mak...

naranha · on Aug 25, 2022

But even SUM in AWS RDS Aurora is not optimal in terms of performance. You could do much better by having partial SUMs in the index btree nodes, that way you could get the sum of billions of records without processing any row individually. CouchDB supports this (even though it's by far not the most efficient database). I wish Postgres and others provided access to that sort of low level map-reduce index functionality. Transferring large amounts of data in the response is solvable by using shared memory, so I'm not concerned about that. And a SQL like language could be provided on top of that. If a low level interface was standardized you could even have SQL like languages that are portable between DBs.

acdha · on Aug 25, 2022

> You could do much better by having partial SUMs in the index btree nodes, that way you could get the sum of billions of records without processing any row individually.

It sounds like you don’t want a general purpose database. Something like what you describe might be more efficient on certain queries but only because it’s giving up a lot of flexibility - consider what happens if you add any filtering or grouping changes and those partial sums can no longer be used.

This preference for general systems rather than optimizing for a single query shows through in the tools used for this kind of task: materialized views and indexes using expressions. Those would allow the kind of precomputed aggregates you’re talking about without the kinds of challenges around flexibility and correctness your approach would encounter.

> Transferring large amounts of data in the response is solvable by using shared memory

That’s a common optimization but it’s not a panacea any more than indexes are. For example, relying on that falls over badly once you need a network and the problems I mentioned before are still important — even if your shared memory implementation is perfectly efficient, you’re still looking at keeping data in memory longer so your client can aggregate it when the database engine could do that for you and discard the memory immediately.

Another way of looking at this is to ask why a field with billions in R&D hasn’t already done it, and especially why various big data tools have been significantly less impactful than promised. If you see what seems like an easy way to beat the current options, it’s usually a good time to ask why everyone working in that space hasn’t seen it.

naranha · on Aug 25, 2022

> Something like what you describe might be more efficient on certain queries but only because it’s giving up a lot of flexibility - consider what happens if you add any filtering or grouping changes and those partial sums can no longer be used.

If you're doing a query that is not backed by an index, it's not scalable, so you could also do it in Excel.

> This preference for general systems rather than optimizing for a single query shows through in the tools used for this kind of task: materialized views and indexes using expressions.

Materialized views totally suck (at least in postgresql), because you cannot update them incrementally (yet). A better option most of the time are triggers and aggregate columns, but it requires a lot of micromanagement.

> It sounds like you don’t want a general purpose database

I don't think I want a special feature, map-reduce indexes seem pretty general "mathematically" sound thing.

> Another way of looking at this is to ask why a field with billions in R&D hasn’t already done it, and especially why various big data tools have been significantly less impactful than promised.

Well that's what's exiting about research, that some things haven't been done yet. Even though they may seem fairly obvious.