> Why even have a SUM function, when it is not accelerated by an index, that makes using it automatically non-scalable.
Because it’s useful and that last part isn’t true? Databases do more than indexes and you have trade offs regarding the impact of adding too many indexes.
It's also not only not a problem for scalability but in reality an important way to _improve_ scalability. Consider a simple example where we do "SELECT SUM(price) FROM orders GROUP BY customer_id". Removing SUM() from the language would massively increase the amount of data which needs to be processed into the response — which makes anything else you're doing like sorting harder – and it prevents various optimizations the database engine might make. For example, on AWS RDS Aurora the database pushes all of that down to the storage nodes so each storage node will return the sums for the records which are on that node and the main node can sum those intermediate values without needing to transfer a single source row over the network. Since SUM() is part of the language and well-defined, that's safe for one database team to implement without the risk of their results not matching a competing database.
Other things to think about are computed views and stored procedures: in both cases, there are situations where these are dramatic performance improvements or otherwise desirable and having a richer language means that those constructs can solve more problems.
But even SUM in AWS RDS Aurora is not optimal in terms of performance. You could do much better by having partial SUMs in the index btree nodes, that way you could get the sum of billions of records without processing any row individually. CouchDB supports this (even though it's by far not the most efficient database). I wish Postgres and others provided access to that sort of low level map-reduce index functionality. Transferring large amounts of data in the response is solvable by using shared memory, so I'm not concerned about that. And a SQL like language could be provided on top of that. If a low level interface was standardized you could even have SQL like languages that are portable between DBs.
> You could do much better by having partial SUMs in the index btree nodes, that way you could get the sum of billions of records without processing any row individually.
It sounds like you don’t want a general purpose database. Something like what you describe might be more efficient on certain queries but only because it’s giving up a lot of flexibility - consider what happens if you add any filtering or grouping changes and those partial sums can no longer be used.
This preference for general systems rather than optimizing for a single query shows through in the tools used for this kind of task: materialized views and indexes using expressions. Those would allow the kind of precomputed aggregates you’re talking about without the kinds of challenges around flexibility and correctness your approach would encounter.
> Transferring large amounts of data in the response is solvable by using shared memory
That’s a common optimization but it’s not a panacea any more than indexes are. For example, relying on that falls over badly once you need a network and the problems I mentioned before are still important — even if your shared memory implementation is perfectly efficient, you’re still looking at keeping data in memory longer so your client can aggregate it when the database engine could do that for you and discard the memory immediately.
Another way of looking at this is to ask why a field with billions in R&D hasn’t already done it, and especially why various big data tools have been significantly less impactful than promised. If you see what seems like an easy way to beat the current options, it’s usually a good time to ask why everyone working in that space hasn’t seen it.
> Something like what you describe might be more efficient on certain queries but only because it’s giving up a lot of flexibility - consider what happens if you add any filtering or grouping changes and those partial sums can no longer be used.
If you're doing a query that is not backed by an index, it's not scalable, so you could also do it in Excel.
> This preference for general systems rather than optimizing for a single query shows through in the tools used for this kind of task: materialized views and indexes using expressions.
Materialized views totally suck (at least in postgresql), because you cannot update them incrementally (yet). A better option most of the time are triggers and aggregate columns, but it requires a lot of micromanagement.
> It sounds like you don’t want a general purpose database
I don't think I want a special feature, map-reduce indexes seem pretty general "mathematically" sound thing.
> Another way of looking at this is to ask why a field with billions in R&D hasn’t already done it, and especially why various big data tools have been significantly less impactful than promised.
Well that's what's exiting about research, that some things haven't been done yet. Even though they may seem fairly obvious.
Because it’s useful and that last part isn’t true? Databases do more than indexes and you have trade offs regarding the impact of adding too many indexes.
It's also not only not a problem for scalability but in reality an important way to _improve_ scalability. Consider a simple example where we do "SELECT SUM(price) FROM orders GROUP BY customer_id". Removing SUM() from the language would massively increase the amount of data which needs to be processed into the response — which makes anything else you're doing like sorting harder – and it prevents various optimizations the database engine might make. For example, on AWS RDS Aurora the database pushes all of that down to the storage nodes so each storage node will return the sums for the records which are on that node and the main node can sum those intermediate values without needing to transfer a single source row over the network. Since SUM() is part of the language and well-defined, that's safe for one database team to implement without the risk of their results not matching a competing database.
Other things to think about are computed views and stored procedures: in both cases, there are situations where these are dramatic performance improvements or otherwise desirable and having a richer language means that those constructs can solve more problems.