Hacker News new | past | comments | ask | show | jobs | submit login
A Critique of SQL, 40 Years Later (carlineng.com)
187 points by mariuz on Aug 24, 2022 | hide | past | favorite | 253 comments



The only thing that affects my daily life with SQL is that FROM should be before the SELECT keyword. This would _greatly_ improve type-ahead support in SQL IDEs.

Nothing is perfect, but that is really the main beef. Another commentor already nailed having a LIMIT WITH ERROR clause to be specified on UPDATE,DELETE statements and explicitly throw an error otherwise.

SQL is on of my favorite tools to use and I don't see it getting replaced by anything any time soon.


I started my career in the 90s writing ROLAP engines, and even though I've spent most of my time since doing "web" development, I still seem to end up having to build engines that generate SQL queries that are dozens of lines long.

The complaints about SQL composability are real. The grammar is fundamentally pretty irregular. Acceptable for humans to type ad-hoc, crappy for computers to generate.

You can like what SQL does for you (I do!) but it's also easy to imagine something a bit better. I thought the article was spot-on. I hope some future SQL x.0 will fix these issues, but also be similar enough to present-day SQL that I don't have to learn a whole new language from scratch.


> I hope some future SQL x.0 will fix these issues, but also be similar enough to present-day SQL that I don't have to learn a whole new language from scratch.

If SQL is to Javascript then something akin to WASM would be nice. Lower level primitives that allow query systems to be built. SQL could be one such system, but not limited to.


Every time SQL is mentioned on HN someone comes to complain about FROM coming after SELECT. I use SQL every day and not a single time have I found reason to complain about it. Can you give a bit more detail about what's wrong with it being like it is ?

EDIT : thanks all for your reply. I now understand that it is an IDE related thing not something fundamental to the language.


If you exploring a set of tables you have never touched before, it really neat if you could just type in:

FROM tablename t SELECT t.<press tab>

and some form of autocomplete mechanism, either prefills all the column names from table "t" or suggests the list of columns and/or types associated with it.

This is much better than having to: 1. Run a SELECT with LIMIT statement just to get an idea of the layout. 2. Point and click through the IDE treeview.

Honestly, I don't think it helps a whole lot beyond this functionality, but I can see why folks who are accustomed to thinking in functional pipelines (from -> select -> map -> filter -> collect) can prefer this way of querying.

I think PRQL is one attempt at building something this way.[1]

[1]. https://github.com/prql/prql


In SSMS and almost certainly many other SQL editors Intellisense works on the select list as soon as you have a working from clause, the only caveat being you can’t autocomplete columns reading when writing a statement purely left to right.

But I think that’s not a big deal at all, and the SQL approach has a certain advantage in putting the type of action (select/update/delete/drop etc) front and centre, which is really quite helpful.


SELECT FROM widgets VALUES id, factory_origin_id, frobbable_knobs


DESCRIBE TABLE is a command that pretty much does exactly this (explain what a table contains) and it's a part of MySQL.

If you use PostgreSQL then you can use \d instead.

I'm sure the other RDBMSes have their own equivalent (except for maybe SQLite but if you're using SQLite outside of toy environments or hobby projects, you're doing it wrong).


> except for maybe SQLite

.schema

> if you're using SQLite outside of toy environments or hobby projects, you're doing it wrong

That's an uninformed statement. SQLite is extremely solid production quality code. Of course it's not the universally applicable database solution, nothing is. Sometimes you need Cassandra or its kind, often MySQL|Postgres, other times SQLite is the correct answer.


It's kind of funny to see a claim that the most widely deployed database in the world is useful only for toy projects.

https://www.sqlite.org/mostdeployed.html


I guess there are more toy projects than serious ones.


Yeah, I mean MacOS, Android, iOS, Chrome, Firefox ... all typical toy projects.


sheesh lets hope there are no vulnerabilities in there


SQLite has an 608 times more lines of code for tests than the original code![0] I’d wager it’s the highest test/production ratio there is.

[0]: https://www.sqlite.org/testing.html


Most of those tests are generated. It's wrong to focus on this metric IMO.


Auto generated or not, the point being made, in regards to @bot41's comment, was that SQLite is thoroughly tested.


It's well tested, but that's still not a guarantee that there are no security issues. Latest acknowledged CVE bug is a month old: https://www.sqlite.org/cves.html


> if you're using SQLite outside of toy environments or hobby projects, you're doing it wrong

https://www.sqlite.org/mostdeployed.html

> Every Android device

> Every iPhone and iOS device

> Every Mac

> Every Windows 10 machine

> Every Firefox, Chrome, and Safari web browser

> Every instance of Skype

> Every instance of iTunes

> Every Dropbox client

> Every TurboTax and QuickBooks

> PHP and Python

> Most television sets and set-top cable boxes

> Most automotive multimedia systems

> Countless millions of other applications

Look at all those toys.


everyone is doing it wrong :(


Not necessarily.

If you ponder the importance of proper (robust, reliable, dependable) data management for data that keeps nuclear plants going, for farmaceutical research data, for anything happening on the financial markets, for medical records, for data concerning payroll and the like, etc. etc. then you might appreciate that all the stuff mentioned in the list is indeed really "just toys".


You're missing the point: until you have typed the FROM TableName t in your IDE, SELECT t.<press tab> can not do autocompletion. That has nothing to do with whether SQL supports querying metadata (the ANSI portable way would be through INFORMATION_SCHEMA btw, not DESCRIBE). It's a consequence of the brain not responding to think-ahead queries by the IDE.


This may be true, but you've missed the point that this is about aiding autocompletion while you're writing the query the first time.


Couldn’t the IDEA understand

FROM table SELECT

And when done switch the code to be correct?


Every RDMBS has this, including SQLite.


in sqlite3 it is .schema


SQLite absolutely has production level applications, it's much more than a toy.

https://www.sqlite.org/whentouse.html


Everyone else is mentioning it from an IDE perspective, but let's also think about logically from a language perspective. When you start a FROM clause and add some JOINs, a few WHERE conditions and maybe GROUP BY, you are building a virtual view of a series of tables, columns, and aggregations. You could even define this data set as an ephemeral table. What you do with that data set afterwards might vary depending on the need, but the data set might not change. Depending on the application, you might select different columns from the data set. We do this naturally using a WITH clause at the beginning of a query.

WITH (combine a whole bunch of stuff) as dataset SELECT a, b, c FROM dataset

This approach just says:

FROM tables... WHERE ... SELECT a, b, c

To me, it does make a lot of sense. This is also the paradigm that some of the graph databases use.


You could even define this data set as an ephemeral table

Exactly! This is where SQL hurts me the most: not being able to store (partial) query expressions in variables for later reuse. The only way to do this is by creating explicit views (requires DDL permissions) or executing the partial query into a temporary table (which is woefully inefficient for obvious reasons).


Isn't that what a common table expression is? Basically a pseudo temp-table to break down queries. Of course, they also allow recursion, which you can't do with a temp table.


Yes, but a common table expression is still bound to only one query. You can use it multiple times within the same query, but you still can't save a common table expression in a variable and re-use it in multiple queries.

This is what I'd what to do if common table expression really were common:

  SELECT c1, c2
  FROM DifficultJoinStructure
  AS myCte;

  WITH myCte
  SELECT c1, c2
  WHERE SomeCondition(c1);

  WITH myCte
  SELECT c1, c2
  WHERE SomeCondition(c2);


So.. a view then?

Granted most environments effectively treat views as DBA/Sysadmin owned objects, especially where end users/apps are effectively sharing one, or a small number, of user accounts.

But given user=schema aspect several of the traditional databases, I get the impression the original intent might have been a little more laissez fair?

Of course the same can be said for tables, and that was perhaps a little idealistic!

Views aren't always quite as composable as you'd like either, or maybe I'm just scarred by the particular DB engines I use most.

So I actually agree with you, but unfortunately SQL requires that the "WELL AKSHWELLY" be followed by one or more "BUT" clauses.


Agreed, it is not just an "IDE related thing", it is also a logical arrangement of thoughts thing, a readability thing.


And as another commenter noted, the underlying relational algebra is also not in agreement, so it is definitely not logical. I believe they wanted to mimic human language statements, but that goal hurt more than it helped.


It makes way more sense honestly. It also looks a lot like a functional pipeline if you set it up that way.


> I now understand that it is an IDE related thing not something fundamental to the language.

No, is fundamental issue to the language!

The relational model is clear. You START with a relation and then compose with relational operators that return relations.

ie:

    rel | project
Sql do it weird. Is like in OO, where instead of define a class THEN define the properties, you define the properties THEN define the class.

And this fundamental issue with the language goes deeper. The rules are ad-hoc for each sub-operator despite the fact using relational model MUST make it simply to compose.

So, you have rules for HAVING, GROUP BY, ORDER BY, WHERE, SELECT and so on and none are like the others, are different in small but annoy ways...


Having SELECT come first makes sense to me because it's the only part of the statement that's required. FROM and everything else is optional.

Also when reading a statement, you're mostly interested in what the returned fields are rather details like where they came from or how they're ordered. It kind of makes sense to put it at the start.

Maybe other syntax forms have their benefits, specially when writing, but I don't think SQL's choice is completely senseless either.


> SELECT come first makes sense to me because it's the only part of the statement that's required.

Only *IN SQL*.

You don't need it on the relational model, heck, no even in any other paradigm:

    1
That is!. (aka: SELECT 1)

So this:

    SELECT * FROM foo
is because SQL is made weird. More correctly, this should be only:

    foo
Also, SELECT is not required all the time, you wanna do:

    foo WHERE .id = 1
    foo ORDER BY .id
    foo ORDER BY .id WHERE .id = 1 //Note this is not valid in SQL, but because SQL is wrong!
But you probably think this as weird, because SQL in his peculiar implementation, that is, ok for one-off, ad-hoc query, and in THAT case, having the list of fields first is not that bad.

But now, when you see it this way, you note how MUCH nicer and simpler it could have been, because then each "fragment" of a SQL query could become *composable*.

But not on SQL, where the only "composition" is string concatenation, that is bad as you get.


SELECT itself should be optional. Languages with expressions are fairly intuitive, e.g. "int x = foo.bar;" where "foo.bar" is equivalent to the "SELECT bar FROM foo;" SQL statement. I don't breathe SQL every day, so I'm struggling to come up with a case where removing SELECT results in parsing ambiguity.


> I'm struggling to come up with a case where removing SELECT results in parsing ambiguity

It's actually useful to the person reading the code. It clearly defines where a statement starts, what it does and makes reading a query close to reading English. Show a SELECT FROM WHERE query to someone who does not know SQL and the person will understand it. It might be a bit harder if you remove the SELECT.


It clearly defines where a statement starts

And this is another example of the ad-hoc problems of SQL: query terminators (;) are optional. If they weren't, there would be no abiguity where a statement would start: it's the first word after the previous terminator.


> I'm struggling to come up with a case where removing SELECT results in parsing ambiguity.

Oh! Is super-ambiguous! Make the parser and enjoy it!

Lets make this more concrete:

     city SELECT id
     city ORDER id
     city id <-- Order or select???
You could then "favor" projection as the most important than the others. Ok, so:

     city city city
Which is the table, or the field?


I didn't suggest removing ORDER or FROM. This still makes sense:

    id FROM city ORDER BY id
A plus is that sub-selects have more natural, expression-like syntax:

    id FROM (city WHERE elevation > 1000)


How do you know if you are SELECTing or DELETEing?


You mean the keyword; I meant the clause. "FROM foo" is optional to the syntax.


Select doesn't mean that tho. Not in the relational model. It refers to the filtering clause.

Following their stupid English syntax, but using the proper verbs it should rightfully be "PROJECT x, y FROM foo SELECT WHERE a = b"


Having learned Prolog before SQL, it was weird when it clicked that both were relational languages, but SQL decided to hide that underneath a natural language facade and the inconsistencies that come with it.


Yeah. And SQL does it very weird because it mixes up the meaning of "selection". Selection doesn't mean what most people seem to think it means. In the relational algebra "select" is really what is happening in the "where" clause. Projection is the name for the thing that chooses the attributes ("columns") to put in the final result.


Imagine a table `Foo` with the schema

`FooId, name, date, favorite_color, active`

Now, you want to pull the ID for the latest `foo` for a specific date, but you don't know any of the column names.

The modern workflow looks like this

You write

`SELECT * FROM Foo`

then you say, "Ok, now I can get autocomplete"

`SELECT FooId FROM Foo`

"Ok, now I can write the where clause"

`Select FooId From Foo WHERE date=?`

It becomes an exercise in moving the cursor around just to get the autocomplete going.

If you are really familiar with the schema, not a problem. But if you just remember a few details about it, then you are stuck in this weird back and forth cursor moving thing.

That's why it'd be more ergonomic to have something like

`From Foo Select FooId where date=?`

Because you never need to move your cursor and you could get all the autocomplete you need at the right times.

This becomes especially true when writing joining statements

    FROM Foo f
    JOIN Bar b ON f.FooId = b.FooId
    SELECT f.FooId
    WHERE b.active = 1


Like the commenter alluded to, it allows accurately constrainted type-ahead. If a query starts with "SELECT FROM my_table " and expects one or more column names at that point, your IDE can already suggest the column names from my_table (and not from any other table).


As sibling comments have mentioned, this is a valid complaint. It is useful to remember that SQL is an old language already, and there are plenty of warts that in all of this time have been observed. The same way C is old, and there are things to be apreciated about fresh attempts like Zig, Rust, Swift, Nim.


you seem to be too used to it to notice...

you start to type SELECT some, columns and can't get autocomplete until you add FROM afterwards, so you either type the query inside out (SELECT FROM table and go back to after SELECT) or just give up.


> thing not something fundamental to the language.

It is fundamental to the language. The evaluation order is from,where,group by, having, select, order by, limit.

Everything in perfect order is, select except.


I think this stems from a misunderstanding of the entire point of SQL. It isn't about looking at data in an individual table, it's about retrieving a Result Set for complex queries.

All the FROM-first examples I've ever seen are almost universally the simplest query in the world where autocomplete is not a large hurdle anyways because you aren't even bothering with table aliasing. As soon as you do anything even moderately complex (multiple joins, subqueries, calculations, etc.) the advantage of putting FROM first vanishes, and if you're adding a table alias or hard reference to the table you can already see what is in the SELECT list AND in many cases get autocomplete.


> I think this stems from a misunderstanding of the entire point of SQL. It isn't about looking at data in an individual table, it's about retrieving a Result Set for complex queries.

No misunderstanding at all. It’s just more natural to first describe what the sources of the data are, joins etc, and then afterwards which columns you’d like to retrieve, or what calculations you’d like to perform etc.

The current way of having select first is just plain dumb. When it comes to reading order, you never actually know what is selected until you’ve read through the from, where, group by and having clauses anyway, so you constantly have to jump back and forth between dart and end to see the context. And it also better matches the sql evaluation order to have from first and select is after all of these.

Select also just better fits with order_by and limit since it’s also about restricting results after you’ve gather them all up and stitched them together etc.


> When it comes to reading order, you never actually know what is selected until you’ve read through the from, where, group by and having clauses anyway, so you constantly have to jump back and forth between dart and end to see the context.

Let's assume this was accurate, and reframe it to discuss putting FROM first in the query:

    When it comes to reading order, you never actually know where what you've selected is coming from until you've read through the WHERE, GROUP BY, HAVING, and SELECT clauses anyways, so you constantly have to jump back and forth between start and end to see the context.
This is equally true if FROM comes first since you have no idea if customer_name in your SELECT way at the end actually comes from a table directly, a subselect, calculation, etc.

I would be interested in seeing what a complex WITH or subselect would look like if you're putting FROM first, and just how much actual clarity you're getting out of reordering it.

I think the benefit is nonexistent and the harm is that you're going to further fragment an already fragmented syntax space for that nonexistent benefit.


> This is equally true if FROM comes first since you have no idea if customer_name in your SELECT way at the end actually comes from a table directly, a subselect, calculation, etc.

When you get to the point where you see customer_name, you know what it came from because everything you read prior to that.

Imagine going into a shop to order three burgers and saying “the first one is without onion, the second is without onions and the third without picked, I’d like a big Mac, a quarter pounder and a hamburger please.” It’s just not the natural reading order.

> I would be interested in seeing what a complex WITH or subselect would look like if you're putting FROM first

The premis for this seems to be that your arguing that moving SELECT to the start of the query even though this isn’t he natural evaluation order nor reading order somehow helps to increase readability. But I don’t believe the there’s any reason to assume that. Just because it’s how things are currently and how you learned it doesn’t mean it’s better, it just means you’ve grown accostumed to it and therefore you feel it’s simpler. But everyone learning SQL from scratch who don’t initially learn actively that select isn’t applied in the order it’s written ends up making the same silly mistakes and being confused about the same things that would be obvious if we just put it where it actually belongs.


I feel this also follows the concept of narrowing down from a large set to a smaller set. Using FROM first is the large set and then you're selecting from it.


Joins would presumably still be in the from-clause, so you get:

   FROM a JOIN b JOIN c
   SELECT b._
Here aoutocomplete helps also with the joined tables. Autocomplete would also help in subqueries:

  FROM a JOIN (FROM b WHERE b._


That's a fair point. This comment made me realize my own tendency to write "select * from..." so that I could supply the "from" before going back and replacing "*" with specific columns.



Wow, that looks incredible. It looks like you can pretty much refine your query line by line and see the intermediate results to know what you're working with.


Love PRQL :D


Several decades ago while using Oracle and SQL*Plus, I used to generate the column list for queries from the data dictionary table (ALL_TAB_COLUMNS). Once I learnt that trick, I never typed the column list ever again. Eventually, I had a library of queries for common tasks. You could use that trick with almost all database engines.

IDEs were never a favorite (they were quite limited then).

Today, the situation is a lot better.


Configure your IDE, to help you write SELECT queries. It should be easy to insert a template with jump points in the desired order e.g., a generic YASnippet in Emacs should be able to do it.

Thus by the time you are writing the column expression after the SELECT, the FROM table expression would be filled already (enabling scope-dependent column completion).


FWIW in LINQ, Select comes after From and Where.


Yeah, Select comes last, and it makes the most sense.


Also in Kusto Query Language (KQL), which is used extensively in Azure.


I don't know why my memory jogged to the old ruby on rails screencasts but I suppose there's nothing stopping IDEs from jumping to the FROM section of a query and then returning you back to the SELECT in some sort of "macro"/snippet. It's a hack, I guess.


That’s exactly what Jetbrain’s Datagrip does.


From what I've seen, it will initially populate autocomplete column name values from all the introspected tables on the current schema search_path, even before the FROM clause is added. Usually this is good enough for me, so I don't think about it. I mostly interact with postgres.


Just a few years ago Javascript was pretty awful, so people started writing transpilers (most famously Babel) to add features like arrow functions, async/await, decorators, constants, etc. Typescript is another transpiler that greatly improves the language but outputs regular old, awful javascript. Many of these features even ended up being built into the language.

I wonder if it’s time for a transpiler renaissance in SQL? It would be trivial to convert “select id from users” into “from users select id”. I’m sure theres a lot of other cool feautures/sugar you could add while still outputting valid SQL.


Maybe we should treat SQL like JavaScript, and use it as a compiler target instead of coding in it directly. /s


Not sure why you added /s; .. SQL is in many ways exactly lile JavaScript. The only way to run your code where it needs to run, but with some things to be desired language wise.

I wish more languages compiling to SQL were more common.

(Not ORMs though they just miss the point entirely..)


I mean, that's pretty much what ORMs do, right? Hasura et al too.


Yes, this is true. Although my experience with ORMs is that there is always a reason to use some kind of escape hatch to run raw SQL directly.


Agreed. After spending the better part of a decade watching Hibernate a) fail to scale or b) get confused beyond a third level of association, I find myself writing bare SQL and/or refactoring to implement the repository model. ORMs are great for prototyping or if your data model is simple; beyond that you're better off without them.


Most higher level regular programming languages will allow you to escape to C, and C allows you to escape to assembly, so that's not unusual.

In my experience when you need to escape to raw SQL there is still some kind of query builder employed, which too becomes another "compile to SQL" source.


I know you're being sarcastic, but yes!


Agreed.

I would also love to be able to specify the WHERE just after the FROM but before the JOINs. And also after them: - the first one to filter the rows before joining - the second one to filter after joining


You could improve type ahead by just pouring effort into the IDE. Especially with the amount of resources we have nowadays. Easy enough to have basic typeahead on basically all possible columns when writing the select, and then you could use the columns as a filter on the tables during auto complete.

In general, this isn't done. But I don't see any technical reason it can't be done.


This strikes me as solving the wrong problem. It's not as useful to be able to derive a table from selected columns -- the desire is to complete columns from a table. It's both more intuitive and helps the user more than once.


I disagree. Folks often know what column they want, and have to find the table that best gives it.


If you have multiple tables, how do you avoid suggesting a wrong column, before filling out the table name?

You could suggest all columns up front, and then afterwards tell the user “this suggestion doesn’t exist”, which would just erode trust in the autocomplete.


Start typing “SEL”, ide suggests “SELECT _ FROM”, you press enter and the above text is entered, while the cursor is placed after from. You write that part of the query and after pressing enter it will jump back to the select part.

This is done already by Jetbrain’s datagrip for example.


Suggest all. And with modern UIs, you can hint the table with the column.


If you don’t wrap your UPDATEs in a transaction you are doing it wrong.


You can just put SELECT * FROM stuff and then edit your columns later.


We have transactions to prevent mishaps, no?


"Despite the improvements listed above, in a 2014 interview, CJ Date said “we didn’t realize how truly awful SQL was or would turn out to be (note that it’s much worse now than it was then, though it was pretty bad right from the outset).” This quote leaves me wondering – if Date himself were to write an updated critique, what would it look like? My best guess is most of his criticism would revolve around further departures of SQL from the relational model, but specific examples escape me."

I am not sure how the author has missed the copious amount of material that Date (& Darwin) wrote on this topic? The obvious one being the Third Manifesto (whole book available here [1] since it's out of print now). But even "Database in Depth" [2] has extensive discussion about this despite being a less "opinionated" and "ranty" book (highly recommended BTW)

Yes, not explicitly updated since 2014, but there's really nothing new to say since then?

Date is definitely concerned about the departures from the relational model in SQL. But it's (EDIT) not just about purism. He's also pretty in general peaved off at SQL for similar concerns that the author presents here about syntax and expressiveness. Third Manifesto's "tutorial D" mandates the addition of a richer type system, including higher level types that match on relation constraints, and "operators" (similar to OO methods) on those types; basically a kind of predicate dispatch.

I imagine he's pretty sick of being ignored.

[1] https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf

[2] https://www.oreilly.com/library/view/database-in-depth/05961...


I agree with everything you're saying, though I'm assuming here:

> Date is definitely concerned about the departures from the relational model in SQL. But it's just about purism.

Perhaps you meant to say "But it's not just about purism"?


Oops, yes, thanks for the catch. Editing.


In fact, Date has recently revised a whole slew of "old" papers of his, got them up-to-date wrt his current thinking, and (re-) published them. His "A critique of SQL/86" paper(s) are now chapter 8, 9 & 10 of "Database Dreaming, Vol I" (there's a Vol II too).


I've written a bajillion queries and have tons of nitpicks, but it's the twin meanings of NULL that really kills me.

NULL can be the value of a field in a record, but it is also used to indicate the lack of a record in a JOIN.

If I run:

  SELECT x.a, y.b FROM x LEFT JOIN y on x.a = y.a
and I get back

  [5, NULL]
I have no way of knowing if that means there's a record [5, NULL] in table y, or if there's no record in table y that starts with 5.

Obviously there are lots of workarounds, but to me this is a truly fundamental design flaw. SQL should have specified another NULL-like value, call it EMPTY, used only for joins that find no data. And analagous to IS NULL, it would be checked using IS EMPTY.


> but it's the twin meanings of NULL that really kills me

NULL only has one meaning: NULL. This is roughly analogous to unknown.

The one that his a lot of people is WHERE <value> NOT IN (<set>) where <set> contains a NULL. Because NOT IN unrolls to “<value> <> <s1> AND <value> <> <s2> AND … AND <value> <> <sx>” any NULL values in the set makes one predicate NULL which makes the whole expression NULL even if one or more of the other values match.

> I have no way of knowing if that means there's a record [5, NULL] in table y, or if there's no record in table y that starts with 5.

Not directly, but you can infer the difference going by the value of (in your example) y.a - if it is NULL then there was no match, otherwise a NULL for y.b is a NULL from the source not an indication of no match.

> SQL should have specified another NULL-like value

This sort of thing causes problems of its own. Are the unknowns equivalent? Where are they relevant? How do they affect each other? Do you need more to cover other edge cases? I have memories of VB6's four constants of the apocalypse (null, empty, missing, nothing).

This is one of the reasons some purists argue against NULL existing in SQL at all, rather than needing a family of NULL-a-likes.


> NULL only has one meaning: NULL. This is roughly analogous to unknown.

Which is what I'm arguing against, because it's used in two unrelated ways in SQL -- as a data value, and to express no matching row found in a JOIN. So no matter how it's defined formally, in practice it has two meanings that have nothing whatsoever to do with each other. One is a value and can be stored, the other says 'not found' and results only from expressions and isn't for storage.

> but you can infer the difference

Yes, as I said there are lots of workarounds. But they're still workarounds.

> This is one of the reasons some purists argue against NULL existing in SQL at all

Sure, but the reality is that NULL is used in a practical sense to mean "no data entered". It's so ridiculously common that only some columns have no data for a particular row, and so e.g. you need the 'time_finished' column to either have a valid date, or it's NULL if the activity hasn't finished yet. The alternative is to have an additional boolean column 'is_finished' and for 'time_finished' to be arbitrarily '1970-01-01T00:00:00Z' whenever is_finished is false, which is clunky and redundant.

Purists can argue what they want, but NULL is so ridiculously useful as a stored data value it doesn't really matter.


> you need the 'time_finished' column to either have a valid date, or it's NULL if the activity hasn't finished yet

Rust’s Diesel has an interesting way of dealing with this. “None” means missing while “Some(None)” means the value null. So when updating a record, “None” makes no change while “Some(None)” sets the value to null.

This distinction is obviously lost when retrieving from the database, but it’s an interesting concept.


Javascript having both `null` and `undefined` doesn't seem so crazy now


But in classic Javascript style, it's half-assed. Being able to assign anything to undefined makes the distinction between null and undefined pointless.


If you have a business need to represent "empty" or "n/a" or "declined to answer" or something like that, use a specific value for that. NULL does not mean anything. Or, it means nothing. It's just NULL. Once I got that into my head, SQL became less frustrating.


You can't without defining yet another field, usually. It's really annoying to double the number of columns so that [time_started, time_finished] becomes [time_started_exists, time_started, time_finished_exists, time_finished].

NULL values makes business logic far more compact and intuitive. For enumerated values in fields it's easy enough to define another value in the same field to mean 'unknown' or 'not entered', but you can't do that for strings, numbers, datetimes, etc. -- you have to throw in a bunch of unwieldy additional boolean fields instead.


The underlying problem here is that SQL lacks Sum Types (aka Tagged Unions). Such types solve all these problems effortlessly.

In contrast to what SQL has, Sum Types combined with Product Types (which is basically what a row is) are actually a universal way to model all possible data[0]. (Of course you may want syntax sugar, etc. on top of that, but Sums and Products at the bottom is sufficient.)

[0] I'm actually not sure if I should qualify that -- I believe Sums+Products can actually model anything, assuming you allow recursive type definitions -- which might be hard to make perform well. Storing a linked list in a database field, e.g. might not be the best idea.


Indeed, the issue of non-existence is a somewhat tricky philosophical question (e.g. there are many ways something may not exist).


A better option in many cases is to check the primary key.

e.g.

select questions.Id , questions.Text , answers.Id , answers.Text

from questions left join answers on answers.QuestionId = Questions.Id

answers.Id is non-nullable as a primary key, so if answers.Text is null but answers.Id is not null they've declined to answer.


That may be implementation or circumstance specific. For instance in MS SQL Server with a heap table (one without a clustered index) or a table where the primary key is not the clustering key, it will result in extra page reads to check the other field's value (the query planner / engine could infer from it being the PK that it can never be null, so the lookup to check is unnecessary, but IIRC it does not do this). As the columns used in the join predicate have to be read to perform the join, no extra reads will result from using them for other filtering.

In your example it is very likely that the primary key is the clustering key, so will be present in the non-clustered index that I assume will be on answers.questionId, making my point moot, but if for some unusual reason neither Id nor questionId were the clustering key checking Id may result in extra reads being needed.

In DBMSs without clustering keys implemented similarly to SQL Server, there may be such concerns in all cases.


> This is one of the reasons some purists argue against NULL existing in SQL at all, rather than needing a family of NULL-a-likes.

So then are there no optional values? What happens when OUTER JOINs don't match?


I assume NULL would still exist there, but there would be no explicit NULL values permitted in tuples (rows) that actually exist.

Or perhaps the purists would remove outer joins too, it isn't since University that I've read around the discussion, but given alternate syntax to do the same thing can sometimes be convoluted that might be a bad idea itself.


In the relational model, no there is no such thing as "optional values".

Even Codd often hinted (quite a bit of material in the 1990 book gives such hints) that "if a row has a null somewhere, it SHOULDN'T BE CONSIDERED AS BELONGING TO THE RELATION". E.g. if an attribute of a FK is null, no FK checking should be done at all on account of that null. The FK checking should be done only when the FK value is "complete".

As for outer join, one problem with it is that its result is (intended to be) the extension of a disjuctive predicate, the disjuncts in question being exactly the predicate for the "matching" case and the other one the predicate for the "no match" case. Now if you see a row in the result, how can you tell from that row which of the two disjuncts made that row appear in the result ?

Outer join is problematic by definition precisely because of the possibly ambiguous interpretation that derives from its disjunctive predicate.

(Would you do that with base tables ??? I mean design a base table that can hold a row for strictly more than one possible reason ??? I mean where the user then sees a row in that table and subsequently cannot tell whether that row is there for reason/meaning 1 or for reason/meaning 2 ??? I'll tell what you -and anyone else- would do in such a case : you'd also add the indicator telling whether it's meaning 1 or 2 (and then the problem of 'what if it's both' is also solved). Now think back about what could be done in outer join to resolve that very same kind of disjunctive ambiguity.)


You don’t do outer joins.


Don’t these exist for a reason?


Yes. The reason is lazy people expect to be able to ask five questions in one go and expect one single answer to answer all five of them.

The key is to realize that submitting a query is to ask a question. So how many queries to you think you have to issue if the number of questions you have is five ?

It is technically perfectly possible to "ask five questions at the same time" to any given truly relational DBMS. Said DBMS will give five answers at the same time (and not leave you to guess which answer corresponds to which question). That SQL doesn't do it, does not mean it's impossible.


Of course.

You've got a list of countries and are pulling each country's national flower, national bird, largest port city etc.

Without outer joins, Liechtenstein with no ports doesn't show in the list at all. Sad news for people who want to know all countries, or Liechtenstein's national bird (eagle).


You’re not obliged to pull everything in one request. You can issue several requests.


Sure. You're also not obligated to include a WHERE-clause in the query you send to your database. You can do the filtering in the application.


What's your point? You can live without nulls, they're not required and you don't need to avoid WHERE-clause for that. It's a theoretical concept, nobody in sane mind would do that, but nulls are not required for relational algebra.


You started this subthread by asserting that one shouldn't use outer joins, and when other people pointed out valid use cases for outer joins, you advocated to use application logic to work around your refusal to use them.

My response was just to point out the absurdity of your premise.


"NULL only has one meaning: NULL. This is roughly analogous to unknown."

From the re-published version of "Much ado about nothing, part 2" (re-published in "Database Dreaming, Vol II) :

"An outer join produces nulls because we ask the DBMS to show us data from one table even when no matching data can be found in another table. Such nulls have no meaning, no semantic content."

Personally I'd add to that : and even if it's the case that they do, that meaning is patently not "simply unknown" but rather a very clear indication of non-existence.

But I have no doubt the apologists will obviously always apologize no matter what and handwave arguments such as these away with "So what ? What's the difference." Well, it's the difference between knowing of non-existence and not knowing at all.


"The one that his a lot of people is WHERE <value> NOT IN (<set>) where <set> contains a NULL. Because NOT IN unrolls to “<value> <> <s1> AND <value> <> <s2> AND … AND <value> <> <sx>” any NULL values in the set makes one predicate NULL which makes the whole expression NULL even if one or more of the other values match."

Sorry, but this is FALSE.

"Even if one of the other values match" will make that particular INequality test FALSE and one FALSE conjunct will make the entire conjunction FALSE, even in 3VL.


NULL itself is open to interpretation.

If I CREATE TABLE foo (bar CHAR(1) PRIMARY KEY, baz char(1) UNIQUE), then different things happen on different databases.

In Microsoft SQL server, only one insert of a null into the baz column is allowed, and the null value is indexed.

In Oracle, null is never indexed in this context, so any number of null insertions into baz are allowed.

On a composite index, I believe that nulls are always indexed, on any database (they must be).

As far as I understand it, this is implementation-defined:

  SQL> select * from dual where null=null;

  no rows selected


WHERE NULL=NULL should never return true, so no rows returned is correct. It is only implementation defined in that some implementations get this wrong!¹³

SQL Server's handling of NULLs with in unique indexes/constraints is not standard. You can get better using a filtered unique index (“CREATE UNIQUE INDEX more_compliant_unique_index ON dbo.some_table(this_column) WHERE this_column IS NOT NULL;”), but be aware that filtered indexes have their own collection of gotchas⁴.

----

[1] MS Access IIRC, MS SQL Server if the ANSI_NULLS option set to off, MS² SSIS's default (consider NULLS equal) behaviour for merge steps, and so on.

[2] You may be noticing a pattern here… MS are not the only culprit though.

[3] I'm making a distinction here between places where behaviour is not defined in the standards, so implementations are free to go their own way, and where implementations go against what is expected by the standards.

[4] particularly with respect to the collection of unfixed/wontfix gnarly edge cases that is the MERGE statement.


A nullable column can always be extracted to a separate table with a non-nullable column and a foreign key. If you left-join this back with the base table, you will get the nulls again for the missing values. So to me it seem nicely symmetrical to use the same kind of NULL value.


I'd check to see if y.a IS NULL in that situation. I'm sure there are cases where it matters, but most of the time for me the difference between "there's a row in y, but the value is NULL" and "there's no row in y, the value is NULL" is irrelevant. I can't think of a time when that distinction has been important, and I'm working on a project converting thousands of complex SQL queries.

The thing that really bugs me about NULL is the default assumptions - 99.9% of the time I want NULL in a string context to be the empty string and NULL in a numeric context to be 0, but I have to use ISNULL or COALESCE to get that. I wish it were the other way round where NULL is treated as '' or 0 by default, but I can do something special if I really want to test for NULL'ness.


Interesting, we must work with really different data/queries.

I've come up against the distinction lots of times (and yes have to retrieve additional fields in order to address it), while I don't ever want to confuse 0 with NULL. Tons of things are legitimately zero but crucially non-null, like an inventory count. (Empty strings, on the other hand, do seem much more interchangeable with NULL in probably the vast majority of contexts.)


'' is NULL in Oracle. People complain about it, and it always seemed a little weird to me.


As another commenter said, if your table have required ID columns, you should specify it in your SELECT.

     SELECT x.Id,x.a,y.ID,y.b 
     FROM x 
     LEFT JOIN y 
       on x.a = y.a    
The output [Id12345,5,Id6789,NULL] means the column is null. The output [Id12345,5,NULL,NULL] means the row is empty.

On a side not, I use the SQL below all the time to identify missing rows on the "comparetable".

  SELECT #basetable.columnname
  FROM #basetable
  LEFT OUTER JOIN #comparetable
  ON #basetable.columnname1 = #comparetable.columnname1
    and #basetable.columnname2 = #comparetable.columnname2
    and #basetable.columnname3 = #comparetable.columnname3
  WHERE #comparetable.columnname1 is null


Fabian Pascal made a whole career out of writing about how stupid "NULL" is in SQL. See if you can find his writings some time. He was positively livid about it.

Notably NULL of any kind is completely absent from the relational algebra. There are other ways to express absence. Notably, this is all supposed to be first-order predicate logic <hand wave hand wave> stuff, and... in that world "null" is also not a "thing".

Also SQL insists on allowing duplicate "rows" in its "tables"; whereas the relational model is very clear that the contents of a relation ("table") is a set, not a bag.

These two confusions in SQL actually lead to many many data integrity problems as follow-on effects and also complicate query planning and optimization.


> I have no way of knowing if that means there's a record [5, NULL] in table y, or if there's no record in table y that starts with 5.

You can always add SELECT y.a, which will allow you to disambiguate between the two options (it will be non-NULL in the first case and NULL in the second).


You need one more output column. The specifics will vary by RDBMS because SQL isn't that standard. The most portable thing would be to do this:

  SELECT x.a, y.a IS NOT NULL, y.b
  FROM x
  LEFT JOIN y ON x.a = y.a;
assuming y.a is a NOT NULL column, or a PRIMARY KEY column, since PKs are supposed to be non-nullable.

In PG you could:

  SELECT x.a, y IS DISTINCT FROM NULL, y.b
  FROM x
  LEFT JOIN y ON x.a = y.a;


The predicate corresponding to an outer join is fundamentally disjunctive in nature (any outer join is equivalent to a UNION of at least 2 SELECTs) meaning that "after the fact", there is no way to determine which particular one of the disjuncts made the row appear in the result.


Why does it kill you? A NULL in a row is semantically identical to the row never existing in the first place, unless there is some other non-null column (the primary key) that lets you disambiguate between the two cases. It's hardly a problem in practice because all good tables should have primary keys.


This is why JS has both null and undefined.


Ah, JS, the pinnacle of data correctness. 1 == "1" results in undefined, right?


nope 1 == "1" is true and 1 === "1" is false


How do you expect `y.b` to behave on EMPTY rows? Should it return EMPTY or NULL?


you should not need to distinguish between empty and null because they mean the same thing. if you want to select records from x that aren't in y you should use an anti-join (where not exists, not in, etc.)


This is not true--especially when considering different SQL implementations (e.g. Oracle SQL versus Microsoft SQL). NULL and EMPTY handle the intersection of ontic versus epistemic claims.

EMPTY implies a known, 0-byte value whereas NULL can imply either an unknown value or the "unknowability" of a value (i.e. the in-existence of a value).

In practical terms, this would be like equating the statements "I don't know whether that dog has a name" (i.e. a NULL name) and "I do not know the name of that dog" (an EMPTY name). The former does not assert the existence of a proper noun to represent "that dog", whereas the latter implicitly asserts that there exists a proper noun which nominates "that dog".


beg pardon I wasn't aware sql actually defined empty. of the engines I'm familiar with there is no keyword for empty like what gp is asking for (and I hold my argument for why it should not be necessary). it sounds like what you're describing is a blank string and I agree with using blanks to distinguish between "has no phone number" and "we don't know their phone number" but that isn't what gp was talking about.


Also you should select y.a to know wether the y-record exists.


Good point, it would be nice to have both NULL and EMPTY to split these meanings.


To piggy-back, it bothers me so much that this is valid syntax in many implementations:

UPDATE x SET a = NULL WHERE b IS NULL;

Like... wat?


Why are you joining on a nullable column in the first place?

If your database is well designed, joining on a nullable column should be a relatively exotic use case, and can be handled by writing a tiny bit more code to check for NULLs before performing your join.


It's not.

Assume y.a is the non-nullable primary key, while y.b is the column that may be null.

But in this example there might not be a row where y.a = 5. Or there might be. But you can't tell.


But in those cases, you can always tell by including the non-nullable key in your result set. It's already being evaluated, so it's virtually free (only adds to the network transport size).


The only thing I would really blame solely on SQL is that UPDATE and DELETE statements don't require you to specify a limit.

I have seen many times in my career where a rogue delete just truncates a table, a simple statement of intent (e.g. LIMIT 1) would tell the query planner that if it is about update/delete more than 1 row, it should error. In fact MySQL actually returns a warning if you do this.

TRUNCATE clearly states your intention to delete everything, but DELETE by default also deletes everything.

There were also definitely some bad paradigms invented as a result of SQL (e.g. all business logic lives in the database as stored procedures or stored functions), but that was bound to happen whenever a shiny new tool comes out and engineers want to use it.


> TRUNCATE clearly states your intention to delete everything, but DELETE by default also deletes everything.

I think the other way around.

“DELETE Something” to me means get rid of the whole of Something. “TRUNCATE Something” (in general English, not SQL) to me means to make something shorter, which probably doesn't mean removing it completely.

Perhaps “EMPTY Table” would have been a better choice than “TRUNCATE Table”, but by the time it reached the standard¹ TRUNCATE had long since been picked and until then it was not the responsibility of the SQL standard: you only had DELETE and without a filter it makes sence that this deletes all.

----

[1] note that TRUNCATE was not in the SQL standards until SQL2008, long after it was common in various DMBSs as a minimally logged alternative to DELETE-without-filtering-clauses.


> TRUNCATE clearly states your intention to delete everything, but DELETE by default also deletes everything.

btw. they are completly different.

TRUNCATE basically ignores all transaction semantices for the sake of performance (which can be really really bad) in mysql it can't be rolled back, in postgres it's not mvcc safe. TRUNCATE most often uses either an exclusive lock or some other mechanism.

if you know what you are doing truncate can be helpful, but delete has its benefits as well and truncate should be avoided by beginners.


The other important factor is permissions.

Truncate in MSSQL requires alter permissions, so a user that can wipe a table can also change its definition. Not true with delete.


"There were also definitely some bad paradigms invented as a result of SQL (e.g. all business logic lives in the database as stored procedures or stored functions),"

I hope you don't mean by that that integrity enforcement (checking the compliance of the data with ALL declared business rules) does not belong in the DBMS. Because the exact opposite is true : inside the DBMS is the only place where it belongs. Or if not the only, then certainly still the primary.


Require?

Some UPDATEs don't allow me to specify a limit!

Infuriating stuff.


So don’t COMMIT your UPDATEs.


While all the criticism is probably correct, and SQL definitely shows its age, it's still very much good enough for pretty much all its current applications. Also, SQL's basics are very easy to learn. These aspects make it very hard to imagine a language that might gain enough traction to actually replace SQL in a foreseeable future.


Many have tried. Most are hobby projects. None have had the full force of a real production ready DBMS behind them.

And really, projects like this are fighting the general ignorance the industry as a whole has about what a relational database actually is. It's better since the NoSQL wave crested and we stopped hearing stupid shit like "my data isn't tabular and doesn't fit in a schema", but there's still a preponderance of people who haven't stopped to learn the lessons that Date and Codd tried to teach decades ago, and constantly reach for graph/network/hierarchical databases without understanding why we originally tried to move beyond those back in the 70s.

Unfortunately because of this most attempts to do "better than SQL" end up not working from first principles and look like a dogs breakfast of ill-formed "that'd be neat" ideas.

And, yeah, as I said the "pure relational" projects that are out there are usually either academic projects or hobby. (Yes I've had a few of my own)

I think there might be hope in the Datalog-type systems that have emerged recently.


:-)

And those that are built on "first principles" end up looking like a dog's breakfast just as well because people expect to not have to learn a novel language.

People don't want to be told to forget everything they thought they knew. People don"t want to be told they've been wrong for 40 yrs. Even if you add that it's not entirely their own fault because they've been consistently misled by an entire industry.

You can crack the problem of supporting CREATE ASSERTION, they'll still dismiss it (handwave handwave) believing you're just bluffing.

And I mostly disagree with the optimism built on Datalog.


Any alternative to SQL has to transpile to SQL in order to gain traction.


Which right away rules out a whole bunch of more sophisticated and elegant behaviours, honestly.

The other alternative might be to implement one's new thing as a patch to alter the frontend of Postgres. I looked at this many years ago and the engineering effort was immense. But it might be easier now.


Why does it rule out 'more sophisticated and elegant behaviours'?


If your language has to interoperate with data in the SQL database, it right away has to deal with SQL's limitations and oddities:

1) SQL's type model has no concept of nested relations or rich datatypes; it's limited to basic scalar types <tho there are vendor additions to this>, so implementing e.g. Date's concept of "relation valued attributes" <essentially tables-in-tables> for example would be impractical

2) SQL's storage model doesn't map 1:1 with e.g. a relational model because SQL tables are bags instead of sets (duplicate rows are possible). This has a couple downsides. a) whatever more expressive and relational algebraic language you built would have to inherit that model and and have some way of dealing with bags instead of proper relational ranges <which are sets> and b) the presence of the possibility of dupes actually makes some query planner optimizations impossible to perform. <CJ Date goes into more depth explaining this, I don't have room for it here>

3) Likewise similar problems with SQL NULLs. NULLs are not in the "pure" relational model, so your language can't be pure. NULLs mess with the type system, and you end up inheriting a bunch of the weird casting/identity/comparison issues that come with NULLS. And NULLs, again mess with the query planner.

Finally, the execution model of SQL is always against a DBMS table or view. So in-memory relations, etc. look difficult to implement.


Thanks for listing these. It's very useful.

(1) is a function of what SQL you might transpile to. Any SQL w/ rich types (like PG) or w/ JSON support (most others) will be able to meet (1) with varying levels of hackiness.

(2) even if some SQL RDBMS allows you to have tables w/o primary keys, the thing transpiling to SQL wouldn't be obligated to have tables w/o primary keys. Some SQL RDBMSes do not allow tables to lack PKs (e.g., SQLite3 doesn't), but even this wouldn't be a problem -if one wanted tables-with-dups- because one could use hidden row IDs to make up for that.

(3) just as with (2), the new thing transpiling to SQL wouldn't have to allow NULLs, and could generate SQL that uses `IS [NOT] NULL` / `IS [NOT] DISTINCT FROM NULL` to obtain the desired semantics if the new thing kept any notion of NULL.

(4) is not a semantics issue, and anyways, some SQL RDBMSes let you implement tables yourself, and for the others there's CTEs and TEMP tables.

Of these only (1) is a problem if you want to target pure standard SQL, though it depends on which SQL standard you target.

We're definitely not talking about transpiling to 1980s SQL!

So, while there might some semantic impedance mismatches for a transpiling approach, and the choice of RDBMS to transpile to matters, on the whole I'm unconvinced that there are such problems that are very serious. My own imagination is failing me in my attempts to come with such semantic impedance mismatches.


Interesting. Ok

> (3) just as with (2), the new thing transpiling to SQL wouldn't have to allow NULLs, and could generate SQL that uses `IS [NOT] NULL` / `IS [NOT] DISTINCT FROM NULL` to obtain the desired semantics if the new thing kept any notion of NULL.

You may disallow nulls stored, but anything like a non-inner join will produce nulls and then you're almost back to where you started.

> (4) is not a semantics issue, and anyways, [where 4 is "Finally, the execution model of SQL is always against a DBMS table or view. So in-memory relations, etc. look difficult to implement"].

Spot on and great catch. I missed that.

> I'm unconvinced that there are such problems that are very serious

Let me help you - simple sqlNG commands could transpile into more much complex normal sql target statements, breaking the optimiser. That's a very real risk.

Good post though


> Let me help you - simple sqlNG commands could transpile into more much complex normal sql target statements, breaking the optimiser. That's a very real risk.

Breaking the optimizer is a big deal, yes. I have this fantasy that we could have an out-of-band hinting system for SQL, like "start the query plan with table source <alias>", or "use index <index-name> for table source <alias>", or "materialize table source <alias> and create these indices for it", or "index the CTEs by..." (the last one should really be automatically done by the optimizer, but when you can save it thinking time, that can be a good thing). In-band hinting has generally been problematic. Out-of-band hinting could either follow the query itself, or be provided in a SQL exec API as a separate argument, or as explicit method calls in query setup in LINKQ-like systems.

> You may disallow nulls stored, but anything like a non-inner join will produce nulls and then you're almost back to where you started.

Now that just points out that the problem arises naturally and isn't entirely SQL's fault. Null values are as much a problem for SQL as for C and many other languages.

The way modern programming languages deal with nulls is by insisting on full pattern matching on algebraic data types. So it follows that one might want to do the same in a new query language, and/or possibly as an extension to SQL. So that this would be OK because at the top-level, but not as a subquery:

  SELECT a.name, b.thing
  FROM a LEFT JOIN b USING (id)
and one would be forced to check for absence (like nullity, but with the requirement that one must check every time it's possible that a value is absent). In SQL we have `coalesce()`, `IS [NOT] NULL`, `IS [NOT] DISTINCT FROM NULL`, and so on, so we can do this pattern matching, but we're not required to, and that might be the real problem with SQL and NULL values.

> Good post though

Ditto.


This is getting even more interesting.

> I have this fantasy that we could have an out-of-band hinting system for SQL,

I think not breaking the optimiser, and out of band hinting appear somewhat orthogonal but I'd be very glad to be wrong. I'm looking at OOB hinting for a different reason, but what you're thinking is very close to mine, and I'm struck by the fact you see hinting the same way I do (actual, and extensive, logical vs physical separation, inc. indexes as hints).

> and one would be forced to check for [nulls]

Even more interesting. I will shortly have something of relevance. Do you have a contact e-mail?


> I think not breaking the optimiser, and out of band hinting appear somewhat orthogonal but I'd be very glad to be wrong.

If you're writing a transpiler, you might as well make it an optimizer and use a solid hinting system to guide the target RDBMS' optimizer so as not to be at its mercy. If you were implementing from scratch rather than as a transpiler, you'd have to write an optimizer anyways.

Also, the reason optimizers can suck is precisely that there can be many ways to plan a query.

E.g., if you have 10 table sources INNER JOINed in sequence, you could start the query plan with any one of them, but not every one of those 10 tables will yield a good query plan when used as the starting table! You want to start such a query plan with the table that has a PK or index key that is most fully determined by the WHERE/ON constraints, by which I mean: the table source that will have the smallest cardinality given the constraints imposed on it by the query.

Sometimes the query that one has to optimize is a sub-query deeply buried in another, possibly in a VIEW, and the hinting needed may be specific to the context outside the view. Now inline hinting leads to having to inline sub-queries (repetition).

The many table sources INNER JOINed case is one I've actually run into.

Part of the problem seems to be that pushing constraints is a fairly advanced sort of query algebra: you need a good AST in order to be able to do it, and even then, it's not all you have to do to build a decent optimizer. If you look at the SQLite3 optimizer, it's gotten much better over the years at pushing constraints.

With every table source named, you can address hints about them, and you can do it out of band.

That or pick a target whose optimizer you can either work on or already is good enough for the transpiler's purposes. However, keeping things general is a good idea, especially when it comes to SQL (there's so much incompatible variation between SQLs!), so as to keep the transpiler generic and portable. The lack of a standard and solid hinting system is a real problem. Also, sometimes optimizers get worse.

> Do you have a contact e-mail?

  ${HN_username} @ gmail


SQL (the language that is, as defined by ISO) has COLLECT and UNNEST operators which come reasonably close (not 100% of course) to "Date's concept of relation-valued attributes". Moreover, I'm told those two operators are supported by at least one of the three big dogs.


1) yeah, probably.

2) Supose you're writing sqlng, compiling it to sql. sqlng simply mandates a primary key on everything. Not a problem.

3) nulls can be a horror but the alternatives aren't pretty. Removing nulls means tables have to be broken up, see https://www.dcs.warwick.ac.uk/~hugh/TTM/Missing-info-without.... I believe there are better ways to handle this issue.

(final line) Makes no sense, a tab;e doesn't have to be disk-resident. Once read, with sufficient memory, the file sits in ram cache.


I second u/zasdffaa's question. What behaviors does this eliminate? I'm sure it eliminates some, but I want to hear which ones you're interested in.


Agreed - my experience with non-SQL tech (ORMs) with Django and ASP.net MVC made me really appreciate SQL so much. Most of the time I felt everything would be so much easier using raw SQL instead of dealing with model objects. It also felt like in their quest to replace SQL and make things "simpler" they were recreating some db features again.


It can depend on your perspective. If you look at an app and just see tables and data then yes, model objects stand in a way. If you describe your app as this thing that manipulates the database then I can see how a layer of abstraction can be annoying. But people write 'billing apps' not 'data manipulator apps', so there is this pesky business logic. ORM helps you separate business logic from data access logic. If you don't need this separation then it just stands in the way. ORM is not perfect, but most abstractions leak to some extent.

I think it also depends on how the 'model objects' are implemented. Is it Active Record or Domain Model? I think most of the complaints about ORM are actually complaints about Active Record / DAO, or just a crappy implementation. Also nothing wrong with using SQL in an otherwise ORMed application. Just harder to unit test, refactor etc.


Yep. That's why I write most of my logic in stored procedures. Working with tables and queries is so much easier in PL/pgsql than dealing with ORMs and their leaky abstractions.

My application code just calls stored procedures. It's unaware of the tables and underlying data model.


There is no one-size-fits-all, but most of the time I would be against SP because:

1) SPs usually mix persistence concerns with business logic. Making it harder to understand business intent. I find objects much more expressive than raw data. Sometimes you want to concentrate on the plain logic, without worrying about how something gets saved. Also you will not have to rewrite everything if you ever want to change how something is stored. Sometimes people switch from mysql to pg, or even doc or kv database. Keeping business logic separate enables this. And good ORM enables this separation.

2) Refactoring tooling and unit tests. SPs are lacking here significantly compared to general purpose languages.

3) Business logic outside of DB allows easier horizontal scaling.


In my experience, the application framework changes much more often than the database.

PHP, ASP, ASP.NET, Angular, React, Django, whatever. Pick a year, pick a framework. The database stays steady.


I agree. But I found that approaching it _as_if_ database will change makes for a more focused model. Same goes for UI frameworks. Basically thriving toward Hexagonal architecture, without being too dogmatic.


Cool! I do this, but I haven't seen anyone else do it. Is any of your code public?

I wrote about it at https://sive.rs/pg

and posted my SQL shopping cart at https://github.com/sivers/store

Please contact me if you'd like to share tips: https://sive.rs/contact


No I don't really have any public code.

I read your blog post and I agree 100%

The database is the easiest place to code business logic that pertains to the data. Write it once, it's available for all client applications.


COBOL also fits this description, down to the awkward attempts at matching natural language.

I think the staying power of SQL comes from its broader audience - programmers, analysts, and executives all use it. It's much easier to motivate programmers to learn a new, superior language than it is to motivate executives to learn a new technology that only gives ergonomic improvements.


Many of those complaints seem theoretical. I like to focus on practical concerns. The biggest problem I see is that the SQL language has grown too complex. It's related to the "Lack of Orthogonality" problem mentioned in the article, but I see different solutions. SQL is not based on combinations of simpler concepts, but hard-coded keywords. But how to orthogonize (factor) it gets into philosophical differences. My favorite alternative is an experimental language called SMEQL (Structured Meta-Enabled Query Language):

https://wiki.c2.com/?TqlRoadmap

It's more API-like to reduce the need for specialized keywords. And you can "calculate" column lists via queries instead of have to list columns. For example, if a table has 60 columns and you want to SELECT all 60 minus two columns, you can't without listing all 58. With SMEQL you can use a query to return a list (table) of those 58 and feed it to the equivalent of a SELECT clause.

Things like CREATE TABLE are fed a table if you desire so you can "compute" your schema. You can generate entire schemas from data dictionaries. Anything that can be done with a table is. You can create in-line (virtual) tables if you want it command-driven, but it's still "using tables to do everything". You can use textual code to create such virtual tables or a table editor as it fits the situation. SMEQL is as close to "everything is a table" as I've seen. Even your FROM list can be table-ized. I used to do similar with dBASE/xBASE, and it was really nice, especially for ad-hoc work such as one-off research requests.

And as somebody mentioned here, null handling needs a serious revisit in SQL. I could rant all day about SQL null handling, especially for strings.


"Many of those complaints seem theoretical. I like to focus on practical concerns."

The only reason ever why people engage into theory, is precisely because of a deeply rooted desire to address "practical concerns". Hawking literally stated that his aim was to understand everything about the entire universe.

People who "like to focus on practical concerns" see an apple fall and (after they've seen it happen often enough and never ever otherwise) ultimately conclude that "apples fall and that's just how it is". Newton saw an apple fall and ultimately unveiled the concept of gravity. In doing so, Newton also unveiled the (admittedly extremely exceptional) circumstances in which it might be possible for us to see an apple "fall upward into the skies" after all.


Many "overly educated" (for lack of nice way to say it) just seem to often waste time on idealistic issues. It would take a while to give a decent example, but they just seem to not understand the business world and capitalism. The Newton example is not applicable because it's usually not about discovery, but weighing tradeoffs. I don't mind their personal R&D, just don't make the customer pay for or wait too long for the R&D. After work, they can invent all they want.


I agree!!! this is my pet peeve as well, and I sometimes fantasize about ripping into PostgreSQL and adding column-minus.

One wrinkle: computed columns would interfere with query optimization. That said (and here I speak heresy) there are times when syntactic convenience trumps performance.


Meta ability (dynamism) will generally have an average performance penalty. If you want speed, you hard-code more up front. Meta-ability generally is best for ad-hoc and short-term uses.

One performance trick is for the query optimizer assume the calculation will be the same as the last time the same query or sub-query was issued, and simply dump the results and start over if that assumption is false. If it keeps having to dump, then it assumes you are editing a lot and stops guessing for a day or so. (You could issue an optimizer command to resume guessing if need be.)


The syntax could be "* (minus, columns, here)".


PRQL (Pipelined Relational Query Language) to the rescue!

https://prql-lang.org


> "we didn’t realize how truly awful SQL was or would turn out to be (note that it’s much worse now than it was then, though it was pretty bad right from the outset)."

I wish the "truly awful" stuff I come up with was 0.1% as successful as SQL.


And Tony Hoare probably wished he hadn't made his "billion dollar mistake".

But at any rate, you could also try and ponder how unbelievably exceptionally awful it must have been that existed before SQL if something as truly awful as SQL could still be as successful as it has been.


I once read an article about SQL and how reordering the sections of a query would make it more ergonomic for users (iirc things like specifying what you want first and then how to present it last, more like a pipeline). I searched many times over the years but have not been able to find it again.



Perhaps? I remember the article arguing more comprehensively why a different order would be beneficial, but maybe I am just misremembering


I always appreciate blog posts like this, there are obviously cases where SQL shines, and in part I think the dataframe abstraction helps with filling a lot of the missing pieces that SQL doesn't handle so well (composability, debuggability, interactivity, etc.)

Even pandas (with all its faults) is more flexible as a language than SQL[1]. I'm of the opinion that there's a better together story in the end, but I guess we will see.

[1] https://ponder.io/pandas-vs-sql-food-court-michelin-style-re...


qSQL based on the concepts of ordered lists is more appropriate for many queries, examples available here: https://www.timestored.com/b/kdb-qsql-query-vs-sql/

Kdb the system that qSQL is ran within, allows full use of variables and all builtin functions with tables/functions/variable/columns. It really is a case of less is more. What this allows is functional form queries. Imagine being able to query: ?[tableName;lessThan(age;10)] and have perfect functional form representation for all queries. No ORM, no string concatenation. It seems some other database creators are at least becoming area of these things and integrating parts.


I really never want to assume order in tables. That's because there could be many orders of interest, and that's one reason to have multiple indexes. And also because for a table, the order of rows can be hard to guarantee. Obviously something like qSQL would try hard to guarantee table row order, and that's great in the cases where qSQL is useful, but more generally it's going to impose on how you work with the database.

On the other hand, SQL is set-based. But being set-based is weird in the world of computers because in memory everything is ordered. Sets are a fiction in programming, as they're always ordered in some way, and the set abstraction can only try to hide that order.

And that order can be very useful.

But there can be many orders that can be useful, but only one in which things are stored in memory -- the others can only be extra indexes.

So the general purpose thing (SQL) has to be set-based, offering explicit ordering, and taking advantage of actual order for optimization.


SQL is having somewhat of a moment in the bigdata world, thanks in part to 'modern datastack' and new age datawarehouses like snowflake,bigquery.

However there are a lot of pushback from 'traditional' dataengineers who were trained on spark/scala. Its bit of hardsell to go from a highly typed language to a free for all text based logic.

I think the following is needed for sql to be finally accepted as 'serious' contender.

create compiled sql language ( not pandas)

1. that compiles to sql and addresses some of the issues bought up in the post like nested aggregations.

2. make code reusable. Eg: apply year over year growth to a table that has the requisite columns. Compiler should check this in ide.

3. make materializations first class concept in the language. No seperate dbt layer.

4. crate a way to package and distribute libraries that you can import into your project .

5. a unit testing framework that makes it easy to test the logic without having to setup test tables in the database.


> I think the following is needed for sql to be finally accepted as 'serious' contender.

Whatever way you slice it, SQL is one of the most used languages today [0].

[0] https://spectrum.ieee.org/top-programming-languages-2022


right. I get lots of pushback for using sql at my clients. They just defeat me with one single point. "where are your unit tests" :D


> where are your unit tests

I do unit testing in SQL, and something I'm working on and use extensively myself (https://www.npmjs.com/package/sql-watch) indirectly supports unit tests.

There are also SQL testing frameworks available.


Yea I do that via dbt by setting up a mock data tables in database and using a macro to use those as sources/refs when run in test mode.

However what we are doing here isn't 'unit testing' its a black box integration testing. When I write equivalent code in scala, i just test the logic via unit tests,

eg: logic to filter out some orders that don't qualify, I extract method in scala code and just test that logic as part of development lifecycle. There is no dependency on a database.

Analog here is using selenium or some ui testing framework to test if button turns blue if order exceeds limit. Thats not unit testing.


> I extract method in scala code and just test that logic as part of development lifecycle. There is no dependency on a database.

Unit testing seems to depend on where the unit of code is which is being tested. At the middle tier, you may mock out parts of the code so the tests aren't reliant on external sources (apis, databases, libraries, etc.).

It seems that unit testing database code would happen at the database layer: it's still a unit test as the test isn't dependent on external sources.


> It seems that unit testing database code would happen at the database layer:

Its not database code though. "Orders over 100$ should be marked as vip" is domain logic.

Database code like interactions with database, connection pools, primary/secondary switching ect yes they should be tested with a database.


I was going to say the same. With the prevalence of embedded databases, and how cheap it is to stand up a container with non-embedded options, build time testing of queries has never been easier.


Dbt to the rescue!


I think these are great suggestions.

It seems like you're suggesting that someone could design a functional-style programming language that compiles to SQL.

2 & 3 are my biggest pain points. I can't just extract functions like I can with a regular programming language. Instead, SQL queries get increasingly complex with no great tools to manage that.

For 3, products like https://materialize.com/ look interesting for being able to create derived materialized views that can efficiently be kept up to date.


I'm not familiar with Ecto, dplyr, or DBT, but I would love an ML-like language to replace SQL. I'm imagining being able to pass a table (or any table-oriented data, like a sub-query) to functions that would type-check columns and would return table-oriented or scalar data. I'm not sure if this is actually possible in practice, but one can dream.

For instance, a "top 10" function that could be re-used on any table (apologies for my pseudo types and code):

  selectTop10 : Column -> Table -> Table
  selectTop10 orderByColumn table =
    SQL.selectAllFrom table
    |> orderDescBy orderByColumn
    |> limit 10

  limit : Int -> Table -> Table
  limit n rows =
    SQL.limitBy n rows

  orderDescBy : Column -> Table -> Table
  orderDescBy orderByColumn rows =
    SQL.orderBy [orderByColumn] SQL.Ordering.Desc rows


Not 100% sure about what you're suggesting, but wouldn't it be easier to pass your functions to your table/sub-query?

And that's exactly what you're able to do in most of the modern data warehouse services such as Snowflake. Inferences can be contained within internal/external user defined functions.

This is very reminiscent of made the big-data/map-reduce movement so notable, sending your query to the data instead of moving your data to the query. Sending your model to the data, instead of sending the data to the model.


I think I see what you're saying, and in my idea I'm suggesting the same - sending the function(s) to the data.


I don't quite follow what you are saying. In some SQL engines you can use row_number() function and derived tables (or CTE) to get top 10.

  SELECT Id,col1,col2
  FROM
  (
    SELECT 
    Id
    ,col1
    ,col2 
    ,row_number () over (partition by columnkey1, columnkey2 order by anycolumnwilldohere desc) as _row
    FROM _table
  ) as anytablealiaswilldohere
  WHERE _row <=10


My idea is you wouldn't have to cut and paste implementations like this. You would be able to just call the function with a column name and table name.


I'm currently doing work that uses Elixir's Ecto which goes a great deal towards what I think you're aiming at. I can write my SQL in a familiar, yet functional and composable style; while knowing what SQL will be produced in the end. Ecto, as I understand it, was inspired by Microsoft's/c# LINQ. I've not worked with that, but heard similar praises for that as exists with Ecto.

I'm saying this as with most of my experience being in SQL.


> It seems like you're suggesting that someone could design a functional-style programming language that compiles to SQL.

Not exactly but sort of this:

http://blog.hydromatic.net/2020/02/25/morel-a-functional-lan...


"someone could design a functional-style programming language that compiles to SQL". See e.g. R dplyr


DBT solves 2 & 3


Ibis might be an option. It has syntax similar to pandas and can compile to a number of types of sql, pyspark, or dask.

https://github.com/ibis-project/ibis


Its interesting,the author dismisses

>Mismatch with Host Language

But to me, i feel like this is the biggest problem with sql. Yes every language is different, but having different mental models nonetheless causes friction, and leads to things like ORMs which are often even worse.


I think we can do much better than SQL without losing its inherent power. Projects like Prisma and EdgeDB make me optimistic regarding the future of relational querying languages


the title doesn't match the first statement, which states a math unfact:

>A Critique of SQL, 40 Years Later 08.11.2022

>The SQL language made its first appearance in 1974, as part of IBM’s System R database. It is now over 50 years later, and SQL is the de facto language for

1974 isn't 50 years ago yet :)


The title did not refer to years passed since the introduction of SQL, but to years passed since the publication of the paper with the title "A Critique of the SQL Database Language", i.e. from 1984-11.

So the title is correct.


Title may be correct. First sentence is totally wrong. Why would I read further?


Author here. Fixed the mistake. You may now proceed with reading the remainder of the post ;)


It's off by two years. People often round 48 to 50. I don't mind TFA rounding 48 to 50; you seem to, so don't read it if you don't want to.


If it'd been 1971 and TFA still said "50 years later", would you have responded that "1971 is more than 50 years ago"?


I wish there was some smart machine learning technology that will take my shitty SQL developer queries and make it more performant - sometimes it is like sitting in front of a bubbling cauldron and invoking magic incantations.



All SQL database implement the actual Relational Calculus internally—it's required to implemented a SQL optimizer.

SQL is just a language for submitting Relational Calculus to the database (+ DDL statements).

If you wanted to, for instance, you could add a language alongside SQL in Postgres, submit the results to the internal RC optimizer, execute the optimized query, and get back the results.

In your new language, you can address all of the issues Carlin Eng/Chris Date identify in the article.


In "Stating the obvious", C.J. Date wrote 50 (FIFTY) pages solely on the subject of how SQL screwed up with the most fundamental operator of all : testing for equality. I'll repeat : FIFTY pages about everything that SQL got wrong with just the equality operator.

Go figure about the rest ...


I wish there were SQL "primitives" functions instead of the SQL language.

For example if I want to pick a single row by id, with SQL I must send a query string, which results in parsing, which means lower latency.

If I want to randomly select 100k rows among a database of 1 million entries, I need to build 10k query strings (I think?), which won't be fast to parse. I don't think this happens when using C pointers in C, C++ or arrays in other languages.


Isn't that what prepared statements solve? Only parsed on the first execution and after that only the parameters change.


Exactly. Prepared statements or stored procedures avoid the continuous re-parsing of identical queries.


If by 'randomly select' you do mean randomly, the TABLESAMPLE is what you want (although random means random pages not rows and the amount returned may be over or under)

   SELECT *
   FROM Sales.Customer TABLESAMPLE SYSTEM (10 PERCENT) ;
for example. You can of course specify by rows instead.


To get 100k random rows in T-SQL it's:

SELECT TOP (100000) * FROM tblNm ORDER BY newid()


Why 10k queries? You can probably do this with one query.


I think the issue isn't SQL, but the table paradigm for storing data. Humans do not store data in separate tables that need joining, they store data in a fully connected graph (hyper-graph). Its about relationships and hierarchies - the graph allows incredibly fast hierarchical reasoning, as most things involve hierarchical reasoning. The relational table, and Sql by correlation, are terrible at the human approach to working with data especially hierarchies.


It's the other way around.

We use tables because they help us to massively simplify the processing of data.

Do you want to deal with graphs? The algorithms involved are harder than what your intuition is telling you.

In fact, the first databases, before relational databases were in vogue, were hierarchical/network databases.

https://en.wikipedia.org/wiki/CODASYL

They were obsoleted and replaced when RDMS appeared.


You are talking past the other person... too much certainty, too few questions.

CODASYL is probably not remotely close to what the GP is talking about w.r.t. how humans think in graphs.


I work extensively with Splunk which is dominantly based on noSQL underneath (MongoDB, among other, proprietary technologies)

I've also recently been [re]introduced to graph databases (which are highly similar to the pre-relational network database paradigm)

You can simulate graph relationships with an RDBMS or noSQL - but you shouldn't

You can simulate an RDBMS with a graph db or noSQL - but you shouldn't

You can simulate noSQL with graph and RDBMS tools - but, again, you shouldn't

They all have their place - and ca even work quite well together, if you use them as they're intended


> You can simulate noSQL with graph and RDBMS tools - but, again, you shouldn't

What is the feature which makes the noSQL which you shouldn't do in a relational database? To me noSQL always looks like a subset of relational database. The only thing, maybe, is that you can truly put everything in, but with modern JSON features and all the other things I don't see a downside in using relational. (Except a little learning curve, while that can be hidden behind some ORM or something if you really want for the beginning)


The "schema on the fly" is the draw for noSQL - nothing has to be rigidly predefined like an RDBMS mandates


CREATE TABLE foo (doc JSON);

done. If that's really what you want.


For every bit of JSON that comes in?

No

And then having to figure out what all the fields are after the fact?

No

That's why noSQL exists


What is the thin noSQL brings to the table which a relational system doesn't have? (Except that writing a depth search in a JSON document in SQL is a bit more cumbersome, but if that is a concern that can be wrapped in a data access library to generate the SQL)


I think I've explained that above - it's not having to be rigidly tied to a schema :)


Not that my "schema" is really rigid ... just a little more convoluted syntax than "createCollection" but underneath quite similar.


But thats just it, having many silos for connected information is just not the way we humans do it, nor necessary. A single conceptually indexed space/time hypergraph (as we humans do it) for any part of the world is all you need, scalable up to many billions of edges for that one part of the world.


having many silos for connected information is just not the way we humans do it

What do you mean? I can think of many silos containing connected information: municipal residence records, marriage registries, birth records, police reports, tax records, medical records (for each hospital) are all silos connected by a citizen's identity.


You missed the "humans" part of my statement. We humans mentally have a fully linked graph (hypergraph) of what we know through space-time and conceptually indexed. We do not have a whole bunch of disconnected data silos in our minds. Modelling the very successful human approach to data and reality modeling may make more sense then using these 50 year old tables and disconnected silos?


There are more recent aspects of SQL that do hierarchical processing well. Graph processing is often decent depending on the particular DBMS. The syntax and conceptualization of these queries is not as well socialized though.


"Most things involve hierarchical reasoning" begs the question: what is the universe of processing you assume?

Many would counter by saying basic index lookups comprise a vast majority of DB processing time.


sql doesn't specify how data is stored, it's only a description of the data you want. how your data is connected or not at rest is up to the engine and how you leverage its engine-specific features.


Sort of. "Does not specify" does not mean there is no connection between logical and physical layers. The relational model has strong implications for what a storage engine should do well.


> The relational model has strong implications for what a storage engine should do well.

agreed, the implication is that the engine should do everything well. rdbms is a product of being pulled in every direction, there's generally an index strategy for everything including fully connected hyper graphs, and that's how they ought to be because we can write wildly complex recursive sql queries and the goal is an efficient ideally optimal execution plan


> agreed, the implication is that the engine should do everything well.

I don't think you mean "everything, equally well".

But if you mean that database vendors feel pressure to have their "RDBMS" be able to "be competitive" across a wide variety of use cases and metrics, I tend to agree.


Do you think computers should try to store data like humans? Why?

Perhaps you don't mean storage but rather a conceptual model?


My problem with SQL is that it's 99% Excel and 1% A database query language. And the boundaries between are often not well defined. Why even have a SUM function, when it is not accelerated by an index, that makes using it automatically non-scalable. I think there should be a core language, perhaps similar to SQL perhaps not, as an interface to pure DB functionality, all the other 99% could be done in Excel and while it's nice to have, it should be more clearly separated.


> Why even have a SUM function, when it is not accelerated by an index, that makes using it automatically non-scalable.

Because it’s useful and that last part isn’t true? Databases do more than indexes and you have trade offs regarding the impact of adding too many indexes.

It's also not only not a problem for scalability but in reality an important way to _improve_ scalability. Consider a simple example where we do "SELECT SUM(price) FROM orders GROUP BY customer_id". Removing SUM() from the language would massively increase the amount of data which needs to be processed into the response — which makes anything else you're doing like sorting harder – and it prevents various optimizations the database engine might make. For example, on AWS RDS Aurora the database pushes all of that down to the storage nodes so each storage node will return the sums for the records which are on that node and the main node can sum those intermediate values without needing to transfer a single source row over the network. Since SUM() is part of the language and well-defined, that's safe for one database team to implement without the risk of their results not matching a competing database.

Other things to think about are computed views and stored procedures: in both cases, there are situations where these are dramatic performance improvements or otherwise desirable and having a richer language means that those constructs can solve more problems.


But even SUM in AWS RDS Aurora is not optimal in terms of performance. You could do much better by having partial SUMs in the index btree nodes, that way you could get the sum of billions of records without processing any row individually. CouchDB supports this (even though it's by far not the most efficient database). I wish Postgres and others provided access to that sort of low level map-reduce index functionality. Transferring large amounts of data in the response is solvable by using shared memory, so I'm not concerned about that. And a SQL like language could be provided on top of that. If a low level interface was standardized you could even have SQL like languages that are portable between DBs.


> You could do much better by having partial SUMs in the index btree nodes, that way you could get the sum of billions of records without processing any row individually.

It sounds like you don’t want a general purpose database. Something like what you describe might be more efficient on certain queries but only because it’s giving up a lot of flexibility - consider what happens if you add any filtering or grouping changes and those partial sums can no longer be used.

This preference for general systems rather than optimizing for a single query shows through in the tools used for this kind of task: materialized views and indexes using expressions. Those would allow the kind of precomputed aggregates you’re talking about without the kinds of challenges around flexibility and correctness your approach would encounter.

> Transferring large amounts of data in the response is solvable by using shared memory

That’s a common optimization but it’s not a panacea any more than indexes are. For example, relying on that falls over badly once you need a network and the problems I mentioned before are still important — even if your shared memory implementation is perfectly efficient, you’re still looking at keeping data in memory longer so your client can aggregate it when the database engine could do that for you and discard the memory immediately.

Another way of looking at this is to ask why a field with billions in R&D hasn’t already done it, and especially why various big data tools have been significantly less impactful than promised. If you see what seems like an easy way to beat the current options, it’s usually a good time to ask why everyone working in that space hasn’t seen it.


> Something like what you describe might be more efficient on certain queries but only because it’s giving up a lot of flexibility - consider what happens if you add any filtering or grouping changes and those partial sums can no longer be used.

If you're doing a query that is not backed by an index, it's not scalable, so you could also do it in Excel.

> This preference for general systems rather than optimizing for a single query shows through in the tools used for this kind of task: materialized views and indexes using expressions.

Materialized views totally suck (at least in postgresql), because you cannot update them incrementally (yet). A better option most of the time are triggers and aggregate columns, but it requires a lot of micromanagement.

> It sounds like you don’t want a general purpose database

I don't think I want a special feature, map-reduce indexes seem pretty general "mathematically" sound thing.

> Another way of looking at this is to ask why a field with billions in R&D hasn’t already done it, and especially why various big data tools have been significantly less impactful than promised.

Well that's what's exiting about research, that some things haven't been done yet. Even though they may seem fairly obvious.


Perhaps excel is 99% SQL and not the other way around, especially since SQL is older.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: