A Critique of SQL, 40 Years Later (carlineng.com)
187 points by mariuz on Aug 24, 2022 | 253 comments



The only thing that affects my daily life with SQL is that FROM should be before the SELECT keyword. This would _greatly_ improve type-ahead support in SQL IDEs.

Nothing is perfect, but that is really my main beef. Another commenter already nailed it: a LIMIT WITH ERROR clause that could be specified on UPDATE and DELETE statements and would explicitly throw an error otherwise.

SQL is one of my favorite tools to use and I don't see it getting replaced by anything any time soon.


I started my career in the 90s writing ROLAP engines, and even though I've spent most of my time since doing "web" development, I still seem to end up having to build engines that generate SQL queries that are dozens of lines long.

The complaints about SQL composability are real. The grammar is fundamentally pretty irregular. Acceptable for humans to type ad-hoc, crappy for computers to generate.

You can like what SQL does for you (I do!) but it's also easy to imagine something a bit better. I thought the article was spot-on. I hope some future SQL x.0 will fix these issues, but also be similar enough to present-day SQL that I don't have to learn a whole new language from scratch.


> I hope some future SQL x.0 will fix these issues, but also be similar enough to present-day SQL that I don't have to learn a whole new language from scratch.

If SQL is to Javascript then something akin to WASM would be nice. Lower level primitives that allow query systems to be built. SQL could be one such system, but not limited to.


Every time SQL is mentioned on HN someone comes to complain about FROM coming after SELECT. I use SQL every day and not a single time have I found reason to complain about it. Can you give a bit more detail about what's wrong with it being the way it is?

EDIT: thanks all for your replies. I now understand that it is an IDE-related thing, not something fundamental to the language.


If you're exploring a set of tables you have never touched before, it would be really neat if you could just type:

FROM tablename t SELECT t.<press tab>

and some form of autocomplete mechanism, either prefills all the column names from table "t" or suggests the list of columns and/or types associated with it.

This is much better than having to:

1. Run a SELECT with a LIMIT clause just to get an idea of the layout.

2. Point and click through the IDE tree view.

Honestly, I don't think it helps a whole lot beyond this functionality, but I can see why folks who are accustomed to thinking in functional pipelines (from -> select -> map -> filter -> collect) can prefer this way of querying.

I think PRQL is one attempt at building something this way.[1]

[1]. https://github.com/prql/prql


In SSMS, and almost certainly many other SQL editors, IntelliSense works on the select list as soon as you have a working FROM clause; the only caveat is that you can't autocomplete columns when writing a statement purely left to right.

But I think that’s not a big deal at all, and the SQL approach has a certain advantage in putting the type of action (select/update/delete/drop etc) front and centre, which is really quite helpful.


SELECT FROM widgets VALUES id, factory_origin_id, frobbable_knobs


DESCRIBE TABLE is a command that pretty much does exactly this (explain what a table contains) and it's a part of MySQL.

If you use PostgreSQL then you can use \d instead.

I'm sure the other RDBMSes have their own equivalent (except for maybe SQLite but if you're using SQLite outside of toy environments or hobby projects, you're doing it wrong).


> except for maybe SQLite

.schema

> if you're using SQLite outside of toy environments or hobby projects, you're doing it wrong

That's an uninformed statement. SQLite is extremely solid production quality code. Of course it's not the universally applicable database solution, nothing is. Sometimes you need Cassandra or its kind, often MySQL|Postgres, other times SQLite is the correct answer.


It's kind of funny to see a claim that the most widely deployed database in the world is useful only for toy projects.

https://www.sqlite.org/mostdeployed.html


I guess there are more toy projects than serious ones.


Yeah, I mean MacOS, Android, iOS, Chrome, Firefox ... all typical toy projects.


sheesh, let's hope there are no vulnerabilities in there


SQLite has 608 times more lines of test code than source code![0] I’d wager it’s the highest test-to-production ratio there is.

[0]: https://www.sqlite.org/testing.html


Most of those tests are generated. It's wrong to focus on this metric IMO.


Auto generated or not, the point being made, in regards to @bot41's comment, was that SQLite is thoroughly tested.


It's well tested, but that's still not a guarantee that there are no security issues. Latest acknowledged CVE bug is a month old: https://www.sqlite.org/cves.html


> if you're using SQLite outside of toy environments or hobby projects, you're doing it wrong

https://www.sqlite.org/mostdeployed.html

> Every Android device

> Every iPhone and iOS device

> Every Mac

> Every Windows 10 machine

> Every Firefox, Chrome, and Safari web browser

> Every instance of Skype

> Every instance of iTunes

> Every Dropbox client

> Every TurboTax and QuickBooks

> PHP and Python

> Most television sets and set-top cable boxes

> Most automotive multimedia systems

> Countless millions of other applications

Look at all those toys.


everyone is doing it wrong :(


Not necessarily.

If you ponder the importance of proper (robust, reliable, dependable) data management for data that keeps nuclear plants going, for pharmaceutical research data, for anything happening on the financial markets, for medical records, for data concerning payroll and the like, etc. etc., then you might appreciate that all the stuff mentioned in the list is indeed really "just toys".


You're missing the point: until you have typed the FROM TableName t in your IDE, SELECT t.<press tab> cannot do autocompletion. That has nothing to do with whether SQL supports querying metadata (the ANSI-portable way would be through INFORMATION_SCHEMA, btw, not DESCRIBE). It's a consequence of the brain not responding to think-ahead queries by the IDE.
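For reference, the portable metadata lookup looks something like this (table name hypothetical):

  SELECT column_name, data_type
  FROM information_schema.columns
  WHERE table_name = 'tablename';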


This may be true, but you've missed the point that this is about aiding autocompletion while you're writing the query the first time.


Couldn't the IDE understand

FROM table SELECT

And when you're done, switch the code to be correct?


Every RDBMS has this, including SQLite.


in sqlite3 it is .schema


SQLite absolutely has production level applications, it's much more than a toy.

https://www.sqlite.org/whentouse.html


Everyone else is mentioning it from an IDE perspective, but let's also think about it logically from a language perspective. When you start a FROM clause and add some JOINs, a few WHERE conditions and maybe a GROUP BY, you are building a virtual view of a series of tables, columns, and aggregations. You could even define this data set as an ephemeral table. What you do with that data set afterwards might vary depending on the need, but the data set might not change. Depending on the application, you might select different columns from the data set. We do this naturally using a WITH clause at the beginning of a query.

WITH (combine a whole bunch of stuff) as dataset SELECT a, b, c FROM dataset

This approach just says:

FROM tables... WHERE ... SELECT a, b, c

To me, it does make a lot of sense. This is also the paradigm that some of the graph databases use.
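For reference, the standard CTE spelling of that first pattern looks like this (a sketch with hypothetical table and column names):

  WITH dataset AS (
    SELECT t1.a, t1.b, SUM(t2.x) AS c
    FROM t1
    JOIN t2 ON t2.t1_id = t1.id
    WHERE t1.active = 1
    GROUP BY t1.a, t1.b
  )
  SELECT a, b, c FROM dataset;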


> You could even define this data set as an ephemeral table

Exactly! This is where SQL hurts me the most: not being able to store (partial) query expressions in variables for later reuse. The only way to do this is by creating explicit views (requires DDL permissions) or executing the partial query into a temporary table (which is woefully inefficient for obvious reasons).


Isn't that what a common table expression is? Basically a pseudo temp-table to break down queries. Of course, they also allow recursion, which you can't do with a temp table.


Yes, but a common table expression is still bound to only one query. You can use it multiple times within the same query, but you still can't save a common table expression in a variable and re-use it in multiple queries.

This is what I'd want to do if common table expressions really were common:

  SELECT c1, c2
  FROM DifficultJoinStructure
  AS myCte;

  WITH myCte
  SELECT c1, c2
  WHERE SomeCondition(c1);

  WITH myCte
  SELECT c1, c2
  WHERE SomeCondition(c2);


So.. a view then?
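A sketch of that route, reusing the names from the example upthread:

  CREATE VIEW myCte AS
  SELECT c1, c2
  FROM DifficultJoinStructure;

  SELECT c1, c2 FROM myCte WHERE SomeCondition(c1);

  SELECT c1, c2 FROM myCte WHERE SomeCondition(c2);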

Granted most environments effectively treat views as DBA/Sysadmin owned objects, especially where end users/apps are effectively sharing one, or a small number, of user accounts.

But given the user=schema aspect of several of the traditional databases, I get the impression the original intent might have been a little more laissez-faire?

Of course the same can be said for tables, and that was perhaps a little idealistic!

Views aren't always quite as composable as you'd like either, or maybe I'm just scarred by the particular DB engines I use most.

So I actually agree with you, but unfortunately SQL requires that the "WELL AKSHWELLY" be followed by one or more "BUT" clauses.


Agreed, it is not just an "IDE related thing", it is also a logical arrangement of thoughts thing, a readability thing.


And as another commenter noted, the underlying relational algebra is also not in agreement, so it is definitely not logical. I believe they wanted to mimic human language statements, but that goal hurt more than it helped.


It makes way more sense honestly. It also looks a lot like a functional pipeline if you set it up that way.


> I now understand that it is an IDE-related thing, not something fundamental to the language.

No, it is a fundamental issue with the language!

The relational model is clear. You START with a relation and then compose with relational operators that return relations.

ie:

    rel | project
SQL does it weird. It's like OO where, instead of defining a class THEN defining its properties, you define the properties THEN define the class.

And this fundamental issue with the language goes deeper. The rules are ad-hoc for each sub-operator, despite the fact that using the relational model should make them simple to compose.

So you have rules for HAVING, GROUP BY, ORDER BY, WHERE, SELECT and so on, and none are like the others; they differ in small but annoying ways...


Having SELECT come first makes sense to me because it's the only part of the statement that's required. FROM and everything else is optional.

Also, when reading a statement, you're mostly interested in what the returned fields are rather than details like where they came from or how they're ordered. It kind of makes sense to put it at the start.

Maybe other syntax forms have their benefits, especially when writing, but I don't think SQL's choice is completely senseless either.


> Having SELECT come first makes sense to me because it's the only part of the statement that's required.

Only *IN SQL*.

You don't need it in the relational model; heck, not even in other paradigms:

    1
That's it! (aka: SELECT 1)

So this:

    SELECT * FROM foo
is because SQL is made weird. More correctly, this should just be:

    foo
Also, SELECT is not required all the time; you wanna do:

    foo WHERE .id = 1
    foo ORDER BY .id
    foo ORDER BY .id WHERE .id = 1 //Note this is not valid in SQL, but because SQL is wrong!
But you probably think this is weird, because SQL's peculiar implementation is OK for one-off, ad-hoc queries, and in THAT case having the list of fields first is not that bad.

But now, when you see it this way, you notice how MUCH nicer and simpler it could have been, because then each "fragment" of a SQL query could become *composable*.

But not in SQL, where the only "composition" is string concatenation, which is as bad as you can get.


SELECT itself should be optional. Languages with expressions are fairly intuitive, e.g. "int x = foo.bar;" where "foo.bar" is equivalent to the "SELECT bar FROM foo;" SQL statement. I don't breathe SQL every day, so I'm struggling to come up with a case where removing SELECT results in parsing ambiguity.


> I'm struggling to come up with a case where removing SELECT results in parsing ambiguity

It's actually useful to the person reading the code. It clearly defines where a statement starts, what it does and makes reading a query close to reading English. Show a SELECT FROM WHERE query to someone who does not know SQL and the person will understand it. It might be a bit harder if you remove the SELECT.


> It clearly defines where a statement starts

And this is another example of the ad-hoc problems of SQL: query terminators (;) are optional. If they weren't, there would be no ambiguity about where a statement starts: it's the first word after the previous terminator.


> I'm struggling to come up with a case where removing SELECT results in parsing ambiguity.

Oh! It's super-ambiguous! Try writing the parser and enjoy it!

Let's make this more concrete:

     city SELECT id
     city ORDER id
     city id <-- Order or select???
You could then "favor" projection as more important than the others. OK, so:

     city city city
Which is the table, and which is the field?


I didn't suggest removing ORDER or FROM. This still makes sense:

    id FROM city ORDER BY id
A plus is that sub-selects have more natural, expression-like syntax:

    id FROM (city WHERE elevation > 1000)


How do you know if you are SELECTing or DELETEing?


You mean the keyword; I meant the clause. "FROM foo" is optional to the syntax.


Select doesn't mean that tho. Not in the relational model. It refers to the filtering clause.

Following their stupid English syntax, but using the proper verbs it should rightfully be "PROJECT x, y FROM foo SELECT WHERE a = b"


Having learned Prolog before SQL, it was weird when it clicked that both were relational languages, but SQL decided to hide that underneath a natural language facade and the inconsistencies that come with it.


Yeah. And SQL does it very weird because it mixes up the meaning of "selection". Selection doesn't mean what most people seem to think it means. In the relational algebra "select" is really what is happening in the "where" clause. Projection is the name for the thing that chooses the attributes ("columns") to put in the final result.


Imagine a table `Foo` with the schema

`FooId, name, date, favorite_color, active`

Now, you want to pull the ID for the latest `foo` for a specific date, but you don't know any of the column names.

The modern workflow looks like this

You write

`SELECT * FROM Foo`

then you say, "Ok, now I can get autocomplete"

`SELECT FooId FROM Foo`

"Ok, now I can write the where clause"

`Select FooId From Foo WHERE date=?`

It becomes an exercise in moving the cursor around just to get the autocomplete going.

If you are really familiar with the schema, not a problem. But if you just remember a few details about it, then you are stuck in this weird back and forth cursor moving thing.

That's why it'd be more ergonomic to have something like

`From Foo Select FooId where date=?`

Because you never need to move your cursor and you could get all the autocomplete you need at the right times.

This becomes especially true when writing joining statements

    FROM Foo f
    JOIN Bar b ON f.FooId = b.FooId
    SELECT f.FooId
    WHERE b.active = 1


Like the commenter alluded to, it allows accurately constrained type-ahead. If a query starts with "SELECT FROM my_table " and expects one or more column names at that point, your IDE can already suggest the column names from my_table (and not from any other table).


As sibling comments have mentioned, this is a valid complaint. It is useful to remember that SQL is already an old language, and plenty of warts have been observed over all that time. In the same way C is old, and there are things to be appreciated about fresh attempts like Zig, Rust, Swift, Nim.


you seem to be too used to it to notice...

you start to type SELECT some, columns and can't get autocomplete until you add the FROM afterwards, so you either type the query inside out (SELECT FROM table, then go back to just after SELECT) or just give up.


> thing not something fundamental to the language.

It is fundamental to the language. The evaluation order is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT.

Everything in perfect order is, select except.
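Sketched as an annotated query, with clauses numbered in evaluation order (table and column names hypothetical; LIMIT syntax varies by engine):

  SELECT dept, COUNT(*) AS n   -- 5
  FROM employees               -- 1
  WHERE active = 1             -- 2
  GROUP BY dept                -- 3
  HAVING COUNT(*) > 10         -- 4
  ORDER BY n DESC              -- 6
  LIMIT 5;                     -- 7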


I think this stems from a misunderstanding of the entire point of SQL. It isn't about looking at data in an individual table, it's about retrieving a Result Set for complex queries.

All the FROM-first examples I've ever seen are almost universally the simplest query in the world where autocomplete is not a large hurdle anyways because you aren't even bothering with table aliasing. As soon as you do anything even moderately complex (multiple joins, subqueries, calculations, etc.) the advantage of putting FROM first vanishes, and if you're adding a table alias or hard reference to the table you can already see what is in the SELECT list AND in many cases get autocomplete.


> I think this stems from a misunderstanding of the entire point of SQL. It isn't about looking at data in an individual table, it's about retrieving a Result Set for complex queries.

No misunderstanding at all. It’s just more natural to first describe what the sources of the data are, joins etc, and then afterwards which columns you’d like to retrieve, or what calculations you’d like to perform etc.

The current way of having select first is just plain dumb. When it comes to reading order, you never actually know what is selected until you’ve read through the from, where, group by and having clauses anyway, so you constantly have to jump back and forth between start and end to see the context. And having from first also better matches the SQL evaluation order, where select comes after all of these.

Select also just fits better with order by and limit, since it’s also about restricting results after you’ve gathered them all up and stitched them together.


> When it comes to reading order, you never actually know what is selected until you’ve read through the from, where, group by and having clauses anyway, so you constantly have to jump back and forth between start and end to see the context.

Let's assume this was accurate, and reframe it to discuss putting FROM first in the query:

    When it comes to reading order, you never actually know where what you've selected is coming from until you've read through the WHERE, GROUP BY, HAVING, and SELECT clauses anyways, so you constantly have to jump back and forth between start and end to see the context.
This is equally true if FROM comes first, since you have no idea whether customer_name in your SELECT all the way at the end actually comes from a table directly, a subselect, a calculation, etc.

I would be interested in seeing what a complex WITH or subselect would look like if you're putting FROM first, and just how much actual clarity you're getting out of reordering it.

I think the benefit is nonexistent and the harm is that you're going to further fragment an already fragmented syntax space for that nonexistent benefit.


> This is equally true if FROM comes first since you have no idea if customer_name in your SELECT way at the end actually comes from a table directly, a subselect, calculation, etc.

When you get to the point where you see customer_name, you know what it came from because of everything you read prior to that.

Imagine going into a shop to order three burgers and saying “the first one is without onions, the second is without onions and the third without pickles, I’d like a Big Mac, a quarter pounder and a hamburger please.” It’s just not the natural reading order.

> I would be interested in seeing what a complex WITH or subselect would look like if you're putting FROM first

The premise for this seems to be that moving SELECT to the start of the query, even though this isn’t the natural evaluation order nor reading order, somehow helps to increase readability. But I don’t believe there’s any reason to assume that. Just because it’s how things currently are and how you learned it doesn’t mean it’s better; it just means you’ve grown accustomed to it and therefore feel it’s simpler. But everyone learning SQL from scratch who isn’t explicitly taught that SELECT isn’t applied in the order it’s written ends up making the same silly mistakes and being confused by the same things that would be obvious if we just put it where it actually belongs.


I feel this also follows the concept of narrowing down from a large set to a smaller set. Using FROM first is the large set and then you're selecting from it.


Joins would presumably still be in the from-clause, so you get:

   FROM a JOIN b JOIN c
   SELECT b._
Here autocomplete helps also with the joined tables. Autocomplete would also help in subqueries:

  FROM a JOIN (FROM b WHERE b._


That's a fair point. This comment made me realize my own tendency to write "select * from..." so that I could supply the "from" before going back and replacing "*" with specific columns.



Wow, that looks incredible. It looks like you can pretty much refine your query line by line and see the intermediate results to know what you're working with.


Love PRQL :D


Several decades ago while using Oracle and SQL*Plus, I used to generate the column list for queries from the data dictionary table (ALL_TAB_COLUMNS). Once I learnt that trick, I never typed the column list ever again. Eventually, I had a library of queries for common tasks. You could use that trick with almost all database engines.

IDEs were never a favorite (they were quite limited then).

Today, the situation is a lot better.


Configure your IDE to help you write SELECT queries. It should be easy to insert a template with jump points in the desired order; e.g., a generic YASnippet in Emacs should be able to do it.

Thus by the time you are writing the column expression after the SELECT, the FROM table expression would be filled already (enabling scope-dependent column completion).


FWIW in LINQ, Select comes after From and Where.


Yeah, Select comes last, and it makes the most sense.


Also in Kusto Query Language (KQL), which is used extensively in Azure.


I don't know why my memory jogged to the old ruby on rails screencasts but I suppose there's nothing stopping IDEs from jumping to the FROM section of a query and then returning you back to the SELECT in some sort of "macro"/snippet. It's a hack, I guess.


That’s exactly what JetBrains’ DataGrip does.


From what I've seen, it will initially populate autocomplete column name values from all the introspected tables on the current schema search_path, even before the FROM clause is added. Usually this is good enough for me, so I don't think about it. I mostly interact with postgres.


Just a few years ago JavaScript was pretty awful, so people started writing transpilers (most famously Babel) to add features like arrow functions, async/await, decorators, constants, etc. TypeScript is another transpiler that greatly improves the language but outputs regular old, awful JavaScript. Many of these features even ended up being built into the language.

I wonder if it’s time for a transpiler renaissance in SQL? It would be trivial to convert “select id from users” into “from users select id”. I’m sure there’s a lot of other cool features/sugar you could add while still outputting valid SQL.


Maybe we should treat SQL like JavaScript, and use it as a compiler target instead of coding in it directly. /s


Not sure why you added /s; SQL is in many ways exactly like JavaScript: the only way to run your code where it needs to run, but with some things to be desired language-wise.

I wish more languages compiling to SQL were more common.

(Not ORMs though they just miss the point entirely..)


I mean, that's pretty much what ORMs do, right? Hasura et al too.


Yes, this is true. Although my experience with ORMs is that there is always a reason to use some kind of escape hatch to run raw SQL directly.


Agreed. After spending the better part of a decade watching Hibernate a) fail to scale or b) get confused beyond a third level of association, I find myself writing bare SQL and/or refactoring to implement the repository model. ORMs are great for prototyping or if your data model is simple; beyond that you're better off without them.


Most higher level regular programming languages will allow you to escape to C, and C allows you to escape to assembly, so that's not unusual.

In my experience when you need to escape to raw SQL there is still some kind of query builder employed, which too becomes another "compile to SQL" source.


I know you're being sarcastic, but yes!


Agreed.

I would also love to be able to specify a WHERE just after the FROM but before the JOINs, and also after them:

- the first one to filter the rows before joining

- the second one to filter after joining
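Today you can approximate the first of those with a derived table; a sketch with hypothetical names:

  SELECT f.id, b.x
  FROM (SELECT * FROM foo WHERE foo.active = 1) AS f  -- filter before joining
  JOIN bar b ON b.foo_id = f.id
  WHERE b.x > 0;                                      -- filter after joining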


You could improve type ahead by just pouring effort into the IDE. Especially with the amount of resources we have nowadays. Easy enough to have basic typeahead on basically all possible columns when writing the select, and then you could use the columns as a filter on the tables during auto complete.

In general, this isn't done. But I don't see any technical reason it can't be done.


This strikes me as solving the wrong problem. It's not as useful to be able to derive a table from selected columns -- the desire is to complete columns from a table. It's both more intuitive and helps the user more than once.


I disagree. Folks often know what column they want, and have to find the table that best gives it.


If you have multiple tables, how do you avoid suggesting a wrong column, before filling out the table name?

You could suggest all columns up front, and then afterwards tell the user “this suggestion doesn’t exist”, which would just erode trust in the autocomplete.


Start typing “SEL”, the IDE suggests “SELECT _ FROM”, you press enter and the above text is entered, while the cursor is placed after FROM. You write that part of the query, and after pressing enter it will jump back to the SELECT part.

This is done already by JetBrains’ DataGrip, for example.


Suggest all. And with modern UIs, you can hint the table with the column.


If you don’t wrap your UPDATEs in a transaction you are doing it wrong.
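A minimal sketch of that habit (table and values hypothetical; transaction syntax varies between BEGIN, BEGIN TRANSACTION, and START TRANSACTION):

  BEGIN;

  UPDATE accounts SET balance = balance - 100 WHERE id = 42;

  -- inspect the affected row count, re-query to verify, then:
  COMMIT;  -- or ROLLBACK; if it touched more than you expected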


You can just put SELECT * FROM stuff and then edit your columns later.


We have transactions to prevent mishaps, no?


"Despite the improvements listed above, in a 2014 interview, CJ Date said “we didn’t realize how truly awful SQL was or would turn out to be (note that it’s much worse now than it was then, though it was pretty bad right from the outset).” This quote leaves me wondering – if Date himself were to write an updated critique, what would it look like? My best guess is most of his criticism would revolve around further departures of SQL from the relational model, but specific examples escape me."

I am not sure how the author has missed the copious amount of material that Date (& Darwen) wrote on this topic? The obvious one being the Third Manifesto (the whole book is available here [1] since it's out of print now). But even "Database in Depth" [2] has extensive discussion of this despite being a less "opinionated" and "ranty" book (highly recommended BTW)

Yes, not explicitly updated since 2014, but there's really nothing new to say since then?

Date is definitely concerned about the departures from the relational model in SQL. But it's (EDIT) not just about purism. He's also in general pretty peeved at SQL over concerns similar to those the author presents here about syntax and expressiveness. The Third Manifesto's "Tutorial D" mandates the addition of a richer type system, including higher-level types that match on relation constraints, and "operators" (similar to OO methods) on those types; basically a kind of predicate dispatch.

I imagine he's pretty sick of being ignored.

[1] https://www.dcs.warwick.ac.uk/~hugh/TTM/DTATRM.pdf

[2] https://www.oreilly.com/library/view/database-in-depth/05961...


I agree with everything you're saying, though I'm assuming here:

> Date is definitely concerned about the departures from the relational model in SQL. But it's just about purism.

Perhaps you meant to say "But it's not just about purism"?


Oops, yes, thanks for the catch. Editing.


In fact, Date has recently revised a whole slew of "old" papers of his, got them up-to-date wrt his current thinking, and (re-) published them. His "A critique of SQL/86" paper(s) are now chapter 8, 9 & 10 of "Database Dreaming, Vol I" (there's a Vol II too).


I've written a bajillion queries and have tons of nitpicks, but it's the twin meanings of NULL that really kill me.

NULL can be the value of a field in a record, but it is also used to indicate the lack of a record in a JOIN.

If I run:

  SELECT x.a, y.b FROM x LEFT JOIN y on x.a = y.a
and I get back

  [5, NULL]
I have no way of knowing if that means there's a record [5, NULL] in table y, or if there's no record in table y that starts with 5.

Obviously there are lots of workarounds, but to me this is a truly fundamental design flaw. SQL should have specified another NULL-like value, call it EMPTY, used only for joins that find no data. And analogous to IS NULL, it would be checked using IS EMPTY.


> but it's the twin meanings of NULL that really kills me

NULL only has one meaning: NULL. This is roughly analogous to unknown.

The one that hits a lot of people is WHERE <value> NOT IN (<set>) where <set> contains a NULL. Because NOT IN unrolls to “<value> <> <s1> AND <value> <> <s2> AND … AND <value> <> <sx>”, any NULL values in the set make one predicate NULL, which makes the whole expression NULL even if one or more of the other values match.
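A minimal illustration of the trap (table names hypothetical):

  -- If s contains a NULL: a matching value makes the predicate FALSE,
  -- and a non-matching value makes it NULL (val <> NULL is NULL),
  -- so either way no row qualifies and the query returns nothing.
  SELECT t.val
  FROM t
  WHERE t.val NOT IN (SELECT s.val FROM s);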

> I have no way of knowing if that means there's a record [5, NULL] in table y, or if there's no record in table y that starts with 5.

Not directly, but you can infer the difference by going off the value of (in your example) y.a: if it is NULL then there was no match; otherwise a NULL for y.b is a NULL from the source, not an indication of no match.
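A sketch of that inference, assuming y.a is non-nullable in the source table:

  SELECT x.a,
         CASE WHEN y.a IS NULL THEN 'no matching row in y'
              WHEN y.b IS NULL THEN 'row exists, b is NULL'
              ELSE 'row exists, b has a value'
         END AS match_status,
         y.b
  FROM x
  LEFT JOIN y ON x.a = y.a;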

> SQL should have specified another NULL-like value

This sort of thing causes problems of its own. Are the unknowns equivalent? Where are they relevant? How do they affect each other? Do you need more to cover other edge cases? I have memories of VB6's four constants of the apocalypse (null, empty, missing, nothing).

This is one of the reasons some purists argue against NULL existing in SQL at all, rather than needing a family of NULL-a-likes.


> NULL only has one meaning: NULL. This is roughly analogous to unknown.

Which is what I'm arguing against, because it's used in two unrelated ways in SQL: as a data value, and to express that no matching row was found in a JOIN. So no matter how it's defined formally, in practice it has two meanings that have nothing whatsoever to do with each other. One is a value and can be stored; the other says 'not found' and results only from expressions and isn't for storage.

> but you can infer the difference

Yes, as I said there are lots of workarounds. But they're still workarounds.

> This is one of the reasons some purists argue against NULL existing in SQL at all

Sure, but the reality is that NULL is used in a practical sense to mean "no data entered". It's ridiculously common for only some columns to have data for a particular row, so e.g. you need the 'time_finished' column to either have a valid date, or be NULL if the activity hasn't finished yet. The alternative is to have an additional boolean column 'is_finished' and for 'time_finished' to be arbitrarily '1970-01-01T00:00:00Z' whenever is_finished is false, which is clunky and redundant.
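A minimal sketch of that pattern (hypothetical table):

  CREATE TABLE activity (
    id            INTEGER PRIMARY KEY,
    time_started  TIMESTAMP NOT NULL,
    time_finished TIMESTAMP NULL          -- NULL = not finished yet
  );

  SELECT id FROM activity WHERE time_finished IS NULL;  -- still running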

Purists can argue what they want, but NULL is so ridiculously useful as a stored data value it doesn't really matter.


> you need the 'time_finished' column to either have a valid date, or it's NULL if the activity hasn't finished yet

Rust’s Diesel has an interesting way of dealing with this. “None” means missing while “Some(None)” means the value null. So when updating a record, “None” makes no change while “Some(None)” sets the value to null.

This distinction is obviously lost when retrieving from the database, but it’s an interesting concept.


Javascript having both `null` and `undefined` doesn't seem so crazy now


But in classic Javascript style, it's half-assed. Being able to assign anything to undefined makes the distinction between null and undefined pointless.


If you have a business need to represent "empty" or "n/a" or "declined to answer" or something like that, use a specific value for that. NULL does not mean anything. Or, it means nothing. It's just NULL. Once I got that into my head, SQL became less frustrating.


You can't without defining yet another field, usually. It's really annoying to double the number of columns so that [time_started, time_finished] becomes [time_started_exists, time_started, time_finished_exists, time_finished].

NULL values makes business logic far more compact and intuitive. For enumerated values in fields it's easy enough to define another value in the same field to mean 'unknown' or 'not entered', but you can't do that for strings, numbers, datetimes, etc. -- you have to throw in a bunch of unwieldy additional boolean fields instead.


The underlying problem here is that SQL lacks Sum Types (aka Tagged Unions). Such types solve all these problems effortlessly.

In contrast to what SQL has, Sum Types combined with Product Types (which is basically what a row is) are actually a universal way to model all possible data[0]. (Of course you may want syntax sugar, etc. on top of that, but Sums and Products at the bottom is sufficient.)

[0] I'm actually not sure if I should qualify that -- I believe Sums+Products can actually model anything, assuming you allow recursive type definitions -- which might be hard to make perform well. Storing a linked list in a database field, e.g. might not be the best idea.
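For illustration, one common way to emulate a small sum type in today's SQL is a tag column plus CHECK constraints (a PostgreSQL-flavored sketch with hypothetical names):

  CREATE TABLE contact (
    id    INTEGER PRIMARY KEY,
    kind  TEXT NOT NULL CHECK (kind IN ('email', 'phone')),
    email TEXT,
    phone TEXT,
    -- exactly the field named by the tag must be present
    CHECK ((kind = 'email') = (email IS NOT NULL)),
    CHECK ((kind = 'phone') = (phone IS NOT NULL))
  );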


Indeed, the issue of non-existence is a somewhat tricky philosophical question (e.g. there are many ways something may not exist).


A better option in many cases is to check the primary key.

e.g.

  SELECT questions.Id, questions.Text, answers.Id, answers.Text
  FROM questions
  LEFT JOIN answers ON answers.QuestionId = questions.Id

answers.Id is non-nullable as a primary key, so if answers.Text is null but answers.Id is not null, they've declined to answer.


That may be implementation or circumstance specific. For instance in MS SQL Server with a heap table (one without a clustered index) or a table where the primary key is not the clustering key, it will result in extra page reads to check the other field's value (the query planner / engine could infer from it being the PK that it can never be null, so the lookup to check is unnecessary, but IIRC it does not do this). As the columns used in the join predicate have to be read to perform the join, no extra reads will result from using them for other filtering.

In your example it is very likely that the primary key is the clustering key, so will be present in the non-clustered index that I assume will be on answers.questionId, making my point moot, but if for some unusual reason neither Id nor questionId were the clustering key checking Id may result in extra reads being needed.

In DBMSs without clustering keys implemented similarly to SQL Server, there may be such concerns in all cases.


> This is one of the reasons some purists argue against NULL existing in SQL at all, rather than needing a family of NULL-a-likes.

So then are there no optional values? What happens when OUTER JOINs don't match?


I assume NULL would still exist there, but there would be no explicit NULL values permitted in tuples (rows) that actually exist.

Or perhaps the purists would remove outer joins too; I haven't read around this discussion since university, but given that alternate syntax to do the same thing can sometimes be convoluted, that might be a bad idea itself.


In the relational model, no there is no such thing as "optional values".

Even Codd often hinted (quite a bit of material in the 1990 book gives such hints) that "if a row has a null somewhere, it SHOULDN'T BE CONSIDERED AS BELONGING TO THE RELATION". E.g. if an attribute of a FK is null, no FK checking should be done at all on account of that null. The FK checking should be done only when the FK value is "complete".

As for outer join, one problem with it is that its result is (intended to be) the extension of a disjunctive predicate, the disjuncts in question being exactly the predicate for the "matching" case and the predicate for the "no match" case. Now if you see a row in the result, how can you tell from that row which of the two disjuncts made that row appear in the result?

Outer join is problematic by definition precisely because of the possibly ambiguous interpretation that derives from its disjunctive predicate.

(Would you do that with base tables ??? I mean design a base table that can hold a row for strictly more than one possible reason ??? I mean where the user then sees a row in that table and subsequently cannot tell whether that row is there for reason/meaning 1 or for reason/meaning 2 ??? I'll tell what you -and anyone else- would do in such a case : you'd also add the indicator telling whether it's meaning 1 or 2 (and then the problem of 'what if it's both' is also solved). Now think back about what could be done in outer join to resolve that very same kind of disjunctive ambiguity.)


You don’t do outer joins.


Don’t these exist for a reason?


Yes. The reason is lazy people expect to be able to ask five questions in one go and expect one single answer to answer all five of them.

The key is to realize that submitting a query is to ask a question. So how many queries do you think you have to issue if the number of questions you have is five?

It is technically perfectly possible to "ask five questions at the same time" to any given truly relational DBMS. Said DBMS will give five answers at the same time (and not leave you to guess which answer corresponds to which question). That SQL doesn't do it, does not mean it's impossible.


Of course.

You've got a list of countries and are pulling each country's national flower, national bird, largest port city etc.

Without outer joins, Liechtenstein with no ports doesn't show in the list at all. Sad news for people who want to know all countries, or Liechtenstein's national bird (eagle).


You’re not obliged to pull everything in one request. You can issue several requests.


Sure. You're also not obligated to include a WHERE-clause in the query you send to your database. You can do the filtering in the application.


What's your point? You can live without nulls; they're not required, and you don't need to avoid WHERE clauses for that. It's a theoretical concept, nobody in their right mind would do that, but nulls are not required for relational algebra.


You started this subthread by asserting that one shouldn't use outer joins, and when other people pointed out valid use cases for outer joins, you advocated to use application logic to work around your refusal to use them.

My response was just to point out the absurdity of your premise.


"NULL only has one meaning: NULL. This is roughly analogous to unknown."

From the re-published version of "Much ado about nothing, part 2" (re-published in "Database Dreaming, Vol II) :

"An outer join produces nulls because we ask the DBMS to show us data from one table even when no matching data can be found in another table. Such nulls have no meaning, no semantic content."

Personally I'd add to that : and even if it's the case that they do, that meaning is patently not "simply unknown" but rather a very clear indication of non-existence.

But I have no doubt the apologists will obviously always apologize no matter what and handwave arguments such as these away with "So what ? What's the difference." Well, it's the difference between knowing of non-existence and not knowing at all.


"The one that his a lot of people is WHERE <value> NOT IN (<set>) where <set> contains a NULL. Because NOT IN unrolls to “<value> <> <s1> AND <value> <> <s2> AND … AND <value> <> <sx>” any NULL values in the set makes one predicate NULL which makes the whole expression NULL even if one or more of the other values match."

Sorry, but this is FALSE.

"Even if one of the other values match" will make that particular INequality test FALSE and one FALSE conjunct will make the entire conjunction FALSE, even in 3VL.


NULL itself is open to interpretation.

If I CREATE TABLE foo (bar CHAR(1) PRIMARY KEY, baz char(1) UNIQUE), then different things happen on different databases.

In Microsoft SQL server, only one insert of a null into the baz column is allowed, and the null value is indexed.

In Oracle, null is never indexed in this context, so any number of null insertions into baz are allowed.

On a composite index, I believe that nulls are always indexed, on any database (they must be).

As far as I understand it, this is implementation-defined:

  SQL> select * from dual where null=null;

  no rows selected


WHERE NULL=NULL should never return true, so no rows returned is correct. It is only implementation-defined in that some implementations get this wrong![1][3]

SQL Server's handling of NULLs in unique indexes/constraints is not standard. You can get closer to standard behaviour using a filtered unique index (“CREATE UNIQUE INDEX more_compliant_unique_index ON dbo.some_table(this_column) WHERE this_column IS NOT NULL;”), but be aware that filtered indexes have their own collection of gotchas [4].

----

[1] MS Access IIRC, MS SQL Server if the ANSI_NULLS option is set to off, MS [2] SSIS's default (consider NULLs equal) behaviour for merge steps, and so on.

[2] You may be noticing a pattern here… MS are not the only culprit though.

[3] I'm making a distinction here between places where behaviour is not defined in the standards, so implementations are free to go their own way, and where implementations go against what is expected by the standards.

[4] particularly with respect to the collection of unfixed/wontfix gnarly edge cases that is the MERGE statement.


A nullable column can always be extracted to a separate table with a non-nullable column and a foreign key. If you left-join this back with the base table, you will get the nulls again for the missing values. So to me it seem nicely symmetrical to use the same kind of NULL value.


I'd check to see if y.a IS NULL in that situation. I'm sure there are cases where it matters, but most of the time for me the difference between "there's a row in y, but the value is NULL" and "there's no row in y, the value is NULL" is irrelevant. I can't think of a time when that distinction has been important, and I'm working on a project converting thousands of complex SQL queries.

The thing that really bugs me about NULL is the default assumptions: 99.9% of the time I want NULL in a string context to be the empty string and NULL in a numeric context to be 0, but I have to use ISNULL or COALESCE to get that. I wish it were the other way round, where NULL is treated as '' or 0 by default, and I could do something special if I really want to test for NULL-ness.
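The workaround in question (hypothetical columns):

  SELECT COALESCE(middle_name, '') AS middle_name,  -- NULL string -> ''
         COALESCE(discount, 0)     AS discount      -- NULL number -> 0
  FROM customers;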


Interesting, we must work with really different data/queries.

I've come up against the distinction lots of times (and yes have to retrieve additional fields in order to address it), while I don't ever want to confuse 0 with NULL. Tons of things are legitimately zero but crucially non-null, like an inventory count. (Empty strings, on the other hand, do seem much more interchangeable with NULL in probably the vast majority of contexts.)


'' is NULL in Oracle. People complain about it, and it always seemed a little weird to me.


As another commenter said, if your tables have required ID columns, you should specify them in your SELECT.

     SELECT x.Id,x.a,y.ID,y.b 
     FROM x 
     LEFT JOIN y 
       on x.a = y.a    
The output [Id12345,5,Id6789,NULL] means the column is null. The output [Id12345,5,NULL,NULL] means there is no matching row.

On a side note, I use the SQL below all the time to identify missing rows in the "comparetable".

  SELECT #basetable.columnname
  FROM #basetable
  LEFT OUTER JOIN #comparetable
  ON #basetable.columnname1 = #comparetable.columnname1
    and #basetable.columnname2 = #comparetable.columnname2
    and #basetable.columnname3 = #comparetable.columnname3
  WHERE #comparetable.columnname1 is null


Fabian Pascal made a whole career out of writing about how stupid "NULL" is in SQL. See if you can find his writings some time. He was positively livid about it.

Notably, NULL of any kind is completely absent from the relational algebra. There are other ways to express absence. This is all supposed to be first-order predicate logic <hand wave hand wave> stuff, and in that world "null" is also not a "thing".

Also SQL insists on allowing duplicate "rows" in its "tables"; whereas the relational model is very clear that the contents of a relation ("table") is a set, not a bag.

These two confusions in SQL actually lead to many many data integrity problems as follow-on effects and also complicate query planning and optimization.


> I have no way of knowing if that means there's a record [5, NULL] in table y, or if there's no record in table y that starts with 5.

You can always add SELECT y.a, which will allow you to disambiguate between the two options (it will be non-NULL in the first case and NULL in the second).


You need one more output column. The specifics will vary by RDBMS because SQL isn't that standard. The most portable thing would be to do this:

  SELECT x.a, y.a IS NOT NULL, y.b
  FROM x
  LEFT JOIN y ON x.a = y.a;
assuming y.a is a NOT NULL column, or a PRIMARY KEY column, since PKs are supposed to be non-nullable.

In PG you could:

  SELECT x.a, y IS DISTINCT FROM NULL, y.b
  FROM x
  LEFT JOIN y ON x.a = y.a;


The predicate corresponding to an outer join is fundamentally disjunctive in nature (any outer join is equivalent to a UNION of at least 2 SELECTs) meaning that "after the fact", there is no way to determine which particular one of the disjuncts made the row appear in the result.


Why does it kill you? A NULL in a row is semantically identical to the row never existing in the first place, unless there is some other non-null column (the primary key) that lets you disambiguate between the two cases. It's hardly a problem in practice because all good tables should have primary keys.


This is why JS has both null and undefined.


Ah, JS, the pinnacle of data correctness. 1 == "1" results in undefined, right?


nope 1 == "1" is true and 1 === "1" is false


How do you expect `y.b` to behave on EMPTY rows? Should it return EMPTY or NULL?


you should not need to distinguish between empty and null because they mean the same thing. if you want to select records from x that aren't in y you should use an anti-join (where not exists, not in, etc.)


This is not true--especially when considering different SQL implementations (e.g. Oracle SQL versus Microsoft SQL). NULL and EMPTY handle the intersection of ontic versus epistemic claims.

EMPTY implies a known, 0-byte value whereas NULL can imply either an unknown value or the "unknowability" of a value (i.e. the in-existence of a value).

In practical terms, this would be like equating the statements "I don't know whether that dog has a name" (i.e. a NULL name) and "I do not know the name of that dog" (an EMPTY name). The former does not assert the existence of a proper noun to represent "that dog", whereas the latter implicitly asserts that there exists a proper noun which nominates "that dog".


beg pardon I wasn't aware sql actually defined empty. of the engines I'm familiar with there is no keyword for empty like what gp is asking for (and I hold my argument for why it should not be necessary). it sounds like what you're describing is a blank string and I agree with using blanks to distinguish between "has no phone number" and "we don't know their phone number" but that isn't what gp was talking about.


Also, you should select y.a to know whether the y-record exists.


Good point, it would be nice to have both NULL and EMPTY to split these meanings.


To piggy-back, it bothers me so much that this is valid syntax in many implementations:

UPDATE x SET a = NULL WHERE b IS NULL;

Like... wat?


Why are you joining on a nullable column in the first place?

If your database is well designed, joining on a nullable column should be a relatively exotic use case, and can be handled by writing a tiny bit more code to check for NULLs before performing your join.


It's not.

Assume y.a is the non-nullable primary key, while y.b is the column that may be null.

But in this example there might not be a row where y.a = 5. Or there might be. But you can't tell.


But in those cases, you can always tell by including the non-nullable key in your result set. It's already being evaluated, so it's virtually free (only adds to the network transport size).


The only thing I would really blame solely on SQL is that UPDATE and DELETE statements don't require you to specify a limit.

I have seen many times in my career where a rogue DELETE just truncates a table. A simple statement of intent (e.g. LIMIT 1) would tell the query planner that if it is about to update/delete more than 1 row, it should error. In fact MySQL actually returns a warning if you do this.
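For reference, MySQL (though not standard SQL) does accept a row limit on single-table DELETEs; a sketch of the guard-rail idea with a hypothetical table:

  -- MySQL: deletes at most one row, silently.
  -- The proposal here is that exceeding the limit should be an error instead.
  DELETE FROM users WHERE email = 'x@example.com' LIMIT 1;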

TRUNCATE clearly states your intention to delete everything, but DELETE by default also deletes everything.

There were also definitely some bad paradigms invented as a result of SQL (e.g. all business logic lives in the database as stored procedures or stored functions), but that was bound to happen whenever a shiny new tool comes out and engineers want to use it.


> TRUNCATE clearly states your intention to delete everything, but DELETE by default also deletes everything.

I think the other way around.

“DELETE Something” to me means get rid of the whole of Something. “TRUNCATE Something” (in general English, not SQL) to me means to make something shorter, which probably doesn't mean removing it completely.

Perhaps “EMPTY Table” would have been a better choice than “TRUNCATE Table”, but by the time it reached the standard¹ TRUNCATE had long since been picked, and until then it was not the responsibility of the SQL standard: you only had DELETE, and without a filter it makes sense that this deletes all.

----

[1] note that TRUNCATE was not in the SQL standards until SQL:2008, long after it was common in various DBMSs as a minimally logged alternative to DELETE-without-filtering-clauses.


> TRUNCATE clearly states your intention to delete everything, but DELETE by default also deletes everything.

btw. they are completely different.

TRUNCATE basically ignores all transaction semantics for the sake of performance (which can be really, really bad): in MySQL it can't be rolled back, and in Postgres it's not MVCC-safe. TRUNCATE most often takes an exclusive lock or uses some other such mechanism.

If you know what you are doing, TRUNCATE can be helpful, but DELETE has its benefits as well, and TRUNCATE should be avoided by beginners.


The other important factor is permissions.

TRUNCATE in MSSQL requires ALTER permissions, so a user that can wipe a table can also change its definition. Not true with DELETE.


"There were also definitely some bad paradigms invented as a result of SQL (e.g. all business logic lives in the database as stored procedures or stored functions),"

I hope you don't mean by that that integrity enforcement (checking the compliance of the data with ALL declared business rules) does not belong in the DBMS. Because the exact opposite is true : inside the DBMS is the only place where it belongs. Or if not the only, then certainly still the primary.


Require?

Some UPDATEs don't allow me to specify a limit!

Infuriating stuff.


So don’t COMMIT your UPDATEs.


While all the criticism is probably correct, and SQL definitely shows its age, it's still very much good enough for pretty much all its current applications. Also, SQL's basics are very easy to learn. These aspects make it very hard to imagine a language that might gain enough traction to actually replace SQL in a foreseeable future.


Many have tried. Most are hobby projects. None have had the full force of a real production ready DBMS behind them.

And really, projects like this are fighting the general ignorance the industry as a whole has about what a relational database actually is. It's better since the NoSQL wave crested and we stopped hearing stupid shit like "my data isn't tabular and doesn't fit in a schema", but there's still a preponderance of people who haven't stopped to learn the lessons that Date and Codd tried to teach decades ago, and constantly reach for graph/network/hierarchical databases without understanding why we originally tried to move beyond those back in the 70s.

Unfortunately, because of this, most attempts to do "better than SQL" end up not working from first principles and look like a dog's breakfast of ill-formed "that'd be neat" ideas.

And, yeah, as I said the "pure relational" projects that are out there are usually either academic projects or hobby. (Yes I've had a few of my own)

I think there might be hope in the Datalog-type systems that have emerged recently.


:-)

And those that are built on "first principles" end up looking like a dog's breakfast just as well because people expect to not have to learn a novel language.

People don't want to be told to forget everything they thought they knew. People don't want to be told they've been wrong for 40 yrs. Even if you add that it's not entirely their own fault because they've been consistently misled by an entire industry.

You can crack the problem of supporting CREATE ASSERTION, they'll still dismiss it (handwave handwave) believing you're just bluffing.

And I mostly disagree with the optimism built on Datalog.


Any alternative to SQL has to transpile to SQL in order to gain traction.


Which right away rules out a whole bunch of more sophisticated and elegant behaviours, honestly.

The other alternative might be to implement one's new thing as a patch to alter the frontend of Postgres. I looked at this many years ago and the engineering effort was immense. But it might be easier now.


Why does it rule out 'more sophisticated and elegant behaviours'?


If your language has to interoperate with data in the SQL database, it right away has to deal with SQL's limitations and oddities:

1) SQL's type model has no concept of nested relations or rich datatypes; it's limited to basic scalar types <tho there are vendor additions to this>, so implementing e.g. Date's concept of "relation valued attributes" <essentially tables-in-tables> would be impractical

2) SQL's storage model doesn't map 1:1 onto e.g. a relational model because SQL tables are bags instead of sets (duplicate rows are possible; see the sketch after this list). This has a couple of downsides: a) whatever more expressive, relational-algebraic language you built would have to inherit that model and have some way of dealing with bags instead of proper relations <which are sets>, and b) the possibility of dupes actually makes some query planner optimizations impossible to perform. <CJ Date goes into more depth explaining this, I don't have room for it here>

3) Likewise similar problems with SQL NULLs. NULLs are not in the "pure" relational model, so your language can't be pure. NULLs mess with the type system, and you end up inheriting a bunch of the weird casting/identity/comparison issues that come with NULLS. And NULLs, again mess with the query planner.

Finally, the execution model of SQL is always against a DBMS table or view. So in-memory relations, etc. look difficult to implement.
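A minimal illustration of the bag semantics in point 2 (hypothetical table):

  CREATE TABLE t (a INT);         -- no key declared: this table is a bag
  INSERT INTO t VALUES (1), (1);  -- duplicate rows are accepted

  SELECT a FROM t;           -- returns 1 twice
  SELECT DISTINCT a FROM t;  -- set semantics must be requested explicitly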


Thanks for listing these. It's very useful.

(1) is a function of what SQL you might transpile to. Any SQL w/ rich types (like PG) or w/ JSON support (most others) will be able to meet (1) with varying levels of hackiness.

(2) even if some SQL RDBMS allows you to have tables w/o primary keys, the thing transpiling to SQL wouldn't be obligated to emit tables w/o primary keys. Some SQL RDBMSes effectively never let a table lack a key (SQLite3, e.g., gives every ordinary table an implicit rowid), but even this wouldn't be a problem, if one wanted tables-with-dups, because one could use hidden row IDs to make up for that.

(3) just as with (2), the new thing transpiling to SQL wouldn't have to allow NULLs, and could generate SQL that uses `IS [NOT] NULL` / `IS [NOT] DISTINCT FROM NULL` to obtain the desired semantics if the new thing kept any notion of NULL (see the sketch after this list).

(4) is not a semantics issue, and anyways, some SQL RDBMSes let you implement tables yourself, and for the others there's CTEs and TEMP tables.
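To illustrate (3), a sketch <`==` being a made-up total-equality operator in the hypothetical new language, t a hypothetical table>:

  -- new-language `a == b` could transpile to the NULL-safe comparison:
  SELECT * FROM t WHERE a IS NOT DISTINCT FROM b;
  -- TRUE when both a and b are NULL, unlike plain `a = b`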

Of these only (1) is a problem if you want to target pure standard SQL, though it depends on which SQL standard you target.

We're definitely not talking about transpiling to 1980s SQL!

So, while there might be some semantic impedance mismatches for a transpiling approach, and the choice of RDBMS to transpile to matters, on the whole I'm unconvinced that any of these problems are very serious. My own imagination is failing me in my attempts to come up with such semantic impedance mismatches.


Interesting. Ok

> (3) just as with (2), the new thing transpiling to SQL wouldn't have to allow NULLs, and could generate SQL that uses `IS [NOT] NULL` / `IS [NOT] DISTINCT FROM NULL` to obtain the desired semantics if the new thing kept any notion of NULL.

You may disallow nulls stored, but anything like a non-inner join will produce nulls and then you're almost back to where you started.

> (4) is not a semantics issue, and anyways, [where 4 is "Finally, the execution model of SQL is always against a DBMS table or view. So in-memory relations, etc. look difficult to implement"].

Spot on and great catch. I missed that.

> I'm unconvinced that any of these problems are very serious

Let me help you - simple sqlNG commands could transpile into much more complex normal SQL target statements, breaking the optimiser. That's a very real risk.

Good post though


> Let me help you - simple sqlNG commands could transpile into much more complex normal SQL target statements, breaking the optimiser. That's a very real risk.

Breaking the optimizer is a big deal, yes. I have this fantasy that we could have an out-of-band hinting system for SQL, like "start the query plan with table source <alias>", or "use index <index-name> for table source <alias>", or "materialize table source <alias> and create these indices for it", or "index the CTEs by..." (the last one should really be automatically done by the optimizer, but when you can save it thinking time, that can be a good thing). In-band hinting has generally been problematic. Out-of-band hinting could either follow the query itself, or be provided in a SQL exec API as a separate argument, or as explicit method calls in query setup in LINQ-like systems.
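Purely as a sketch of that fantasy (none of this syntax exists anywhere; every name here is made up):

  -- the query, submitted as usual:
  SELECT o.id, c.name
  FROM orders o JOIN customers c ON c.id = o.customer_id
  WHERE o.created_at > '2022-01-01';

  -- hints, supplied out of band (e.g. as a second API argument):
  START WITH o;
  USE INDEX orders_created_at_idx FOR o;
  MATERIALIZE c WITH INDEX (id);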

> You may disallow nulls stored, but anything like a non-inner join will produce nulls and then you're almost back to where you started.

Now that just points out that the problem arises naturally and isn't entirely SQL's fault. Null values are as much a problem for SQL as for C and many other languages.

The way modern programming languages deal with nulls is by insisting on full pattern matching on algebraic data types. So it follows that one might want to do the same in a new query language, and/or possibly as an extension to SQL. So this would be OK at the top level, but not as a subquery:

  SELECT a.name, b.thing
  FROM a LEFT JOIN b USING (id)
and one would be forced to check for absence (like nullity, but with the requirement that one must check every time it's possible that a value is absent). In SQL we have `coalesce()`, `IS [NOT] NULL`, `IS [NOT] DISTINCT FROM NULL`, and so on, so we can do this pattern matching, but we're not required to, and that might be the real problem with SQL and NULL values.

> Good post though

Ditto.


This is getting even more interesting.

> I have this fantasy that we could have an out-of-band hinting system for SQL,

I think not breaking the optimiser and out-of-band hinting appear somewhat orthogonal, but I'd be very glad to be wrong. I'm looking at OOB hinting for a different reason, but what you're thinking is very close to mine, and I'm struck by the fact you see hinting the same way I do (actual, and extensive, logical vs physical separation, incl. indexes as hints).

> and one would be forced to check for [nulls]

Even more interesting. I will shortly have something of relevance. Do you have a contact e-mail?


> I think not breaking the optimiser and out-of-band hinting appear somewhat orthogonal, but I'd be very glad to be wrong.

If you're writing a transpiler, you might as well make it an optimizer and use a solid hinting system to guide the target RDBMS' optimizer so as not to be at its mercy. If you were implementing from scratch rather than as a transpiler, you'd have to write an optimizer anyways.

Also, the reason optimizers can suck is precisely that there can be many ways to plan a query.

E.g., if you have 10 table sources INNER JOINed in sequence, you could start the query plan with any one of them, but not every one of those 10 tables will yield a good query plan when used as the starting table! You want to start such a query plan with the table that has a PK or index key that is most fully determined by the WHERE/ON constraints, by which I mean: the table source that will have the smallest cardinality given the constraints imposed on it by the query.

Sometimes the query that one has to optimize is a sub-query deeply buried in another, possibly in a VIEW, and the hinting needed may be specific to the context outside the view. Now inline hinting leads to having to inline sub-queries (repetition).

The many table sources INNER JOINed case is one I've actually run into.

Part of the problem seems to be that pushing constraints is a fairly advanced sort of query algebra: you need a good AST in order to be able to do it, and even then, it's not all you have to do to build a decent optimizer. If you look at the SQLite3 optimizer, it's gotten much better over the years at pushing constraints.

With every table source named, you can address hints about them, and you can do it out of band.

That or pick a target whose optimizer you can either work on or already is good enough for the transpiler's purposes. However, keeping things general is a good idea, especially when it comes to SQL (there's so much incompatible variation between SQLs!), so as to keep the transpiler generic and portable. The lack of a standard and solid hinting system is a real problem. Also, sometimes optimizers get worse.

> Do you have a contact e-mail?

  ${HN_username} @ gmail


SQL (the language that is, as defined by ISO) has COLLECT and UNNEST operators which come reasonably close (not 100% of course) to "Date's concept of relation-valued attributes". Moreover, I'm told those two operators are supported by at least one of the three big dogs.
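For a rough idea of the shape, Postgres spells the same nest/unnest idea with arrays (array_agg standing in for COLLECT; tables hypothetical):

  -- "nest": collapse orders into an array-valued column per customer
  SELECT customer_id, array_agg(order_id) AS order_ids
  FROM orders
  GROUP BY customer_id;

  -- "unnest": expand an array-valued column back into rows
  SELECT customer_id, unnest(order_ids) AS order_id
  FROM nested_orders;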


1) yeah, probably.

2) Suppose you're writing sqlng, compiling it to SQL. sqlng simply mandates a primary key on everything. Not a problem.

3) nulls can be a horror but the alternatives aren't pretty. Removing nulls means tables have to be broken up, see https://www.dcs.warwick.ac.uk/~hugh/TTM/Missing-info-without.... I believe there are better ways to handle this issue.

(final line) Makes no sense; a table doesn't have to be disk-resident. Once read, with sufficient memory, the file sits in RAM cache.


I second u/zasdffaa's question. What behaviors does this eliminate? I'm sure it eliminates some, but I want to hear which ones you're interested in.


Agreed - my experience with non-SQL tech (ORMs) with Django and ASP.net MVC made me really appreciate SQL so much. Most of the time I felt everything would be so much easier using raw SQL instead of dealing with model objects. It also felt like in their quest to replace SQL and make things "simpler" they were recreating some db features again.


It can depend on your perspective. If you look at an app and just see tables and data then yes, model objects stand in the way. If you describe your app as this thing that manipulates the database then I can see how a layer of abstraction can be annoying. But people write 'billing apps', not 'data manipulator apps', so there is this pesky business logic. ORM helps you separate business logic from data access logic. If you don't need this separation then it just stands in the way. ORM is not perfect, but most abstractions leak to some extent.

I think it also depends on how the 'model objects' are implemented. Is it Active Record or Domain Model? I think most of the complaints about ORM are actually complaints about Active Record / DAO, or just a crappy implementation. Also nothing wrong with using SQL in an otherwise ORMed application. Just harder to unit test, refactor etc.


Yep. That's why I write most of my logic in stored procedures. Working with tables and queries is so much easier in PL/pgsql than dealing with ORMs and their leaky abstractions.

My application code just calls stored procedures. It's unaware of the tables and underlying data model.
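A minimal sketch of the pattern in PL/pgSQL (table and names made up):

  CREATE FUNCTION add_order(p_customer_id int, p_price numeric)
  RETURNS int
  LANGUAGE plpgsql AS $$
  DECLARE
    v_order_id int;
  BEGIN
    INSERT INTO orders (customer_id, price)
    VALUES (p_customer_id, p_price)
    RETURNING id INTO v_order_id;
    RETURN v_order_id;
  END;
  $$;

  -- the application never touches the tables; it only does:
  SELECT add_order(42, 19.99);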


There is no one-size-fits-all, but most of the time I would be against SP because:

1) SPs usually mix persistence concerns with business logic, making it harder to understand business intent. I find objects much more expressive than raw data. Sometimes you want to concentrate on the plain logic, without worrying about how something gets saved. Also, you will not have to rewrite everything if you ever want to change how something is stored. Sometimes people switch from MySQL to PG, or even to a document or KV database. Keeping business logic separate enables this. And a good ORM enables this separation.

2) Refactoring tooling and unit tests. SPs are lacking here significantly compared to general purpose languages.

3) Business logic outside of DB allows easier horizontal scaling.


In my experience, the application framework changes much more often than the database.

PHP, ASP, ASP.NET, Angular, React, Django, whatever. Pick a year, pick a framework. The database stays steady.


I agree. But I found that approaching it _as_if_ the database will change makes for a more focused model. Same goes for UI frameworks. Basically striving toward Hexagonal architecture, without being too dogmatic.


Cool! I do this, but I haven't seen anyone else do it. Is any of your code public?

I wrote about it at https://sive.rs/pg

and posted my SQL shopping cart at https://github.com/sivers/store

Please contact me if you'd like to share tips: https://sive.rs/contact


No I don't really have any public code.

I read your blog post and I agree 100%

The database is the easiest place to code business logic that pertains to the data. Write it once, it's available for all client applications.


COBOL also fits this description, down to the awkward attempts at matching natural language.

I think the staying power of SQL comes from its broader audience - programmers, analysts, and executives all use it. It's much easier to motivate programmers to learn a new, superior language than it is to motivate executives to learn a new technology that only gives ergonomic improvements.


Many of those complaints seem theoretical. I like to focus on practical concerns. The biggest problem I see is that the SQL language has grown too complex. It's related to the "Lack of Orthogonality" problem mentioned in the article, but I see different solutions. SQL is not based on combinations of simpler concepts, but on hard-coded keywords. But how to orthogonize (factor) it gets into philosophical differences. My favorite alternative is an experimental language called SMEQL (Structured Meta-Enabled Query Language):

https://wiki.c2.com/?TqlRoadmap

It's more API-like to reduce the need for specialized keywords. And you can "calculate" column lists via queries instead of having to list columns. For example, if a table has 60 columns and you want to SELECT all 60 minus two columns, you can't without listing all 58. With SMEQL you can use a query to return a list (table) of those 58 and feed it to the equivalent of a SELECT clause.
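For what it's worth, a couple of dialects have since grown exactly this shorthand (syntax from memory, so check your vendor's docs):

  -- BigQuery:
  SELECT * EXCEPT (col_a, col_b) FROM wide_table;

  -- DuckDB:
  SELECT * EXCLUDE (col_a, col_b) FROM wide_table;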

Things like CREATE TABLE are fed a table if you desire so you can "compute" your schema. You can generate entire schemas from data dictionaries. Anything that can be done with a table is. You can create in-line (virtual) tables if you want it command-driven, but it's still "using tables to do everything". You can use textual code to create such virtual tables or a table editor as it fits the situation. SMEQL is as close to "everything is a table" as I've seen. Even your FROM list can be table-ized. I used to do similar with dBASE/xBASE, and it was really nice, especially for ad-hoc work such as one-off research requests.

And as somebody mentioned here, null handling needs a serious revisit in SQL. I could rant all day about SQL null handling, especially for strings.


"Many of those complaints seem theoretical. I like to focus on practical concerns."

The only reason ever why people engage in theory is precisely a deeply rooted desire to address "practical concerns". Hawking literally stated that his aim was to understand everything about the entire universe.

People who "like to focus on practical concerns" see an apple fall and (after they've seen it happen often enough and never ever otherwise) ultimately conclude that "apples fall and that's just how it is". Newton saw an apple fall and ultimately unveiled the concept of gravity. In doing so, Newton also unveiled the (admittedly extremely exceptional) circumstances in which it might be possible for us to see an apple "fall upward into the skies" after all.


Many "overly educated" (for lack of nice way to say it) just seem to often waste time on idealistic issues. It would take a while to give a decent example, but they just seem to not understand the business world and capitalism. The Newton example is not applicable because it's usually not about discovery, but weighing tradeoffs. I don't mind their personal R&D, just don't make the customer pay for or wait too long for the R&D. After work, they can invent all they want.


I agree!!! This is my pet peeve as well, and I sometimes fantasize about ripping into PostgreSQL and adding column-minus.

One wrinkle: computed columns would interfere with query optimization. That said (and here I speak heresy) there are times when syntactic convenience trumps performance.


Meta ability (dynamism) will generally have an average performance penalty. If you want speed, you hard-code more up front. Meta-ability generally is best for ad-hoc and short-term uses.

One performance trick is for the query optimizer to assume the calculation will be the same as the last time the same query or sub-query was issued, and simply dump the results and start over if that assumption is false. If it keeps having to dump, then it assumes you are editing a lot and stops guessing for a day or so. (You could issue an optimizer command to resume guessing if need be.)


The syntax could be "* (minus, columns, here)".


PRQL (Pipelined Relational Query Language) to the rescue!

https://prql-lang.org


> "we didn’t realize how truly awful SQL was or would turn out to be (note that it’s much worse now than it was then, though it was pretty bad right from the outset)."

I wish the "truly awful" stuff I come up with was 0.1% as successful as SQL.


And Tony Hoare probably wished he hadn't made his "billion dollar mistake".

But at any rate, you could also try and ponder how unbelievably, exceptionally awful whatever existed before SQL must have been, if something as truly awful as SQL could still be as successful as it has been.


I once read an article about SQL and how reordering the sections of a query would make it more ergonomic for users (iirc things like specifying what you want first and then how to present it last, more like a pipeline). I searched many times over the years but have not been able to find it again.



Perhaps? I remember the article arguing more comprehensively why a different order would be beneficial, but maybe I am just misremembering


I always appreciate blog posts like this. There are obviously cases where SQL shines, and in part I think the dataframe abstraction helps fill a lot of the missing pieces that SQL doesn't handle so well (composability, debuggability, interactivity, etc.)

Even pandas (with all its faults) is more flexible as a language than SQL[1]. I'm of the opinion that there's a better together story in the end, but I guess we will see.

[1] https://ponder.io/pandas-vs-sql-food-court-michelin-style-re...


qSQL, based on the concept of ordered lists, is more appropriate for many queries; examples available here: https://www.timestored.com/b/kdb-qsql-query-vs-sql/

Kdb, the system qSQL runs within, allows full use of variables and all builtin functions with tables/functions/variables/columns. It really is a case of less is more. What this allows is functional form queries. Imagine being able to query: ?[tableName;lessThan(age;10)] and have a perfect functional form representation for all queries. No ORM, no string concatenation. It seems some other database creators are at least becoming aware of these things and integrating parts.


I really never want to assume order in tables. That's because there could be many orders of interest, and that's one reason to have multiple indexes. And also because for a table, the order of rows can be hard to guarantee. Obviously something like qSQL would try hard to guarantee table row order, and that's great in the cases where qSQL is useful, but more generally it's going to impose on how you work with the database.

On the other hand, SQL is set-based. But being set-based is weird in the world of computers because in memory everything is ordered. Sets are a fiction in programming, as they're always ordered in some way, and the set abstraction can only try to hide that order.

And that order can be very useful.

But there can be many orders that can be useful, but only one in which things are stored in memory -- the others can only be extra indexes.

So the general purpose thing (SQL) has to be set-based, offering explicit ordering, and taking advantage of actual order for optimization.


SQL is having somewhat of a moment in the big-data world, thanks in part to the 'modern data stack' and new-age data warehouses like Snowflake and BigQuery.

However, there is a lot of pushback from 'traditional' data engineers who were trained on Spark/Scala. It's a bit of a hard sell to go from a highly typed language to free-for-all text-based logic.

I think the following is needed for SQL to finally be accepted as a 'serious' contender.

Create a compiled SQL language (not pandas):

1. It compiles to SQL and addresses some of the issues brought up in the post, like nested aggregations.

2. Make code reusable. E.g.: apply year-over-year growth to a table that has the requisite columns. The compiler should check this in the IDE.

3. Make materializations a first-class concept in the language. No separate dbt layer.

4. Create a way to package and distribute libraries that you can import into your project.

5. A unit testing framework that makes it easy to test the logic without having to set up test tables in the database.


> I think the following is needed for sql to be finally accepted as 'serious' contender.

Whatever way you slice it, SQL is one of the most used languages today [0].

[0] https://spectrum.ieee.org/top-programming-languages-2022


Right. I get lots of pushback for using SQL at my clients. They just defeat me with one single point: "where are your unit tests" :D


> where are your unit tests

I do unit testing in SQL, and something I'm working on and use extensively myself (https://www.npmjs.com/package/sql-watch) indirectly supports unit tests.

There are also SQL testing frameworks available.


Yea, I do that via dbt by setting up mock data tables in the database and using a macro to use those as sources/refs when run in test mode.

However, what we are doing here isn't 'unit testing'; it's black-box integration testing. When I write equivalent code in Scala, I just test the logic via unit tests,

e.g., for logic to filter out some orders that don't qualify, I extract a method in the Scala code and just test that logic as part of the development lifecycle. There is no dependency on a database.

The analog here is using Selenium or some UI testing framework to test if a button turns blue when an order exceeds a limit. That's not unit testing.


> I extract a method in the Scala code and just test that logic as part of the development lifecycle. There is no dependency on a database.

Unit testing seems to depend on where the unit of code is which is being tested. At the middle tier, you may mock out parts of the code so the tests aren't reliant on external sources (apis, databases, libraries, etc.).

It seems that unit testing database code would happen at the database layer: it's still a unit test as the test isn't dependent on external sources.


> It seems that unit testing database code would happen at the database layer:

It's not database code though. "Orders over $100 should be marked as VIP" is domain logic.

Database code like interactions with the database, connection pools, primary/secondary switching etc. - yes, those should be tested with a database.


I was going to say the same. With the prevalence of embedded databases, and how cheap it is to stand up a container with non-embedded options, build time testing of queries has never been easier.


Dbt to the rescue!


I think these are great suggestions.

It seems like you're suggesting that someone could design a functional-style programming language that compiles to SQL.

2 & 3 are my biggest pain points. I can't just extract functions like I can with a regular programming language. Instead, SQL queries get increasingly complex with no great tools to manage that.

For 3, products like https://materialize.com/ look interesting for being able to create derived materialized views that can efficiently be kept up to date.


I'm not familiar with Ecto, dplyr, or DBT, but I would love an ML-like language to replace SQL. I'm imagining being able to pass a table (or any table-oriented data, like a sub-query) to functions that would type-check columns and would return table-oriented or scalar data. I'm not sure if this is actually possible in practice, but one can dream.

For instance, a "top 10" function that could be re-used on any table (apologies for my pseudo types and code):

  selectTop10 : Column -> Table -> Table
  selectTop10 orderByColumn table =
    SQL.selectAllFrom table
    |> orderDescBy orderByColumn
    |> limit 10

  limit : Int -> Table -> Table
  limit n rows =
    SQL.limitBy n rows

  orderDescBy : Column -> Table -> Table
  orderDescBy orderByColumn rows =
    SQL.orderBy [orderByColumn] SQL.Ordering.Desc rows


Not 100% sure about what you're suggesting, but wouldn't it be easier to pass your functions to your table/sub-query?

And that's exactly what you're able to do in most of the modern data warehouse services such as Snowflake. Inferences can be contained within internal/external user defined functions.

This is very reminiscent of what made the big-data/map-reduce movement so notable: sending your query to the data instead of moving your data to the query. Sending your model to the data, instead of sending the data to the model.


I think I see what you're saying, and in my idea I'm suggesting the same - sending the function(s) to the data.


I don't quite follow what you are saying. In some SQL engines you can use row_number() function and derived tables (or CTE) to get top 10.

  SELECT Id,col1,col2
  FROM
  (
    SELECT 
    Id
    ,col1
    ,col2 
    ,row_number () over (partition by columnkey1, columnkey2 order by anycolumnwilldohere desc) as _row
    FROM _table
  ) as anytablealiaswilldohere
  WHERE _row <=10


My idea is you wouldn't have to cut and paste implementations like this. You would be able to just call the function with a column name and table name.


I'm currently doing work that uses Elixir's Ecto, which goes a great deal of the way towards what I think you're aiming at. I can write my SQL in a familiar, yet functional and composable style, while knowing what SQL will be produced in the end. Ecto, as I understand it, was inspired by Microsoft's/C# LINQ. I've not worked with that, but have heard similar praise for it as exists for Ecto.

I'm saying this with most of my experience being in SQL.


> It seems like you're suggesting that someone could design a functional-style programming language that compiles to SQL.

Not exactly but sort of this:

http://blog.hydromatic.net/2020/02/25/morel-a-functional-lan...


"someone could design a functional-style programming language that compiles to SQL". See e.g. R dplyr


DBT solves 2 & 3


Ibis might be an option. It has syntax similar to pandas and can compile to a number of types of sql, pyspark, or dask.

https://github.com/ibis-project/ibis


It's interesting that the author dismisses

>Mismatch with Host Language

But to me, this feels like the biggest problem with SQL. Yes, every language is different, but having different mental models nonetheless causes friction, and leads to things like ORMs, which are often even worse.


I think we can do much better than SQL without losing its inherent power. Projects like Prisma and EdgeDB make me optimistic regarding the future of relational querying languages


The title doesn't match the first statement, which states a math unfact:

>A Critique of SQL, 40 Years Later 08.11.2022

>The SQL language made its first appearance in 1974, as part of IBM’s System R database. It is now over 50 years later, and SQL is the de facto language for

1974 isn't 50 years ago yet :)


The title did not refer to years passed since the introduction of SQL, but to years passed since the publication of the paper with the title "A Critique of the SQL Database Language", i.e. from 1984-11.

So the title is correct.


Title may be correct. First sentence is totally wrong. Why would I read further?


Author here. Fixed the mistake. You may now proceed with reading the remainder of the post ;)


It's off by two years. People often round 48 to 50. I don't mind TFA rounding 48 to 50; you seem to, so don't read it if you don't want to.


If it'd been 1971 and TFA still said "50 years later", would you have responded that "1971 is more than 50 years ago"?


I wish there was some smart machine-learning technology that would take my shitty SQL developer queries and make them more performant - sometimes it is like sitting in front of a bubbling cauldron and invoking magic incantations.



All SQL databases implement the actual Relational Calculus internally - it's required to implement a SQL optimizer.

SQL is just a language for submitting Relational Calculus to the database (+ DDL statements).

If you wanted to, for instance, you could add a language alongside SQL in Postgres, submit the results to the internal RC optimizer, execute the optimized query, and get back the results.

In your new language, you can address all of the issues Carlin Eng/Chris Date identify in the article.


In "Stating the obvious", C.J. Date wrote 50 (FIFTY) pages solely on the subject of how SQL screwed up with the most fundamental operator of all : testing for equality. I'll repeat : FIFTY pages about everything that SQL got wrong with just the equality operator.

Go figure about the rest ...


I wish there were SQL "primitives" functions instead of the SQL language.

For example, if I want to pick a single row by id, with SQL I must send a query string, which results in parsing, which means higher latency.

If I want to randomly select 100k rows among a database of 1 million entries, I need to build 100k query strings (I think?), which won't be fast to parse. I don't think this happens when using C pointers in C, C++ or arrays in other languages.


Isn't that what prepared statements solve? Only parsed on the first execution and after that only the parameters change.


Exactly. Prepared statements or stored procedures avoid the continuous re-parsing of identical queries.
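For example, in Postgres (users being a hypothetical table):

  PREPARE get_user (int) AS
    SELECT * FROM users WHERE id = $1;

  EXECUTE get_user(42);  -- parsed and planned once, executed many times
  EXECUTE get_user(43);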


If by 'randomly select' you do mean randomly, then TABLESAMPLE is what you want (although random means random pages, not rows, and the amount returned may be over or under)

   SELECT *
   FROM Sales.Customer TABLESAMPLE SYSTEM (10 PERCENT) ;
for example. You can of course specify by rows instead.


To get 100k random rows in T-SQL it's:

SELECT TOP (100000) * FROM tblNm ORDER BY newid()


Why 10k queries? You can probably do this with one query.


I think the issue isn't SQL, but the table paradigm for storing data. Humans do not store data in separate tables that need joining; they store data in a fully connected graph (hypergraph). It's about relationships and hierarchies - the graph allows incredibly fast hierarchical reasoning, as most things involve hierarchical reasoning. The relational table, and SQL by extension, are terrible at the human approach to working with data, especially hierarchies.


It's the other way around.

We use tables because they help us to massively simplify the processing of data.

Do you want to deal with graphs? The algorithms involved are harder than what your intuition is telling you.

In fact, the first databases, before relational databases were in vogue, were hierarchical/network databases.

https://en.wikipedia.org/wiki/CODASYL

They were obsoleted and replaced when RDBMSes appeared.


You are talking past the other person... too much certainty, too few questions.

CODASYL is probably not remotely close to what the GP is talking about w.r.t. how humans think in graphs.


I work extensively with Splunk, which is predominantly based on noSQL underneath (MongoDB, among other proprietary technologies).

I've also recently been [re]introduced to graph databases (which are highly similar to the pre-relational network database paradigm)

You can simulate graph relationships with an RDBMS or noSQL - but you shouldn't

You can simulate an RDBMS with a graph db or noSQL - but you shouldn't

You can simulate noSQL with graph and RDBMS tools - but, again, you shouldn't

They all have their place - and can even work quite well together, if you use them as they're intended.


> You can simulate noSQL with graph and RDBMS tools - but, again, you shouldn't

What is the feature that makes noSQL something you shouldn't do in a relational database? To me noSQL always looks like a subset of a relational database. The only thing, maybe, is that you can truly put everything in, but with modern JSON features and all the other things I don't see a downside to using relational. (Except a little learning curve, though that can be hidden behind some ORM or something if you really want, for the beginning.)


The "schema on the fly" is the draw for noSQL - nothing has to be rigidly predefined like an RDBMS mandates


CREATE TABLE foo (doc JSON);

done. If that's really what you want.


For every bit of JSON that comes in?

No

And then having to figure out what all the fields are after the fact?

No

That's why noSQL exists


What is the thing noSQL brings to the table which a relational system doesn't have? (Except that writing a depth search in a JSON document in SQL is a bit more cumbersome, but if that is a concern it can be wrapped in a data access library that generates the SQL.)


I think I've explained that above - it's not having to be rigidly tied to a schema :)


Not that my "schema" is really rigid ... just a little more convoluted syntax than "createCollection" but underneath quite similar.


But that's just it: having many silos for connected information is just not the way we humans do it, nor is it necessary. A single conceptually indexed space/time hypergraph (as we humans do it) for any part of the world is all you need, scalable up to many billions of edges for that one part of the world.


having many silos for connected information is just not the way we humans do it

What do you mean? I can think of many silos containing connected information: municipal residence records, marriage registries, birth records, police reports, tax records, medical records (for each hospital) are all silos connected by a citizen's identity.


You missed the "humans" part of my statement. We humans mentally have a fully linked graph (hypergraph) of what we know through space-time and conceptually indexed. We do not have a whole bunch of disconnected data silos in our minds. Modelling the very successful human approach to data and reality modeling may make more sense then using these 50 year old tables and disconnected silos?


There are more recent aspects of SQL that do hierarchical processing well. Graph processing is often decent depending on the particular DBMS. The syntax and conceptualization of these queries is not as well socialized though.
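Recursive CTEs are the main one; a sketch against a hypothetical self-referencing employees(id, name, manager_id) table:

  WITH RECURSIVE reports AS (
    SELECT id, name, manager_id
    FROM employees
    WHERE id = 1                      -- root of the hierarchy
    UNION ALL
    SELECT e.id, e.name, e.manager_id
    FROM employees e
    JOIN reports r ON e.manager_id = r.id
  )
  SELECT * FROM reports;              -- the whole subtree under employee 1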


"Most things involve hierarchical reasoning" begs the question: what is the universe of processing you assume?

Many would counter by saying basic index lookups comprise a vast majority of DB processing time.


SQL doesn't specify how data is stored; it's only a description of the data you want. How your data is connected (or not) at rest is up to the engine and how you leverage its engine-specific features.


Sort of. "Does not specify" does not mean there is no connection between logical and physical layers. The relational model has strong implications for what a storage engine should do well.


> The relational model has strong implications for what a storage engine should do well.

Agreed - the implication is that the engine should do everything well. RDBMSes are a product of being pulled in every direction; there's generally an index strategy for everything, including fully connected hypergraphs, and that's how they ought to be, because we can write wildly complex recursive SQL queries and the goal is an efficient, ideally optimal, execution plan.


> Agreed - the implication is that the engine should do everything well.

I don't think you mean "everything, equally well".

But if you mean that database vendors feel pressure to have their "RDBMS" be able to "be competitive" across a wide variety of use cases and metrics, I tend to agree.


Do you think computers should try to store data like humans? Why?

Perhaps you don't mean storage but rather a conceptual model?


My problem with SQL is that it's 99% Excel and 1% a database query language, and the boundaries between the two are often not well defined. Why even have a SUM function when it is not accelerated by an index? That makes using it automatically non-scalable. I think there should be a core language, perhaps similar to SQL, perhaps not, as an interface to pure DB functionality; all the other 99% could be done in Excel, and while it's nice to have, it should be more clearly separated.


> Why even have a SUM function when it is not accelerated by an index? That makes using it automatically non-scalable.

Because it's useful, and that last part isn't true? Databases do more than indexes, and there are trade-offs regarding the impact of adding too many indexes.

It's also not only not a problem for scalability but in reality an important way to _improve_ scalability. Consider a simple example where we do "SELECT SUM(price) FROM orders GROUP BY customer_id". Removing SUM() from the language would massively increase the amount of data which needs to be processed into the response — which makes anything else you're doing like sorting harder – and it prevents various optimizations the database engine might make. For example, on AWS RDS Aurora the database pushes all of that down to the storage nodes so each storage node will return the sums for the records which are on that node and the main node can sum those intermediate values without needing to transfer a single source row over the network. Since SUM() is part of the language and well-defined, that's safe for one database team to implement without the risk of their results not matching a competing database.

Other things to think about are computed views and stored procedures: in both cases, there are situations where these are dramatic performance improvements or otherwise desirable and having a richer language means that those constructs can solve more problems.


But even SUM in AWS RDS Aurora is not optimal in terms of performance. You could do much better by having partial SUMs in the index btree nodes, that way you could get the sum of billions of records without processing any row individually. CouchDB supports this (even though it's by far not the most efficient database). I wish Postgres and others provided access to that sort of low level map-reduce index functionality. Transferring large amounts of data in the response is solvable by using shared memory, so I'm not concerned about that. And a SQL like language could be provided on top of that. If a low level interface was standardized you could even have SQL like languages that are portable between DBs.


> You could do much better by having partial SUMs in the index btree nodes, that way you could get the sum of billions of records without processing any row individually.

It sounds like you don’t want a general purpose database. Something like what you describe might be more efficient on certain queries but only because it’s giving up a lot of flexibility - consider what happens if you add any filtering or grouping changes and those partial sums can no longer be used.

This preference for general systems rather than optimizing for a single query shows through in the tools used for this kind of task: materialized views and indexes using expressions. Those would allow the kind of precomputed aggregates you’re talking about without the kinds of challenges around flexibility and correctness your approach would encounter.
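For instance, in Postgres (orders being a hypothetical table):

  -- precompute the aggregate once:
  CREATE MATERIALIZED VIEW order_totals AS
    SELECT customer_id, SUM(price) AS total
    FROM orders
    GROUP BY customer_id;

  -- recompute on demand (Postgres does not refresh incrementally):
  REFRESH MATERIALIZED VIEW order_totals;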

> Transferring large amounts of data in the response is solvable by using shared memory

That’s a common optimization but it’s not a panacea any more than indexes are. For example, relying on that falls over badly once you need a network and the problems I mentioned before are still important — even if your shared memory implementation is perfectly efficient, you’re still looking at keeping data in memory longer so your client can aggregate it when the database engine could do that for you and discard the memory immediately.

Another way of looking at this is to ask why a field with billions in R&D hasn’t already done it, and especially why various big data tools have been significantly less impactful than promised. If you see what seems like an easy way to beat the current options, it’s usually a good time to ask why everyone working in that space hasn’t seen it.


> Something like what you describe might be more efficient on certain queries but only because it’s giving up a lot of flexibility - consider what happens if you add any filtering or grouping changes and those partial sums can no longer be used.

If you're doing a query that is not backed by an index, it's not scalable, so you could also do it in Excel.

> This preference for general systems rather than optimizing for a single query shows through in the tools used for this kind of task: materialized views and indexes using expressions.

Materialized views totally suck (at least in PostgreSQL), because you cannot update them incrementally (yet). A better option most of the time is triggers and aggregate columns, but that requires a lot of micromanagement.
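A sketch of that trigger-plus-aggregate-column approach in PL/pgSQL (tables hypothetical; note it only covers INSERTs - UPDATE and DELETE need their own handling, which is exactly the micromanagement):

  CREATE FUNCTION bump_total() RETURNS trigger
  LANGUAGE plpgsql AS $$
  BEGIN
    UPDATE customers
    SET total_spent = total_spent + NEW.price
    WHERE id = NEW.customer_id;
    RETURN NEW;
  END;
  $$;

  -- Postgres 11+ spelling (older versions use EXECUTE PROCEDURE):
  CREATE TRIGGER orders_bump_total
  AFTER INSERT ON orders
  FOR EACH ROW EXECUTE FUNCTION bump_total();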

> It sounds like you don’t want a general purpose database

I don't think I want a special feature; map-reduce indexes seem like a pretty general, "mathematically" sound thing.

> Another way of looking at this is to ask why a field with billions in R&D hasn’t already done it, and especially why various big data tools have been significantly less impactful than promised.

Well, that's what's exciting about research: some things haven't been done yet, even though they may seem fairly obvious.


Perhaps Excel is 99% SQL and not the other way around, especially since SQL is older.



