I really like SQL, and think it's one of the most important languages a dev needs to learn (all you managers, you could learn some SQL too!). However, the syntax is often very restrictive.
As an example, why would WHERE clauses need to have an AND instead of multiple WHERE clauses? When working with big tables, it's super annoying to select all but one column. So why not have syntax for this? E.g. SELECT !username from users could expand to all columns except username.
I feel SQL has so many low-hanging fruits from a user perspective, but it's very slow-moving, and the standard is even slower to make progress.
So SQL with sugar that compiles down to vanilla SQL has a market imho. I will study PRQL more and see if I could use it on a daily basis.
> Eg SELECT !username from users could expand to all columns without username.
This should be landing in PRQL soon. See ongoing discussion here:
[an exclude clause, that would select all columns except whats specified #172](https://github.com/PRQL/prql/issues/172)
The syntax we seem to be converging to is like you suggested:
select ![foo,bar]
would select all columns except "foo" and "bar".
The aim is to support this at least for SQL dialects that natively support it, like DuckDB, BigQuery, ... For other dialects we might just error out initially.
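For illustration, on dialects with native support the generated SQL could look roughly like this (DuckDB's EXCLUDE and BigQuery's EXCEPT; table and column names here are just placeholders):

    -- DuckDB
    SELECT * EXCLUDE (foo, bar) FROM my_table;

    -- BigQuery
    SELECT * EXCEPT (foo, bar) FROM my_table;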
There is a goal of supporting working against a live database, in which case the source table schema could be read and the appropriate SELECT clause generated. That would not be robust to schema changes such as future column additions, though. At the moment the SQL is generated in an offline setting, so this is not possible.
If you are a PRQL dev I have a suggestion for you/your team.
Make PRQL language agnostic. In real life, many people write SQL not in the app layer but ad hoc: for stats, reporting, analysis, examining the data, etc.
What i mean by this is:
If I write an app that has embedded SQL, I don't really care too much about the verbosity. That code will be used more than once, so I'm OK with taking the extra effort.
Also, every language has a query builder / ORM, so the benefit of something like PRQL is possibly not big enough to merit an additional dependency.
My suggestion:
Make PRQL a CLI tool that lets users connect to a database, in a similar fashion to something like usql (https://github.com/xo/usql).
I would find the most power in a tool like PRQL in this setting: ad hoc SQL queries.
This would open up editor integration. With a tool like this I could open vim, write PRQL in a window, and pipe it via something like Slime to an open PRQL session.
So something like a language server, but connected to a database and able to execute queries? That's an interesting idea.
While working on the compiler, I try to separate different components so things like this are possible. Currently, we are focusing on the language itself and are also putting some thought into the intermediate representation of the query (RQ - relational query). We are conservative about expanding the scope, because building database connectors and execution engines would take too much focus off the language and the compiler.
But I'd lie if I said that I don't want to dig into it.
But before that I would put some effort into the actual "repl/session"-like tool. Pipe a PRQL query to a PRQL session, which translates it to SQL and returns the underlying response from the database.
If done right, this would open the possibility to build all sorts of cool editor plugins, that could directly interface with the PRQL session.
IMHO this is where a tool like PRQL would shine, as it would make writing SQL more enjoyable and way faster, while being both language- and SQL-dialect-agnostic.
Seems a natural fit for a notebook UI. If a PRQL cell doesn't start with "from," just continue adding filters to the pipeline above. Would let you progressively build pipelines by adding filters and derivations, while previewing the data each step along the way. Split a cell to debug a pipeline at any point.
Thanks for the suggestion. I don't think I knew about usql. I completely agree with you and have been working on a CLI tool called `prql-query` (`pq` at the command line):
In BigQuery you can specify columns to include as well as exclude. For example:
select s.*, t.* except (s.foo, s.bar, t.blah)
from s, t
where ...
I've found this extremely handy for avoiding duplicate column names when joining tables. For example, trying to join two tables that both have an "id" column can be really tedious without this syntax.
The column list for a table is available in the information schema, so on DBs that don't have native support, just build the query with the expanded column list, minus the ones passed in to ignore. Using the information schema makes it DB-agnostic for "free".
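A rough sketch of that approach, using the standard information_schema (table and column names here are just placeholders):

    -- fetch every column except the ones to ignore...
    SELECT column_name
    FROM information_schema.columns
    WHERE table_schema = 'public'
      AND table_name = 'users'
      AND column_name NOT IN ('username')
    ORDER BY ordinal_position;
    -- ...then interpolate the result into the generated SELECT list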
SQL was meant to read like English, so two or more WHEREs didn't make sense.
Also, selecting all columns but one may not be future-proof if columns are added or removed; you really don't know what columns that query is going to read. Useful for a one-shot query, but not much more.
Meh, that's not a valid argument. What happens if I `select id, name from users` and later someone drops the name column? Similarly, if I `select * from users` and use name, the same problem/bug arises. Dropping a column without refactoring the application code is obviously a developer issue, not a SQL one.
You're right that it's a "dev/dba problem", but the reason for not using `select !<column>` is exactly the same as not using `select *`. If you specify exactly what you want, there's less of a chance of your ORM-based app breaking because it received an extra or too few columns. I'm sure most people who work with SQL DBs have been bitten by this before.
Also, while it could be useful, over the years I've moved away from things like using `!` to negate things. It makes code slightly harder to read and ruins its symmetry. And lord knows SQL doesn't need any more of that. It's probably the only language I use often where I have to place my commas at the start of the line, and where I need to add a first filter as `1 = 1` etc. just to make it easier to comment things out quickly.
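For anyone who hasn't seen those two patterns, a sketch (schema hypothetical):

    SELECT u.id
         , u.name                -- leading commas: lines can be commented out
         , u.email               -- or reordered without breaking the list
    FROM users u
    WHERE 1 = 1                  -- dummy predicate so every real filter starts with AND
      AND u.status = 'active'
      AND u.created_at >= '2023-01-01';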
What I personally want most is the ability to move the FROM clause to the top (this is a no-brainer) and better aliasing rules. The alias I used in my SELECT should be accessible everywhere in the statement.
Select * is something no one should be doing in real code.
Select [list of columns] should be the way to query; that way you know exactly what columns are read. That's one of the reasons I dislike ORMs like Hibernate.
I advocate the principle of least surprise; select * or select !column are doors to surprises. What if someone adds a geometry column with tons of data (a real example that happened to me)? You are going to read tons of data that you probably don't need.
The query itself does not matter. If I delete/rename a column the app WILL break no matter how much I follow best practices.
I don't see `select !name from users` as any worse than `select name from users`. In both scenarios the app will break if the name column is renamed/deleted.
To me the biggest thing against select * is readability. Basically, you are obfuscating what you are doing and what you are getting from the database. In trivial queries it doesn't matter, but when you read more complex procedures, it starts to be a bit annoying.
I think there are pros and cons to both approaches:
- Explicitly spelling out the column names has the advantage that you can search your code base for that name and see where it's used.
- When declaring a view which filters the rows in a table, using SELECT * would simplify maintenance, as you wouldn't need to update the view when new columns get added to the table.
But deleting/renaming columns are not the only operations, nor the most common. Adding columns is the most common, and there only one of those options breaks.
DROP TABLE IF EXISTS and CREATE TABLE IF NOT EXISTS are counterexamples to that statement. Notice that the suffix operator IS NULL is present in the expression grammar, but these statements do not use that kind of grammar.
I was talking about the original idea. Current SQL has evolved and in many ways is a monster, but multiple WHEREs instead of an AND operator make little sense to me.
Sure, multiple ORMs and query builders allow this, but when writing dynamic pure SQL it's annoying to have to check for some condition before knowing "should I use WHERE or AND next?". This is traditionally solved with a hard-coded WHERE 1 = 1 at the beginning of the query. Still, it's a wart, even though a small one.
Point being, there are many other small things like this that (can) make SQL a burden, e.g. the select all but <n+1> columns.
> why would WHERE clauses need to have a AND instead of multiple WHERE clauses
Semi-related: Python's comprehensions allow multiple ifs.
some_list = [x for x in things
             if x % 2 == 0
             if my_function(x) > 0
             if some_other_global_predicate()
             ...
             ]
And this is shown in the Python docs.
I've had PRs rejected for using this since they wanted "and". Drove me up a wall, as this IMO reads as easily if not more easily than the "and" version, and (again, IMO) is more for humans, where the "&&" is more for the computer.
> When working with big tables, its super annoying to select all but one column
What's the use case for that? In fact, I can hardly understand the benefit of projecting at all, except at the very end. Why would you want to do it? Is there a performance benefit?
I would love this addition in SQL. My work is largely exploratory with SQL. 99% of my queries will never get saved anywhere.
At the end of the day it's just syntactic sugar, but when writing reports you sometimes don't want to include a single column like a sequential key, because it's not relevant information to the end user or the exploration.
Another reason is that some text columns are long enough that they absolutely destroy readability in things like psql. Often I just want to get a feel for what's in the data, and removing those annoying text fields to keep psql readable is a pain. You have to run a `select * from table limit 0`, copy that into a text file, `s/ +\| +/, /g` the column list, then remove the annoying field... then throw it back into your query. Takes about 45 seconds to do all that, and that adds up!
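If you're on Postgres, one way to shorten that dance is to have the database build the column list for you (`big_text` here is a stand-in for the noisy field):

    -- produces a ready-to-paste column list, minus the annoying field
    SELECT string_agg(column_name::text, ', ' ORDER BY ordinal_position)
    FROM information_schema.columns
    WHERE table_name = 'my_table'
      AND column_name <> 'big_text';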
Right, I understand why you'd want to project at the very end of the query, i.e. suppress some columns from being shown. Other than that, I don't see the point, unless there's a performance impact. Keeping columns around doesn't actually change the result of the query in any way.
You've definitely missed the point, because you're concerned with the "very end of the query". I'm describing a reporting process and, more importantly as far as I'm concerned, an explorative process that's disrupted by the silly need to list out all columns. Performance is irrelevant.
Yes, I know I've missed the point. That's why I asked my question in the first place! My question was:
> What's the use case for that? In fact, I can hardly understand the benefit of projecting at all, except at the very end. Why would you want to do it? Is there a performance benefit?
and I meant it literally. It's not an oblique, rhetorical way of saying "you don't need to do this". It was a genuine question!
You've answered that it's convenient when doing exploratory work. Thank you. That helps me understand better.
Having written lots of advanced SQL (tons of CTEs, window functions, user defined aggregates, json manipulation, etc.) in my previous job, this looks really promising.
The main issue I encountered while playing with it is that the ordering semantics of transforms are unclear to me. This is a notorious footgun in SQL, which does not propagate row order from subqueries or CTEs into parent queries unless those sort conditions are restated. Hopefully PRQL can come up with a better answer to this problem: https://github.com/PRQL/prql/issues/1363
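A minimal sketch of the footgun, with hypothetical tables (the inner ORDER BY gives no guarantee about the final row order):

    SELECT o.id, o.total
    FROM (
      SELECT * FROM orders ORDER BY created_at   -- silently discarded...
    ) AS o
    JOIN customers c ON c.id = o.customer_id;    -- ...the outer query may return any order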
I'm not sure I'd call that a footgun -- it's a fundamental property of relational algebra that relations are unordered.
That said, it's not universally true that subquery ordering is not propagated to the final result; in the absence of an ORDER BY clause, the query engine will return the results in whatever order is most convenient for execution. For example, this query:
select *, rank() over (partition by lastname order by firstname)
from employees;
will definitely return a result set ordered by lastname,firstname even though it's not explicitly specified (tested on MS' azure sql database).
Very much agree! We're very friendly with the folks at EdgeDB, have lots of overlap of contributors. They're part of the inspiration behind PRQL [1].
To focus on the differences for a moment:
* PRQL is focused on analytical queries, whereas the genesis of EdgeDB was on transactional queries (though I know Yury & the team are building out more analytical capabilities)
* PRQL works with any existing database — you can query any tables you have today — and using PRQL doesn't preclude you or your colleagues from using SQL. In exchange for that openness & simplicity, we give up some functionality — we don't yet know the types of all your columns, for example. TBC, EdgeDB has improved on this dimension since their initial product — it's been able to run on a vanilla Postgres instance for a while now, no extensions needed.
Nice. Pipelines are a lot more intuitive than SQL, and the other added niceties were desperately overdue. Funny, considering SQL is supposed to be human-readable hehe. Feels a lot like the MongoDB aggregation pipeline, but with a more integrated syntax rather than its verbose chaining of JSON values.
A new language is a double-edged sword though; a special syntax will be much more difficult to compose from other programming languages without resorting to string formatting and all the injection vulnerabilities that opens up. With JSON objects or chainable functions it's easier to programmatically build queries.
What's the strategy for composability? And a related, maybe sometimes conflicting, requirement: is type safety with the host language possible?
SQL has many problems: "from" should come first to help auto-complete; null handling is complicated; the syntax in general is weird. Compatibility is a problem (I have implemented some relational databases in Java: Hypersonic SQL, H2; compatibility with other databases is hard).
There is little innovation in the database space: there are hundreds of general-purpose programming languages, but very few "database" languages. I also tried to specify a new language, https://newsql.sourceforge.net/ many years ago. There's GraphQL, but more innovation in this area would be great.
Many developers end up writing huge SQL statements (one statement that is hundreds of lines). You wouldn't do that in a "proper" programming language; but in languages like SQL (or maybe Excel) this will happen.
Another problem is proper encapsulation. Views can be used, but often developers have access to all tables, and if you have many components this becomes a mess. Regular programming languages have module systems and good encapsulation. SQL statements often don't use indexes, for one reason or another. With regular programming languages, it is harder to end up in this position (as you have to define the API properly: encapsulation). Regular programming languages don't require "query optimizers".
SQL is used to reduce network roundtrips. Clients send SQL (or GraphQL) statements to the server, which are actually small programs. Those programs are executed on the server, and the response is sent to the client. (Stored procedures and for GraphQL persisted queries can be used - but that's often happening afterwards.) Possibly it would be better to send small programs written in a "better" language (something like Lua) to the server?
> Regular programming languages don't require "query optimizers".
Isn't that what llvm does? Or GCC? Or a JIT?
> Possibly it would be better to send small programs written in a "better" language (something like Lua) to the server?
I'd certainly like something better than SQL, but going for an imperative, Turing complete language seems like the opposite direction of where things should be going in this scenario.
> Most tools allow you to jump down to low level stuff
Eeehhh, kinda. Sure, a compiler will let you write some assembly by hand, but it won't let you write an optimization step, or even directly configure it. You have to tweak your code or the compiler flags to give the right hints so the compiler generates the binary code you want. That doesn't look much different from a query planner.
To make a loose metaphor, if SQL is like high-level OpenGL, then I want the low-level Vulkan and shader APIs for hand optimization.
I would like this especially for Apache Spark SQL, since a bad plan that isn't aware of the data shape can blow gigabytes over the network unnecessarily.
Some databases like ArangoDB (https://www.arangodb.com/) allow you to use Javascript instead of SQL.
However, using a type-unsafe (read: weakly typed), Turing-complete language introduces the usual problems we know and love, such as infinite loops, runtime type errors, exceptions, and the like.
Personally, I'm looking forward to a WASM runtime for databases -- so we can run WebAssembly on the database. This COULD be carefully designed to be statically checked and, possibly, make it really hard to write runaway loops.
Many databases support other languages as well (e.g. PostgreSQL supports many, including Python, by default). One challenge is lack of standardization. (SQL is a weak standard, but at least a standard.)
Weak typing: what about TypeScript?
Slow loops: yes, this is a problem. However, SQL (and even more so, GraphQL) also has a problem of large results / operations spanning too many entries. During development, the number of entries is fine, but not in production. Especially if indexes are missing, this is also a problem in SQL. (Endless loops are actually less of a problem than slow loops: it's easier to detect endless loops during development.)
To process large results in a good way, pagination is often needed; best would be keyset pagination. What if a statement returned a "continuation" statement in addition to the result set? If the client wants more results (or to process more entries), it would run the "continuation" statement.
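Keyset pagination in today's SQL already looks a lot like that idea: the "continuation" is just the same statement with new boundary values. A sketch, assuming row-value comparison support (e.g. Postgres) and a hypothetical events table:

    SELECT id, created_at, payload
    FROM events
    WHERE (created_at, id) > (:last_created_at, :last_id)  -- boundary from the previous page
    ORDER BY created_at, id
    LIMIT 100;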
Say a database doesn't provide SQL, but instead a set of low-level APIs (plus a ton of tools). Developers would then (be forced to) write properly modularized, versioned code on top of those APIs.
These are all really great complaints; and yet I still prefer writing plain SQL to using ORMs and their nifty tricks. I think the main reason is that I don't ever want to have a lot of SQL logic in my code. Short queries (say, 10 lines or less) are okay, if they're one-off, pertain to the function's logic, and are never repeated by other pieces of code. But writing complex queries through an ORM layer means they're going to live in code -- and be harder to maintain -- than if they were where they should be: in views, functions and procedures. ORMs also obscure the execution plan and make optimization more difficult, which in production isn't something I really want to worry about.
Writing SQL is like uploading a program; just like a shader to a GPU. And if the network stopped being the bottleneck, the DB server's load would become the problem. Just like with a GPU, you want to upload your programs and data as infrequently as possible. But unlike a GPU, you have full flexibility to permanently store your logic right there in your SQL DB. If it's well-documented, it's much more streamlined to write views and views-on-views, than to keep all that DB logic in your code. I don't find that it becomes a mess at all, because properly crafted views and procedures strongly encourage forced indexing and checking execution plans when you refactor them; as opposed to in-code, ad-hoc queries where it's hard to know if someone ever ran tests to optimize them.
Some statements do need to be very large, for efficiency. Others are better being broken up into sub-procedures, temp tables or even roundtrips. A great query is a work of art where each part of it is optimized not just for execution plan but also for read/write, network state and expected load. What I like about SQL is the flexibility to decide which way to go, and prioritize based on how often something will be used and what kind of strain it'll place on each of the layers in the stack. As with anything powerful, there are a million ways it can be abused. But what I've found in the ORM space and also with noSQL solutions is that simplicity of use comes at the price of flexibility and control.
[edit]
>> null handling is complicated
This would probably also be my biggest complaint about the language itself; the existence of IS NULL / IS NOT NULL and both null-safe and -unsafe comparisons gets confusing (and especially painful if you're trying to write a parser). I think it's one of those things like `undefined` vs `null` vs `false` in Javascript where, on second thought, it's just too late to fix it without breaking half a billion websites.
> I don't find that it becomes a mess at all, because properly crafted views and procedures strongly encourage forced indexing and checking execution plans when you refactor them; as opposed to in-code, ad-hoc queries where it's hard to know if someone ever ran tests to optimize them.
I have to provide caution here, when you create views and stored procedures, I've always found it a mess to maintain:
1) VCS for stored procedures and views is usually non existent. Good luck understanding how these change over time and who changed them.
2) Deploying application logic alongside these systems is very painful, because unless you've versioned your views/stored procedures, when you make a breaking change you need to stop the entire application, make the change in the DB, and restart the entire application again on the new version. Zero-downtime deploys would require versioned views/stored procedures.
3) It quickly becomes a challenge to answer the question "what happens when I do X?" reliably, where X is insert, delete, or even select a row. Once you have complex enough stored procedures, modifying the data changes things in unexpected or unintended ways. A dev unaware of how the system works might assume that inserting a row simply inserts a row, but instead, due to stored procedures hidden from application logic, it might cascade into inserting rows elsewhere, deleting rows, or modifying other rows in other tables. Discovering these issues without knowing that they could exist is often done around midnight during a production outage, because a new feature was released that did something in the DB that was presumed safe, but wasn't. If the code for the business logic were in one place, the application, it would have been much easier to see what the change would actually do.
I understand entirely that performance gains from good DB use are astronomical, but in my humble experience, I've found them to be more trouble in the long run than they are worth.
> VCS for stored procedures and views is usually non existent
This is a mindset problem, not a technology problem. Treat your stored procedures/functions/views like code, not like data. Keep a separate "code" schema, put your procedures/functions/views definitions in files, and store your files in Git, just like Java or Ruby code. Deployment then becomes an automated drop and recreate of the code schema in a single transaction with zero downtime.
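A minimal sketch of that workflow, assuming Postgres/psql and a hypothetical `api` schema holding all code objects (file names are placeholders):

    -- deploy.sql, run via: psql --single-transaction -f deploy.sql
    DROP SCHEMA IF EXISTS api CASCADE;
    CREATE SCHEMA api;
    \i views/active_users.sql       -- each definition lives in Git like any other code
    \i functions/get_user.sql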
> Deploying application logic alongside these systems is very painful
This is not my experience at all. The stored procedures form an API. If changing the API breaks your application, you are doing something fundamentally wrong. API versioning should only be necessary if third parties depend on your API, but I wouldn't recommend giving third parties direct access to the database anyway.
> Once you have complex enough stored procedures, modifying the data changes things in unexpected or unintended ways
I assume you mean triggers because stored procedures don't happen as side effects, you have to call them explicitly. Regarding triggers, I agree with everything you say.
Re: VCS, I’ve found that with a small amount of setup, tools like Liquibase[0] allow you to maintain functions and views (with “CREATE OR REPLACE” or equivalent) as SQL files in a file system with full VCS, diffing, etc. just like any other code.
You still need to define your changes (migrations) sequentially, which isn't exactly the VCS experience people are used to. If all changes to your db require new file entries, rarely do you need to diff, let alone bisect, to find out the history. It does make it harder to find out the current shape of your app though, with all the ALTER statements to a single table strewn about multiple files. I wonder if it's feasible to write a tool that generates sequential migrations from DDL organized like traditional code (modules, with items dealing with the same domain in the same file) after diffing it with the sequential migrations already in the codebase.
No, you don't have to structure your project that way. You can have all your logic (views/functions) stay in a single file each, and use git as usual with the standard diff you're used to. Your changelog XML just needs to be set up for it.
So, after digging about in the docs, I found the Best Practices[0] page, which lays out what you describe. I was disappointed to find it is just using idempotent creation statements for procedures and views, putting them all in files that are re-run on each migration step (along with each new roll-forward migration) so that new versions of these logic objects replace any old ones. This is not exactly something that Liquibase provides, it should be easily replicable, and I was hoping it'd do diffs to provide a similar experience for tables as well.
After some web searching, I came across a project[1] that's trying to do what I describe, but it appears to be dead. I'm surprised that the migration tools I looked through don't already support such a basic approach; I suspect I'm missing something obvious, some fundamental block in how SQL works, or maybe it just doesn't work that well in practice. In concept, it sounds easy to script using what they call a shadow db and `pg_diff`.
Oh, ORMs are not the solution. I agree, ORMs often bring new issues (like, loading too many entries in memory), and don't solve many problems (encapsulation).
> I don't ever want to have a lot of SQL logic in my code.
That works if you are a small team. With big teams / multiple teams, the small problems become huge: someone / some team will write huge statements, will use weird syntax, and so on. (C++ has similar problems btw.) Especially the lack of proper encapsulation can become a huge problem.
> Some statements do need to be very large, for efficiency.
Well, then the problem is that SQL doesn't give you a good way to avoid large statements.
> there are a million ways it can be abused
Exactly. Other languages (like Java) make it harder to abuse the language. If you really want, you still can; but it's easier to refactor and maintain.
> Writing SQL is like uploading a program; just like a shader to a GPU... But unlike a GPU, you have full flexibility to permanently store your logic right there in your SQL DB.
I agree - this is a really interesting way to put this.
Here are little things that would make my life better when I'm generating SQL queries:
1. Support trailing commas.
2. Add an operator `is` which works like `==` but returns `true` for nulls. Automatic type casting could also be removed; I'd be happy about that, never liked it.
3. Support trailing `or`-s and `and`-s.
So a query could look like:
select a, b,
from t
where (a.x is :x and a.y is :y and) or
(b.x is :x and) or
As an aside, avoid OR as much as possible in your WHERE clauses. Adjust your relations if necessary. ORs can really blow up the planner and slow things down. ANDs are reductive w.r.t. indexes; ORs make multiple passes over the indexes necessary.
It's not always possible to avoid, and it makes perfect sense sometimes, but it'd be better if folks didn't lean into OR so much.
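For reference, the classic rewrite turns the OR into two index-friendly passes (hypothetical schema; UNION rather than UNION ALL so rows matching both branches aren't duplicated):

    SELECT * FROM orders WHERE customer_id = 42
    UNION
    SELECT * FROM orders WHERE salesperson_id = 7;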
So that you can generate SQL more easily, and (for hand-written statements) so that you can re-order lines more easily. Many programming languages support trailing commas.
> There is little innovation in the database space: there are hundreds of general-purpose programming languages, but very few "database" languages.
How many of those are "innovative"? How many of those have an ecosystem around them that can be taken seriously?
There's maybe one innovative language per decade. Minor improvements to an existing language (with a load of unnecessary syntax changes) don't make up for programming in a barren ecosystem. And a bunch of wrappers around C libraries don't make an ecosystem, they don't allow the language to materialize any actual new paradigms and just result in a C dialect.
In programming languages we have imperative and functional languages, type-safe / untyped, memory managed / unmanaged, low-level vs high level, interpreted / JIT / ahead-of-time. So much innovation! And SQL is almost 50 years old, and basically still the same (plus adding features). And now maybe GraphQL. Very little innovation.
I may be in a minority, but I like the projection list being first in simple queries. To help with auto-complete, just type the other parts first, then fill in the select list after? This falls apart with CTEs though, as they have to come first syntactically, so now that they exist, maybe the end is the best place for the output list so it is easy to find in a complex statement.
> Many developers end up writing huge SQL statements
Composability can definitely be a significant problem with SQL. There are structures that can help significantly (views, CTEs, etc.), but they all have either potential maintainability issues or potential performance issues.
Partly, from a performance PoV, this is an issue with the query optimisers. On the maintainability matter I think the syntax for CTEs and sub-queries generally could have been much better thought out – I've seen a few attempts to beautify (or replace) SQL that essentially amount to moving parts around so you can make CTEs & sub-queries look more like functions/procedures in imperative languages (transpiling the result back to SQL for submission to the database after).
> Regular programming languages don't require "query optimizers".
Because regular programming languages are “lower level” in that regard and require you to do that optimisation yourself. The point of a language like SQL is that you describe the shape and content of what you want and don't have to worry about the most efficient way to retrieve that information.
And come to think of it, many regular languages do have optimisers. Think of all the work a compiler is doing or the JIT compiler in JS engines and their ilk – they don't just blindly interpret each line of C/other into a distinct selection of CPU instructions. Consider declarative frameworks for those languages too, like many used for SPAs which deal with DOM manipulation optimisation for you, and the many attempts at no-code or low-code development solutions which seek to abstract such concerns away from the solving of problems that require programming (by doing what SQL attempts: trying to translate what you want, stated as simply as possible, into what is needed to do that efficiently).
> regular programming languages are “lower level” in that regard and require you to do that optimisation yourself. The point of a language like SQL is that you describe the shape and content of what you want and don't have to worry about the most efficient way to retrieve that information.
Yes! But I wonder: wouldn't it be better if the developer has to specify how he wants to retrieve the information? In a regular programming language, typically people don't complain too much about this "missing feature" of a query optimizer. It's about modularization / abstraction. In SQL, the data model is king; in a regular programming language, the API is king. Yes, compilers have optimizers, but they optimize the machine instructions, not (so much) the algorithms that are used (scan vs. using an index / which index to use).
Maybe I have just seen too many cases where developers didn't create the necessary indexes, or databases didn't use the (obviously) best indexes, or developers writing 10,000-line SQL statements (which I then re-wrote in Visual Basic, and the resulting program was much faster). Or a 1000-line GraphQL query that returns 50 MB of JSON. Yes, I have also seen Java developers writing exponential algorithms or worse, and running into memory issues - but I personally found those easier to analyze and optimize than SQL. Maybe I have some bias though.
> Yes! But I wonder: wouldn't it be better if the developer has to specify how he wants to retrieve the information? In a regular programming language, typically people don't complain too much about this "missing feature" of a query optimizer.
On the other hand, as you noted, SQL is forever while programs and programming languages come and go. Maybe having a declarative data mindset is appropriate since data is more valuable and durable than code.
> > regular programming languages are “lower level” in that regard and require you to do that optimisation yourself.
> Yes! But I wonder: wouldn't it be better if the developer has to specify how he wants to retrieve the information?
I'd be against making SQL lower level in that sense – part of its purpose is to be back-end agnostic, and once you start letting the dev get into too much detail you are locking code to one implementation. Of course this isn't how things generally work anyway beyond the basics, with engine-specific hints and such, but let us not intentionally make that more of an issue!
Though if we do go deliberately back-end specific, perhaps a way of describing a query plan more directly and saying “use this” is what you are wanting. Something like plan forcing in SQL Server's query store, but with you actually dictating the plan not picking one the engine has created from SQL. That way SQL stays purer but you have the option of something more direct, like mixing bits of (obviously platform specific) assembler into your C or other higher-level code. There will still be terrible code created that way though – possibly much worse.
Low-level languages can be backend agnostic. For example, C or Rust allow writing low-level code that is back-end agnostic and doesn't cause vendor or technology lock-in.
What I'd like to have are guarantees similar to what an RTOS (real-time operating system) has. Maybe using hints, or something like that. This could be standardized. Languages like C or Rust have such guarantees naturally: statements have some kind of guaranteed time constraints (that's why RTOSes can be mostly written in C). Sure, loops can be endless, but loops, e.g. in Rust, can't suddenly become 1 million times slower because the backend changed.
> But I wonder: wouldn't it be better if the developer has to specify how he wants to retrieve the information? In a regular programming language, typically people don't complain too much about this "missing feature" of a query optimizer.
Why not use a key–value store or just a file system, if you want to write everything yourself?
I don't want to write everything myself. (Well, I did write 3 relational database engines, and 4 query optimizers... Hypersonic SQL, PointBase Micro, H2 database, and the Apache Jackrabbit Oak query engine.)
I want a good abstraction (API) that allows writing programs that access and manipulate data in an efficient and simple way. But I argue it shouldn't be a declarative set-manipulation language like SQL, but a language that is more low-level and makes it hard to (by mistake) retrieve or update millions of rows, hard to write statements that are very inefficient, and hard to build systems that lack encapsulation, where you end up with hundreds of tables entangled in a big mess.
I do like the fact that network roundtrips can be reduced: you can send SQL (or GraphQL) statements from the client to the server, where the statements are processed and the result is returned. But it's too easy for the result to become large, or for statements to be too slow to process. With a key-value store (like Redis) you can't do that. Or can you send a list of commands to Redis (or some other key-value store) that are executed in a safe and efficient way (with guaranteed runtime) and have the result returned? That would be what I'm looking for.
Existing query optimisers are very weak. You can write multiple semantically equivalent queries but some will be orders of magnitude slower. I would pay good money for a Postgres that I could tell to spend several minutes analysing a query before caching the query plan and reusing it for the future, instead of having to spend programmer time tweaking things.
Query plans are data-dependent. MSSQL does cache query plans in some cases, and it's a major caveat when using e.g. stored procedures (you have to make sure a good-enough plan is cached).
What I’m saying is the plans have a tiny budget devoted to their optimisation. For something like a view, I would happily let the optimiser run for hours finding a speedup if it were then reliably delivered. As it is, none of that happens and you often have to rewrite to a functionally identical but faster query. The furthest you can go on something like Postgres is tweaking some config and turning off stuff like GEQO in favour of brute force etc.
This is the limiting factor there. The best plan is going to vary depending on data patterns and (if variables are involved) input parameters. You would have to spend a potentially infinite amount of time testing for different patterns and store which plan is best for each, or redo the analysis for every possible parameter each time the referenced data changes.
> plans have a tiny budget devoted to their optimisation
The reason planners get minimal time for planning is that in the general case you quickly hit the point of diminishing returns. Their job is to find something good enough, and find it quickly. They are helped by pre-computed hints like index stats, so you could perhaps spend more time building other helpful pre-computed data, but you would need to reassess this data regularly (as index stats are reconsidered under certain conditions), and again you hit a point of diminishing returns in the amount of time spent on this as data is modified, compared to the potential benefits from better plans.
> For something like a view
Databases do not create query plans for views when the view is created. They create query plans for the overall statements that use the views when those statements are executed (or choose a cached one if the statement is sufficiently similar to a previous one for which there is a cached plan).
This is a very pessimistic argument, undermined by the fact that every single day, humans rewrite queries - in the supposedly declarative language SQL - to be identical but faster. I want a query planner that can be told to go away and try to do better with more resources. I don't know why this is unthinkable.
Yes, just like in browsers where one API is faster in Chrome and slower in Firefox or vice versa. Then Safari drops in and completely upends the notion of "best strategy".
The underlying implementation will always affect top level access performance. "Best" will always be implementation dependent.
That said, the DBs I use most (Postgres, MySQL, SQLite, and their derivatives) have pretty consistently improved performance from version to version with surprisingly few regressions.
Also there have been more "knobs and dials" added to get what you want.
WITH foo AS MATERIALIZED (…)
springs immediately to mind. The ability to set or remove an optimization fence does wonders while still maintaining SQL's declarative roots.
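For anyone who hasn't used those knobs, a sketch (Postgres 12+ syntax, hypothetical tables):

    WITH recent AS MATERIALIZED (   -- keep the fence: computed once, no predicate pushdown
      SELECT * FROM events WHERE ts > now() - interval '1 day'
    )
    SELECT * FROM recent WHERE user_id = 42;

    -- or "AS NOT MATERIALIZED" to remove the fence and let the planner inline it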
> SQL has many problems:
> "from" should come first to help auto-complete;
Agreed
> null handling is complicated;
Mostly the issue is that null in DBs has different semantics than in most programming languages. I wish programming languages used the DB style of handling nulls. In DBs, null means that the value is unknown and can't be reasoned about. The DB uses nulls the way people reason in arguments: "Joe's age is unknown, Jane is 25 years old. Are they the same age? We can't say whether their ages are equal or unequal, because we don't know."
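A quick illustration of that three-valued logic (these run as-is on e.g. Postgres):

    SELECT NULL = 25;     -- NULL (unknown), not FALSE
    SELECT NULL <> 25;    -- NULL (unknown), not TRUE
    SELECT NULL IS NULL;  -- TRUE: IS is the null-safe test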
> the syntax in general is weird.
I find the syntax to be like any switch between languages. Python is weird to C++. At least with SQL, I find the usage docs more understandable than those of many programming languages.
> Compatibility is a problem
This sounds like a complaint that two different programming languages don't provide the same features. Or that two operating systems have different UI toolkits (and different menu conventions, like where 'preferences' should be).
I wouldn't expect sqlite to be Oracle / MSSQL / PGSQL compatible.
Not really. A null value in a certain column does not indicate that the field is optional; instead, the table definition having a nullable column indicates that the field is optional. A null value in a nullable column indicates only that the value is unknown, nothing else. This is database modeling 101:
A table ("relation") describes one entity. The attributes of the entity are reflected in the table column set, and each table row represents a different instance of said entity. Marking a column as nullable means that not knowing that attribute is a recoverable situation (for example, because it is only stored for informational purposes and not for processing).
If you're using the same table to store multiple entities, using nullable columns to indicate if a row represents entity A or entity B, you're doing it wrong.
An optional value just means that the value is unknown.
For example, it is optional to enter the colour of the car I own. So my car colour is null. The database can't confirm if my car is Red or isn't Red, because it is an unknown value.
> There is little innovation in the database space: there are hundreds of general-purpose programming languages, but very few "database" languages.
That's one way to look at it, but another way is to consider that, as a sort of assembly, SQL is as good a language to express relational algebra as any, given that many devs don't usually write a lot of raw SQL themselves.
When you look at the ORM and query builder space, there are a lot of different solutions - you don't have to like them (for example, I don't generally like ORMs), but they definitely exist, and they have all sorts of capabilities for type-safety, composability, and so on.
Actually, SQL is used to hide DBMS implementation complexity and diversity. It was a great abstraction when RDBMS tech progressed rapidly. Now it has matured to stability, and it makes sense to try a better language.
Yes, it's also used to hide implementation complexity and diversity: the advantage of a standard (even a "weak" one).
> SQL is used to reduce network roundtrips.
What I mean is: the database has an API that allows the client to specify (almost) arbitrary complex queries and operations. Regular REST APIs don't support this. GraphQL also supports it. The client sending queries to the database is a great way to reduce network roundtrips. Key-value stores that only support get/put/remove/list require a lot more network roundtrips for slightly complicated operations.
> it makes sense to try a better language.
I argue that the language itself (the syntax) isn't the main problem: part of the problem might be that SQL (even with a better syntax) would _still_ have very similar problems: missing indexes / wrong query plans, missing encapsulation which results in a complex mess of entangled code, things that work fine with small data sets but are extremely slow with large data sets,...
Right, my bad, I was thinking of storage engines in general. And of course maturity doesn't mean lack of innovation, which is still happening in the relational space.
At least with the substrait[0] project there is some push for standardization & interoperability when it comes to relational algebra and query plans.
I think this could unlock a lot more experimentation with SQL-alternative "languages", or approaches where plain-text languages are skipped altogether and replaced by good libraries that act directly on query plans.
Well, I'd argue that most of the time SQL is treated as a glorified CSV.
No-one really looks at it as a real language, which it actually is.
As for optimization, I'd say that it might be too efficient for its own good, as now one can have one's ORM generate a huge request and then complain to the DB if the SQL is slow.
If you write in any other language, you usually don't complain that the processor is badly optimizing the asm execution; you blame the compiler. ORMs, on the other hand, are mostly never blamed, just pitied if they fail.
Having Wasm inside the RDBMS might be a much better approach. But then it will be up to the wasm program to be efficient.
> There is little innovation in the database space: there are hundreds of general-purpose programming languages, but very few "database" languages.
What we have is generally good enough for the purposes of set manipulation. SQL is not a programming language, and should not be approached as a programming language. It's a way to express relational algebra.
This is similar to complaining about any other system of algebra having issues... and wanting to come up with a New Way. We don't really need a new way to describe linear algebra; what we have works. And it may seem unintelligible and a mess from the outside, but if you take the time to learn it, you'll understand why and how it's laid out the way it is, and learn to get over it.
The other point being, there are a set of operations you can do on sets. SQL is as low level as you're going to get in regards to that. Any new language would either: change the labeling of the underlying operations (e.g. changing joins as a label for Cartesian products to some other label), abstract away the low-level operations into something completely else (foolish, in my view), or tweak it in minor ways to suit the preferences of the individual (again, not worth the effort in my view; just learn the syntax). All of these are inefficient; and all endeavors to make it more efficient have failed (as far as I can tell) -- primarily because the people working on these projects (such as yourself) do not have sufficient maturity in this space to understand why things are done the way they are, i.e. "Chesterton's Fence."
> Many developers end up writing huge SQL statements (one statement that is hundreds of lines). You wouldn't do that in a "proper" programming language; but in languages like SQL (or maybe Excel) this will happen.
Programming languages and set manipulation languages are incomparable. The first is a language for describing procedural instructions and the second is a way to describe set transformations. It's not wise to apply the same set of standards to both. It's like complaining a proof is tediously long when the actual underlying operations are simple -- you're missing the point. SQL has to be long and detailed when you want to be sure the data you're receiving is exactly the way you specified it to be (i.e. no compiler shenanigans where it turns your rough intent into concrete instructions).
However, I will concede that I've seen some monstrous SQL in the wild... mostly written by people who don't really know SQL... and which could've been greatly shortened by knowing the little tricks that are database-specific (and similarly, I've seen enough people request a dataset to do processing on the server-side in herculean efforts, which could've been done faster and written quicker in the database).
> Another problem is proper encapsulation. Views can be used, but often developers have access to all tables, and if you have many components this becomes a mess. Regular programming languages have module system and good encapsulation.
I don't understand. Permissions can be tweaked however which way you want. Am I missing something?
> SQL statements often don't use indexes, for one reason or the other. With regular programming languages, it is harder to end up in this position (as you have to define the API properly: encapsulation). Regular programming languages don't require "query optimizers".
This is a trivial problem; but if you don't know why it happens, I can understand being befuddled by it. Query optimizers are involved so your access to the underlying data -- and the manipulations on it -- are done in the least costly fashion (i.e. in a rough sense to reduce the algorithmic complexity of your SQL to its least possible complexity) in regards to the set of hardware your cluster is running on, and the various access statistics involved. Regular programming languages do not have "query optimizers" -- but various libraries do have their own optimizations to reach the same end (as do compilers... compilers are nothing but optimizations upon optimizations in the same vein).
> SQL is used to reduce network roundtrips. Clients send SQL (or GraphQL) statements to the server, which are actually small programs. Those programs are executed on the server, and the response is sent to the client. (Stored procedures and for GraphQL persisted queries can be used - but that's often happening afterwards.) Possibly it would be better to send small programs written in a "better" language (something like Lua) to the server?
It would be very inefficient for general purpose. For most cases, the query optimizer can take your SQL and bring back your result sets in the most efficient way possible (in relation to the SQL you've written).
If you want to do the underlying operations yourself, instead of relying on the query optimizer, then you can use a compiled language like C and write your own extensions -- though it would take much more effort to do it properly than simply learning how your database works, and how to work with the query optimizer (rather than against).
SQL is a programming language. It's simply one of, if not the only, successful fourth-generation programming language in a world replete with third-generation ones.
The error is confounding a fourth generation DSL for set theory with a third generation general purpose application programming language. There's a really good reason ORMs have their impedance mismatch with relational databases.
Aside from this minor nit, your comment is spot on!
Well, that can be said about any programming language (assembler, C, ...), and yet there is still innovation in that area. And that's good. But in the database space, there is little innovation. And no, I don't think what we have (SQL, GraphQL) is good enough.
> SQL is not a programming language
Yes and no. SQL is an API that allows changing the state. But (same as programming languages such as C) it has advantages and disadvantages.
> if you take the time to learn it
Again, that can be said about assembler as well :-) Or Excel. You can do a lot in Excel.
> SQL is as low level as you're going to get in regards to that.
Absolutely not: SQL is a high-level language. And I argue you don't need a declarative set-manipulation language at all to store and retrieve data. Computers are not set-manipulation machines: they are instruction-based machines. And the popular programming languages reflect that: the most popular languages (Python, C, C++, Java, C#, Visual Basic, JavaScript) are _all_ imperative languages, which is close to how computers operate. And that's relatively easy for people (developers) to understand. Functional languages are not that popular. So SQL is an exception here.
> SQL has to be long and detailed
I argue the reason is that SQL is not a very good abstraction, if you need extremely large statements that are hard to understand and almost impossible to maintain. Same for Excel.
> I will concede that I've seen some monstrous SQL in the wild... mostly written by people who don't really know SQL
Yes. I argue that SQL _favors_ (or at least allows for) writing complicated, messy code that lacks encapsulation. In a larger company, some percentage of people will not be all that great, and you don't want to pick a language that will easily result in a complete mess. Languages such as Java favor modular, well-encapsulated code: it is hard to make a complete mess in Java.
> Permissions can be tweaked
Permissions are not an encapsulation mechanism in the way "private" fields in Java / C++ are, protecting internals from being accessed by other code.
>> SQL statements often don't use indexes
> This is a trivial problem
Well, for the experts, yes. But not for the novice programmer. And you will have some percentage of novice programmers. In a larger shop, you typically (still) need a "database administrator" or a group of (experienced) people to ensure indexes are created correctly.
> various libraries do have their own optimizations
Libraries typically don't have optimizers. Compilers do, yes, but they are not picking indexes or the like, which can make runtime a million times slower or faster. Compilers can typically speed things up by a factor of 10 at most (a constant factor). Query optimizers are very different. (Again, I wrote 3 relational databases and 4 query optimizers.)
At first glance this seems more confusing, particularly the grouping/aggregation syntax, though I suppose that's something I'd just get used to. Some of the syntactic sugar is nice, but some things are also unlike SQL for no apparent reason which just makes adoption harder than necessary (join syntax for example).
IMO the main selling point would be the "database agnostic" part, but I already achieve that through SQLAlchemy Core and/or a warehouse layer like https://github.com/totalhack/zillion (disclaimer: I'm the author and this is alpha-level stuff, though I use it regularly). It seems like many newer DB technologies/services I'd want to use either speak PostgreSQL or MySQL wire protocol anyway.
The roadmap is worth a read, as it notes some limitations and expected challenges supporting the wide variety of DBMS features and syntax. That said, I can see where this might be useful in the cases where I do have to jump into direct SQL, but want the flexibility to easily switch the back end DB for that code -- that's assuming it can cover the use cases that forced me to write direct SQL in the first place though.
This is getting transpiled to SQL, right? So I still have to understand how the (now generated) SQL will perform on my data and DBMS, plus I got to learn this new syntax. This will be a hard sell.
If that were the end of the story, no transpiled language could ever succeed. But they sometimes do. One could say the very same about e.g. Markdown, which typically renders as HTML, so you still have to understand HTML to a degree. And in the right environment (e.g. GitHub repo readmes) you can always fall back to HTML in places where the transpiler support is lacking.
The great thing about transpilers is when they 1) produce tractable, nicely formatted output and 2) you get an 'escape hatch' so you can fall back on the transpilation target. Because then you can always 1) check the output to understand details of an unfamiliar construct, 2) just use the target language in cases where you want to copy-paste a solution written in the target, and 3) just opt out of using the transpiler by using the code it produced as a starting point for your new implementation.
TypeScript doesn't introduce new runtime semantics in practice, in 99% of the cases the generated JS is your TS code with types erased. There're no magic keywords that expand into pages of generated JS.
They implemented the `x.y?.z` syntax pretty long before JS did, so that was at least transpiled for a while. I'll bet there are more features like that.
Not exactly. Just like with regular code, most of the stuff done on a database is pretty trivial and not resource-heavy. And the things that are really performance-critical are usually crafted very differently even in SQL.
PRQL is not aiming to be an ORM or data-layer replacement; it is focusing on a specific use case, viz. making it simple and ergonomic to write *Analytical Queries*. See the FAQ [0] for more on this.
In most cases you want to be able to interactively do some data exploration and craft your query as you go - the sequential flow of PRQL is great for this, as you can simply add lines as the query evolves.
For most people, the RDBMS query optimiser will do a good job of turning the resulting SQL into an optimised query plan. If you need to hand-craft the production SQL for the last few percent of performance, then PRQL gives you full access to the SQL to do that. You will probably still have saved yourself development time by generating the SQL from PRQL in the first place, though.
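As an illustration, a small pipeline in the spirit of the PRQL docs (the `employees` table and its columns are invented here); each line can simply be appended as the exploration evolves:

    from employees
    filter country == "USA"                       # restrict rows early
    derive [gross_salary = salary + payroll_tax]  # add a computed column
    sort [-gross_salary]                          # minus prefix = descending
    take 10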
If you're spoiled, your friendly neighbourhood DBA will help you there.
The problem I have with these tools is that you then have to reincorporate their optimizations in such a way that the transpiled SQL is identical. If you have to resort to an ORM expression API or raw SQL you gain nothing and are arguably in a worse situation.
SQL query optimisation has been studied since the days IBMers were competing against Quel. If the transpiled SQL has sensible optimisations, performance could be equal to or even faster than hand-written SQL. I don't see how this differs from a "language extension" that adds a bit of Pythonic flavour to SQL, which IMO is the right step forward.
I really really want this to take off. SQL sucks. No, I do not care if relational algebra is beautiful, or simple, or represents some other mathematical purity which makes it the best way to represent data. SQL, as implemented, is a rough language which makes too many things difficult.
Unfortunately, it is really difficult for me to lean on this for anything “real” without more years of battle testing. Queries have to be right. I do not want something to blow up one day because the translation hit a buggy edge case. SQL has many flaws, but everyone knows it. Nobody ever got fired for writing SQL.
SQL does have faults but PRQL is not fixing any of those and instead is just a python-esque version that retains all the SQL faults and introduces additional complexity without providing much in return.
Main complaints about SQL are that SELECT comes before FROM and that there are too many dialects.
The SELECT complaint is only partially valid when writing the query, but SELECT-first is much nicer/easier to read than the FROM-first options presented elsewhere. The same issue plagues the majority of programming languages, most of which optimise for terse syntax that is easy to write but painful to read, even though reading is most of what people do when working in a programming language.
Dialects is a silly complaint. Programmers treat databases as an interchangeable back end instead of as the foundation of the data storage. No-one ever thinks to complain that you can't swap C for Python or Java without a major rewrite, yet those languages are nothing but glue between database and UI that SHOULD be interchangeable at any time.
To my untrained eyes, SQL did not appear to have things like windowing or hierarchical relationships (trees, graphs) in mind when the language was initially designed.
Support for these things has been wedged into the language by numerous additions, and I think the fundamental syntactical framework of SQL is cracking under the pressure. Granted, it's been a while since I've had to traverse a tree using SQL, but I do recall that the query was very painful to write, and even more painful to execute efficiently.
Also, these things only work if you've wedged your data model into rows and columns. (or you use non-standard language extensions like jsonb)
PRQL, on the other hand, rearranges the syntactical structure to be more extensible, and also agnostic over the underlying shape of the data model. It works the same regardless of whether you're using an RDBMS or a graph/document database.
Is the result more complicated? IMO, only in the simplest examples. PRQL shines vs SQL when you have multiple stages of aggregation, like pre- and post-aggregation filtering, or referencing synthesized data, etc.
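For example, a sketch in PRQL (invented table and column names, syntax per the current docs):

    from employees
    filter start_date > @2020-01-01              # pre-aggregation filter
    group [country] (
      aggregate [avg_salary = average salary]
    )
    filter avg_salary > 100000                   # post-aggregation filter, no HAVING needed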
A recursive CTE is how graph/tree queries work in modern SQL (assuming you aren't using something like ltree [0]). The first part of the recursive CTE identifies the root record, and then the UNION ALL adds the subsequent items, referencing what came before.
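A minimal sketch of that shape, with invented table and column names:

    WITH RECURSIVE tree AS (
        -- anchor: identify the root record
        SELECT id, parent_id, name
        FROM nodes
        WHERE parent_id IS NULL
        UNION ALL
        -- recursive step: pull in children of the rows found so far
        SELECT n.id, n.parent_id, n.name
        FROM nodes n
        JOIN tree t ON n.parent_id = t.id
    )
    SELECT * FROM tree;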
Is it as easy as a Cypher query in Neo4j? Not at all. But it is arguably more flexible in that you can represent far more data structures and link to non-graph data easily.
I just unwrapped nested JSON columns into a flat key-value set with it last week. It's deceptively powerful.
> No-one ever thinks to complain that you can't swap C for Python or Java without a major rewrite, yet those languages are nothing but glue between database and UI that SHOULD be interchangeable at any time.
One of the main benefits of Python is that you can easily swap it for C. 95% of the stuff I do in Python every day is only possible due to the existence of numpy.
And in general programmers care about interop a lot. Interop with C was, obviously, one of the main points of C++. One of the benefits of JVM languages like Kotlin or Scala is simple bidirectional interop with Java.
Docker became so popular because you can simply containerize your app and run it anywhere without being tied down to specific requirements of your environment.
Yeah, SQL has plenty of flaws but PRQL does not seem to be fixing any of those and instead seems to be introducing new issues, e.g. what is up with 'derive country_code = s"LEFT(country, 2)"'.
This doesn't improve SQL for me at all. How exactly is this better than SQL? It still uses variables before they are even declared, so the code flow is still all over the place. There still seems to be no real "coding" possible (function declarations, reuse, typed variables).
How can you possibly say that this is any better than SQL? SQL is bad but this doesn't improve it.
What really does feel better is KQL (Kusto Query Language), but unfortunately it's proprietary and only for one specific database.
I am a simple man, and I do not think of SQL in terms of relational algebra. I was inelegantly attempting to cut off the SQL proponents who will say that SQL is fine because relational algebra is such a beautiful model.
Yes, SQL can do everything. That being said, I wish the syntax were significantly cleaner and composable.
For an alternative, GraphQL-like take on a related problem ("efficiently query any combination of data sources, from APIs to DBs to raw files"), check out the Trustfall language mentioned in the sibling comment: https://news.ycombinator.com/item?id=34182125
It's used in multiple real-world applications, including a tool that is slated to be merged into cargo, Rust's built-in package manager (cargo-semver-checks, a semver linter).
SQL is not relational algebra. It's relational algebra, and various relational calculi, mixed up in an ugly way because they tried to make it look like natural language.
That PRQL is not composable makes it -- so far as I can see -- mostly an alternative syntax.
What I’d personally really appreciate is the ability to split queries up into individually testable and reusable parts. I think that’d be game changing because suddenly large SQL code bases become manageable.
You might enjoy the FunSQL library; SQL verbs are represented using regular Julia objects and you pipeline them together to build bigger queries. A key difference to other SQL builders is that you assemble a tree of "tabular operations" instead of a SQL syntax tree.
IMO this will only appeal to people who don't like SQL or who have a narrow set of problems to solve.
First of all, the number of different dialects is a non-issue because projects that have to send the same queries to different databases are very rare.
Second, in my world where performance matters, you need to hand optimize queries. This is done quite specifically based on the database load and size, let alone specific quirks and strengths of the actual DB product. Something like this would only get in the way.
Third, even if all of these aren't important to you, and you just want to avoid SQL because you don't like it, there are plenty of ORMs in circulation. Many of them allow you to specify queries in a fluent fashion.
Finally, I think the cardinal issue is expecting that inventing a new dialect will change anything -- imagine the PRQL that compiles to a complex, hundred-line SQL query: would it necessarily be more readable? I doubt it.
It's also rare to switch database vendors mid-project; that only happens under unusual constraints, or if you have too much VC money.
And you should use your database to its full capabilities, including vendor-specific queries, otherwise you are just wasting computation. Because of this I don't even use query builders anymore, just raw SQL.
"projects that have to send the same queries to different databases are very rare." That's true in business certainly but once you head into academia or are working with medical data / the public sector you often find yourself having to write queries against databases from different vendors to construct datasets, which gets annoying fast. I used to work as an RA in an economic modelling group and hit this issue a few times, so I can see the use of something like this in that niche.
I was dealing with a tricky SQL problem yesterday: take a table of file paths and return a JSON tree with `name` and `children` keys. It definitely requires a recursive CTE, but I ultimately gave up and implemented the transformation in the host language.
It was the first thing that I tried in the playground, since it was top of mind, but unfortunately it doesn't seem like the language supports recursive queries (yet). https://github.com/PRQL/prql/issues/407
I wrestled with a graph traversal problem and Postgres docs for a hot minute before realizing that the "recursive" working set can only access the results of the previous iteration. With that in mind, I took a crack at your problem with the following approach:
1. Build the JSON from the bottom up starting at the deepest path.
2. At each step, handle two cases: a) build the parent of the working set, and b) collect any sibling paths at the same depth.
3. The last step aggregates all children together since all paths descend from '/'.
There was a fair bit of trickiness. I used a CTE in the recursive step to reference the recursive table multiple times and to enable aggregation in the recursive step. I used a few helper functions for my own sanity. A simplified sketch of the pattern is below.
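To illustrate the pattern (not the actual query), here's a stripped-down version that rolls up sizes instead of building JSON; the schema, names, and data are all invented for the sketch (PostgreSQL dialect):

    WITH RECURSIVE
    nodes(id, parent_id, depth, size) AS (
        VALUES (1, NULL::int, 0, 0),
               (2, 1, 1, 10),
               (3, 1, 1, 20),
               (4, 2, 2, 5)
    ),
    walk(id, depth, total) AS (
        -- step 1: start at the deepest level
        SELECT id, depth, size
        FROM nodes
        WHERE depth = (SELECT max(depth) FROM nodes)
        UNION ALL
        (
            -- the inner CTE is what lets the recursive step reference
            -- the working set more than once and aggregate over it
            WITH prev AS (SELECT * FROM walk)
            -- step 2: move up one level; a node's total is its own size
            -- plus its children's totals, and childless siblings at this
            -- depth are carried along with a zero child sum
            SELECT n.id,
                   n.depth,
                   n.size + coalesce((SELECT sum(p.total)
                                      FROM prev p
                                      JOIN nodes c ON c.id = p.id
                                                  AND c.parent_id = n.id), 0)
            FROM nodes n
            WHERE n.depth = (SELECT min(depth) FROM prev) - 1
        )
    )
    -- step 3: by depth 0, everything under the root has been aggregated
    SELECT total FROM walk WHERE depth = 0;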
I think EdgeQL (https://www.edgedb.com/docs/edgeql/index) is a much better SQL replacement, because it is composable, and there is already a working DB using it.
What a non-engaging website. I can't see an example. Don't make me install anything. Don't point at a blog which rehashes SQL's negatives. Don't give me a tutorial that shows for ten types how to have a query return a constant. Especially don't show me queries that should return a constant but cause an error because of some dumb syntactic problem! Show me what it looks like and why it's better than SQL.
Indeed, that would have been a better link. A bit hard to guess. The linked page has "get started", "overview", "learn", and "blog". I expected to find it somewhere there. My bad, but I never take much time to trawl the whole site, and I think many with me.
The pipeline-style syntax is super cool! I wonder what the author's view is on Kusto Query Language [0], which uses a similar thread/pipeline style of syntax.
SQL standards are a weird thing. Rather than wait for a spec and then get implementers like in the browser space (for the most part), SQL vendors will solve a particular customer's problem, submit their solution to the standards body, the committee will tweak the syntax for best coherence with the existing spec, and then other implementers may emerge.
This ends up with four slightly different syntaxes and the one that matches the standard wasn't even the one that proposed the feature in the first place! It is what it is. Bottom line: "close enough" is as good as it gets in the database arena, so choose your implementation wisely and go all in.
With that out of the way:
• Multiple vendors support RETURNING from both INSERT and UPDATE statements.
• Upserts exist with most vendors, typically via the ON CONFLICT DO variant (example after this list).
• Most support MERGE now in the latest versions.
• I don't know of any popular database that lacks support for INFORMATION_SCHEMA queries, so listing all tables is in fact both in the standard and supported by all popular implementations.
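To make the first two bullets concrete, a minimal sketch in the PostgreSQL dialect (table and column names invented):

    -- an upsert that also hands back the affected row via RETURNING
    INSERT INTO users (id, name)
    VALUES (1, 'alice')
    ON CONFLICT (id) DO UPDATE
        SET name = EXCLUDED.name
    RETURNING id, name;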
- PRQL is focused on analytical queries and it does not support DML (yet at least, see https://github.com/PRQL/prql/issues/1092). Thus there is no UPDATE or "upsert"
- What is merge? How is it different from join? How is it different from union?
- SQL does define `information_schema`, which can be used to list tables and columns. We had ideas about introspection and listing tables, but this is not on the roadmap yet. A probable outcome here will be improvements to tooling so you will be able to find tables via autocomplete (LSP).
Merging as a one-keyword feature for all flavors of SQL is almost impossible.
It gets hairy when you have columns with composite types. E.g. depending on database, records can be JSON objects, protobufs, structs, or other composite types, like tuples.
It is possible to define semantics to merge each of these, but they vary, and the merge operation becomes verbose in order to handle the weirdnesses of each of these underlying types.
Merging is also sensitive to NULL vs 0 vs a DEFAULT, plus in some databases, NULL is still allowed even when a column is marked NOT NULL.
You'd almost need a sub-language for the merging operation, specifying how to handle each corner case.
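For readers wondering what MERGE even looks like (per the question upthread), the basic standard-SQL shape is roughly the following, with `target` and `source` as placeholder names; the corner cases above are exactly where this syntax starts sprouting extra clauses:

    MERGE INTO target t
    USING source s
        ON t.id = s.id
    WHEN MATCHED THEN
        UPDATE SET value = s.value
    WHEN NOT MATCHED THEN
        INSERT (id, value) VALUES (s.id, s.value);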
I wish this was more programmatic and less stringy. An intermediate layer of string queries feels redundant. But I understand it makes for a clean separation of layers and concerns.
LINQ is neat because it maps from the C# IEnumerable API to SQL. So you use the same methods to work with in-memory collections as you do database tables, which leads to a very natural feeling programmer experience. Compare to Rails' Active record where `things.where(foo: 6)` calls out to the database but `things.filter { |t| t.foo == 6 }` will pull all the things into memory and then iterate through them to create the filtered list. In LINQ those are both `things.Where(t => t.foo == 6)` and the type of "things" determines where the query executes.
This needs language/introspection support when converting to SQL, as you need to pick apart the expression's AST to figure out that it's e.g. a simple equality check on one of the fields.
Well, one of the things that makes it neat is that it works for any monad, where "monad" means any implementation of the following method:
F<Y> SelectMany(F<X> input, Func<X, F<Y>> map)
This could be a function where you have a list of things, you apply a function to each element that returns a list, and the result is the concatenation of the individual lists. Or you could use it for stuff like tasks, where the result is a task that returns something like `await map(await input)`.
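In SQL terms, by the way, that "map each element to a collection and concatenate the results" shape is what a LATERAL join expresses. A sketch in the PostgreSQL dialect, with an invented `orders` table holding a JSON array of items per row:

    -- for each order, expand its items array into rows and concatenate
    -- the per-order results: a relational flatMap/SelectMany
    SELECT o.id, item.elem
    FROM orders o
    CROSS JOIN LATERAL jsonb_array_elements(o.items) AS item(elem);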
Could you provide some examples of non-trivial SQL queries expressed as PRQL, and then what PRQL actually produces as a result?
As others have suggested, I do wonder how optimal the translation is for non-trivial cases, because that often matters for both cost and performance reasons.
The goal is for the compiler to be smart about which SQL constructs are used. We want to lift at least some of the burden of performance concerns from the user and pass it to the compiler.
But there is a caveat: when writing SQL, you only have to know roughly how the database optimizer works. When writing PRQL, you also have to know a little about the PRQL compiler.
At my previous employer, data scientists would regularly spend entire days only running 3-4 Hive queries and waiting hours for the results to come back... job failed. PRQL looks much easier to use, and the goals of good auto-complete and type-checking will make it even better. If any corporation with lots of data scientists would like to fund work on an open-source project, this seems like a no-brainer.
For the last two years I've been using dbt for some ETL work. It is quite useful, but at the same time quite ugly and clunky. It's basically just a bunch of macros, and it gets unwieldy pretty quickly. It's weird to have to declare the dependencies between the nodes in the DAG when they should be easy to figure out from the code. And dbt is also very restrictive in that it's designed around transformations that are functional so making more procedural transformations is really hard. So I think the world would buy a much better dbt. In other words, I think your focus on analytical queries is probably a good idea.
What makes using something like dbt palatable is that there really aren't any good alternatives in this space. dbt does have some strengths that are worth looking at, such as the extra tooling you get by using their IDE, the generated docs, and their ideas about a metrics layer.
I think that for PRQL to be really useful it needs to extend beyond just the syntax and really make writing these data pipelines much easier. The syntax of the SQL isn't the worst part of it. But I think there's a big potential market here.
I'm wondering if it would be worth skipping the SQL layer and compiling straight to the execution plan level. This would make PRQL tightly coupled to a database, but would also unlock performance and fine-tuning potential. And, maybe it would be possible to run it inside the database process, enabling data processing right at the source.
(eg. in Postgres, this would be more or less a replacement of the PREPARE statement.)
PRQL actually goes through two intermediary ASTs (PL and RQ) before being compiled into SQL. This is not that well documented yet but have a look at the comments at the top of the lib.rs [0]. Both representations are accessible as json in the Rust and Python APIs.
On the roadmap are plans to support DataFrame backends such as Polars, Pandas, Ibis, ... and Substrait.io [1]. The aim of Substrait is similar to what you describe, i.e. to create a universal relational algebra representation that can be executed directly by backend systems without having to go through SQL.
> I'm wondering if it would be worth skipping the SQL layer and compiling straight to the execution plan level.
You have to join five tables in a query. What join order will you pick in your execution plan? Will you use the same ordering for different WHERE clause parameters? What if you add an index?
What you're proposing was tried in the 1970s (see CODASYL) and it was a terrible idea.
This reminds me a lot of R's dplyr which is by far my favorite data manipulation suite. I have never enjoyed pandas even though I use it every day. This project is one of the few SQL offshoots that I actually hope succeeds. Well done and may you find your way to the hall of the data kings next to dplyr and ggplot.
The last example in this notebook [0] shows how similar PRQL is to dplyr. The rest of the notebook shows how you can use PRQL from R, Python and the command line.
This looks awesome. I want to integrate this with my SQL tool (using JS).
I've noticed the limited standard library. I wonder if PRQL supports "passthrough" functions where it just uses the function as-is, e.g. if I invoke "random_function(a)", it transpiles to "random_function(a)" in SQL.
In the examples, they show s-strings. If I understand it correctly, s"random_function(a)" does what you want. The question is whether you can rely on the `a` you reference still being named `a` in the generated SQL.
Correct, s-strings are the general "escape hatch" to allow pass-through for things that are not yet implemented, like dialect specific functions. See the section in the PRQL Book [0] for more detail. It does make it opaque to the compiler though so the user becomes responsible for ensuring symbol names are valid and accessible in the scope.
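For example, a sketch of passing a dialect-specific function through (table and column names invented):

    from events
    derive [event_day = s"DATE_TRUNC('day', created_at)"]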
I wish they made it more obvious in the docs. S-strings are not something I would look at when trying to figure out how to use "NOW()". I spent a reasonable amount of time looking for a way to call functions and then just abandoned the idea of PRQL since it seemed impossible.
I just tried to do what I would do originally and the steps were:
- look at the "functions" example on the main page (an opportunity to mention s-strings there)
- search for "function" in the prql book (s-strings are about 18th position, I probably bailed out before that)
- search for "prql SQL function" and "prql native function" - neither returns the answer on the first page
The s-string on the front page definitely helps a lot. I think just a bit of SEO and the right phrase here and there mentioning "SQL function" and "native function" could help a lot as well.
I initially had concerns about a proposed change to a 50-year-old language potentially disrespecting the power of SQL, but ultimately liked the change because it improved SQL readability and addressed the issue of filters being placed at the bottom of SQL statements.
What would be cool is "take this view with N columns, and add back in these other Y columns that were projected away" or "add/remove/modify these columns from the source table while preserving whatever others there may be".
What you are describing sounds like the `derive` transform in PRQL: it allows you to add new columns based on existing columns you already have. See [0].
Of course this is mostly also possible in SQL but PRQL makes this a bit nicer in that 1) `derive` makes it explicit that you'll be only adding columns and not projecting any away, and 2) you can refer to the newly created fields straight away.
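A small sketch of both points, with invented names:

    from orders
    derive [
      total          = price * quantity,
      total_with_tax = total * 1.2    # the just-created `total` is usable immediately
    ]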
This looks cool. I like SQL but I have to admit it’s often cumbersome.
Something that came to mind after looking at some example queries is the mix of parentheses and square brackets. It’s not immediately obvious to me when I’d use which.
Also (and this might just be a personal pet peeve), why do new languages insist on using flashy new keywords when the status quo would do just fine? What’s the point of filter/derive instead of just calling them where/select?
I sometimes notice languages really go out of their way to ‘be different’; I think it’s XSLT (or something XML-related) that uses when/otherwise instead of just if/else. Everyone knows if/else, stop trying so hard :P
“select” implies that you're selecting from some options, not creating new values. “where” is a preposition whereas “filter” is a verb, which fits the pipeline approach better.
Sure, that’s fair. You’re not wrong. But PRQL is meant to substitute for SQL, my argument here is keyword familiarity outweighs absolute linguistic precision.
Neat, but why make a new language when you could build something with better ergonomics using client libraries (essentially a newish form of ORM)? It all transpiles to SQL anyway.
This comes up periodically, but it's never interesting. SQL is good. What people don't like about it is all rooted in their unwillingness to learn it, on its own terms. They invariably want to import ideas from imperative languages that simply don't make sense within the functional/relational paradigm.
I don’t understand this attitude. We have many tens of programming languages for logic. Why is it that for data we really only have one that’s portable, and that we came up with it 30 years ago? That just doesn’t seem plausible to me.
Almost 50 years ago in fact. And there were alternatives all along that timeline and many before it.
SQL is the first and possibly the only successful fourth gen programming language, and the most substantive complaints generally boil down to "SELECT should have been after FROM and WHERE". The vast majority of the seekers to the crown have been third generation languages trying to fake it with relational algebra, and data is just too important to allow side effects.
In short, don't assume because something has lasted in the computing world for 50 years that it is fundamentally flawed and only due to inertia. That would be a very foolish assumption indeed.
SQL is a DSL for set theory, and as such any viable replacement will look substantially like it. When the differences are subtle, the impetus to retool/rewrite just evaporates.
I tried to play with it and rewrite some code in it to find out if it really is more readable... but without a CASE statement it can't actually express much business logic yet, so it will have to wait.
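For now the closest workaround seems to be wrapping a raw CASE expression in an s-string -- a sketch with invented names:

    from orders
    derive [
      size_label = s"CASE WHEN quantity > 100 THEN 'bulk' ELSE 'retail' END"
    ]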
SQL really sucks. In general I'm not a fan of SQL-like syntax since it makes queries with lots of joins really difficult to read -- it's hard to keep track which value came from where.
GraphQL has some neat ideas and I really love the clarity the nesting provides. But GraphQL is also too limiting: it can't express arbitrary filters, recursion, optional edges (i.e. left joins), etc.
My own take on this is an embeddable GraphQL-like language with all those features and with strong guarantees about execution performance. I've been calling it Trustfall, and you can play around with it (or check out example queries over HN data) in your browser here: https://play.predr.ag/
>it's hard to keep track which value came from where
Isn't that dependent on how you write the query? If you put the table name or its alias before the column, it's quite easy to keep track of the data source.
That's what I always tell people advocating for plastic straws.
They think plastic straws are great, they drink their cocktails with them and everything.
Then I tell them how when I try to wash my cars by assembling 50 straws into a water hose, the whole plastic straw concept quickly breaks down.
In my opinion you should never use plastic straws.
I mean, they're also talking about using the same table multiple times - aliases aren't just useful, they're required in that case so the engine knows which one you mean when using a column from one of them.
Best guess without actually seeing these queries, whoever wrote them is using bad aliases (like "foo1", "foo2" and so on) instead of something more reasonable.
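For example, when the same table appears twice, role-based aliases keep the provenance of every column obvious (names invented):

    SELECT o.id,
           cust.name AS customer_name,
           rep.name  AS sales_rep_name
    FROM orders o
    JOIN people AS cust ON cust.id = o.customer_id  -- same table in two roles,
    JOIN people AS rep  ON rep.id  = o.sales_rep_id -- disambiguated by alias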
This is pretty much how I already write SQL, just using CTEs. In fact I bet the most straightforward way to transpile PRQL to SQL would be using CTEs.
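For instance, a pipeline with a post-aggregation filter could plausibly come out as a chained CTE along these lines (my guess at the shape, not actual compiler output; names invented):

    WITH table_0 AS (
        SELECT country, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY country
    )
    SELECT country, avg_salary
    FROM table_0
    WHERE avg_salary > 100000;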
Am I understanding correctly that it's because there's no "select" at the end that chooses variables/expressions to become columns in the output? Instead, you specify the output columns as you go? In LINQ we only have "let" and no "derive", but instead we have to say "select" at the end to list the columns.