Parsing SQL (tomassetti.me)
158 points by teleforce on Aug 23, 2022 | hide | past | favorite | 106 comments



I'm surprised this doesn't mention the class of parsers I believe to be the most useful: the database in question itself. For example, I make extensive use of https://pglast.readthedocs.io/en/v3/, which is the Postgres SQL parser extracted into a standalone library (pg_query). No third-party library attempting to parse a particular flavour of SQL will be more accurate than the database that defines that flavour, since you're also getting all of the edge cases and quirks of its implementation.
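A cheap way to get a miniature version of the same idea (validation only, not a full AST, and with SQLite's engine standing in for Postgres/pglast) is to let the database compile a statement without running it:

```python
import sqlite3

# SQLite standing in for "the database itself as the parser":
# EXPLAIN compiles the statement without executing it, so the
# engine's own parser (and catalog) does the validation.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, name TEXT)")

def is_valid(sql: str) -> bool:
    try:
        con.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False

print(is_valid("SELECT name FROM t WHERE id = 1"))  # True
print(is_valid("SELEC name FROM t"))                # False
```

Note this also catches semantic errors (misspelled table/column names), which is exactly the "edge cases and quirks" benefit of using the real engine.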


In the same vein, I found this official library for parsing Microsoft T-SQL: https://devblogs.microsoft.com/azure-sql/programmatically-pa...

I used it for a nasty global search & replace where a statement with quoted strings was sometimes embedded in quoted strings.


It's nice, but it has a buuuuuuuunch of bugs - the published railroad diagrams/BNF grammar don't even match behavior in the wild, so I feel bad for them, but you can see IntelliSense fail for similar reasons.


Anyone know of the same for MySQL/MariaDB? I've found their grammar in the repository, but having a library would be better.

I've found unofficial attempts but they have all mishandled parts of SQL compared to the built-in MySQL parser.



Been reporting many ~bugs~ spec inconsistencies there :)

Apart from parser differences there's no support for SQL_MODE options yet.


I think the article focuses on SQL parsing that builds an AST that's usable in the programming language of your choice to analyze/transform the AST. so you'd have to write a mapping at least. Otherwise I agree.


Yep, it depends on what you want to do. Validating the SQL is correct is handy -- of course, but sometimes you need the AST.


pg_query does return the AST.


How do I pass my programming language as a parameter to it?


Why would you need to pass in your language? The AST is data. You can then take that AST, which is already validated and disambiguated and act upon it in any way that you'd like.
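To make "the AST is data" concrete, here's a sketch that walks a serialized parse tree with nothing but ordinary dict/list handling. The node shape here is made up and simplified; pg_query's real output is richer, but the principle is the same:

```python
import json

# Made-up, simplified node shape for "SELECT name FROM users";
# real pg_query output has more fields, but it's still just data.
ast_json = """
{"SelectStmt": {"targetList": [{"ColumnRef": {"fields": ["name"]}}],
                "fromClause": [{"RangeVar": {"relname": "users"}}]}}
"""

def tables(node):
    """Recursively collect relation names from RangeVar nodes."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "RangeVar":
                found.append(value["relname"])
            found.extend(tables(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(tables(item))
    return found

print(tables(json.loads(ast_json)))  # ['users']
```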


Because I want to analyze and transform the AST with the tools and libraries that my programming language offers. To be more precise, I want the data to be datatypes of my programming language and not JSON or some custom structure that I have to first parse myself.


Yeah this also seems very weird to me. Like, okay, that's cool, but did you ever hear of this thing called integration testing?


>SQL is an old language and not designed for large scale programming, it is not a language that developers will love. Even worse, they will not be very productive with it.

What the hell. I stopped reading at this point.


It's true though. Relational algebra is a very nice abstraction, but SQL as a language is subpar, to say the least. SQL is really unergonomic; it's difficult to reuse SQL code, which is why there are no SQL libraries worth talking about, and it's the reason people go out of their way to create tools to replace SQL, like ORMs or LINQ.

SQL is basically a DSL for data processing just like XSLT for example. Would you call XSLT "designed for large scale programming"?

The "they will not be very productive with it" comment is a bit of a stretch - it can be very productive in the domain it is designed for - but you can often replace SQL at its job (like for example use Dataframe API in Spark instead of SQL API) while there are many tasks that SQL is a very poor fit for.


As much as SQL could be better, it's a very strong local optimum. There is no credible replacement whatsoever for what it does.

It's true people try to replace it, but most solutions are of rather questionable quality. I personally argue against ORMs in pretty much any situation.

> SQL is basically a DSL for data processing

Yeah and what's wrong with that? SQL databases power everything, including data sources with billions of records that serve millions of customers, with hundreds of developers/DBAs working on them. Does that not count as programming in the large?


> There is no credible replacement whatsoever for what it does.

I'm compelled to agree with this under protest. There was a language called Dataphor, based on D and the Third Manifesto; it's unfortunately hard to find even sample code any longer, but it would be a strict improvement on SQL.

Did I mention you can't even find a corpus? Good luck running any implementation on a modern system.

I'm reasonably content writing SQL, but I know a syntax with the same power but lacking several disadvantages is possible, and I'd rather use a mature implementation of that instead, if I were able.


The worst part of SQL is indeed how hard it is to compose. Something like D, where you would keep the relational model but add user-defined types would indeed be a strict improvement.

The reason why I'm skeptical of most proposed alternatives is that I'm fundamentally in your exact same position: I know SQL isn't perfect but I'm reasonably content with it; and they invariably all end up throwing away the "good parts" of SQL (relational model, declarative, easy stuff is easy).

The first step of a hypothetical solution that replaces SQL isn't "SQL sucks"; it's "SQL is extremely good at what it does but has problems that are only fixable with a new language".


The Dataphor stuff was all open-sourced, I believe. But it's in a C#/MS ecosystem, and tied into their overall application builder tool etc. Not really a standalone product.

There are other "D"-type projects. Rel is one. I don't have a URL handy, but it's fairly easy to find. They publish their grammar and example code.

None of these tools have ever matured or become popular. I have theories why, we could discuss forever.

I think the issue has more to do with two things: a) most people don't understand the relational model fully, so they have no idea what they're missing; b) new databases (and existing ones) simply cannot afford to rock the boat here because they need customers. And in terms of the architecture of the DB system, the SQL parser is one of the lower-effort items (compared to the query planner, optimization, storage implementation and storage optimization, replication, etc.). So why invest there for little value? Customers aren't asking for it.

The company I'm contracting for right now has some interest in working in this space.

Date & Darwen did good work with "The Third Manifesto", but it went almost completely ignored, and the tone and target of it may have been off. Frustratingly, we went through a phase where SQL went out of style and then back into style ("NewSQL"), and so there may have been a window missed where alternative query languages (still based on the relational model) could have risen. But instead "NoSQL" was too interested in jettisoning the relational model along with SQL (mostly, I would argue, because they didn't understand it).

There has been some recent rise in Datalog implementations in the Clojure community. That is interesting, though not strictly as an alternative to SQL.


>As much as SQL could be better, it's a very strong local optimum

I mostly agree, though I think that SQL's popularity comes mostly from network effects.

> Yeah and what's wrong with that?

Nothing is wrong with that. I just think that it's reasonable to interpret the words "programming in the large" as "being usable as a general-purpose programming language" (which I admit is not a very strict definition), whereas SQL (or XSLT or awk or bash...) seems well suited for certain niches (important niches nonetheless).


Prolog is a better RM language than SQL. (Codd had a language called Alpha but that's about all I know about it.) It's not a credible replacement for all use cases, but it's pretty capable.


>SQL as a language is subpar to say the least. SQL is really unergonomic, it's difficult to reuse SQL code

On the DML side I've found SQL to be very reusable at any company I've worked for - grep/search repos for the tables of interest, throw the queries in a CTE and you're off to the races. If the data infrastructure is robust you can probably just query from ([un]materialized) views - quite literally SQL code reuse. And even across vastly different domains, even if not directly reusable, SQL queries are still highly transferable. I can see it being more true on the DDL side, but even there, at least anecdotally, most of the DBAs/devops/infrastructure engineers etc. I've met seemed to have favorable impressions of SQL when needed in between the tooling.
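For instance, an existing aggregate query found by grepping the repo can be reused verbatim by wrapping it in a CTE. A sketch against SQLite with an invented orders table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (id INTEGER, customer TEXT, total REAL);
INSERT INTO orders VALUES (1, 'alice', 10.0), (2, 'bob', 25.0), (3, 'alice', 5.0);
""")

# An existing query, dropped verbatim into a CTE and built upon
existing_query = "SELECT customer, SUM(total) AS spend FROM orders GROUP BY customer"

rows = con.execute(f"""
WITH spend_per_customer AS ({existing_query})
SELECT customer FROM spend_per_customer WHERE spend > 20
""").fetchall()
print(rows)  # [('bob',)]
```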

>while there are many tasks that SQL is a very poor fit for.

You probably don't want to try and perform graphical rendering tasks, and it's true certain use cases like dynamically generating SQL statements from table/column names _and then executing_ them requires some manual input. But which data/processing related tasks is SQL a very poor fit for?

In my experience it's much more common to actually see the opposite - inefficient usage of ORMs, exporting datasets only to then perform pre/postprocessing, often requiring building programming "jigs" to circumvent bottlenecks, etc when it could've been done in a more streamlined manner in the DB/warehouse.


I agree regarding SQL views - it's probably the most frequently used way of reusing SQL code.

> And even across vastly different domains, even if not directly reusable, SQL queries are still highly transferable.

That's the point, they are probably transferable in the "copy, paste and tweak" sense which would be highly frowned upon in other popular languages.

> But which data/processing related tasks is SQL a very poor fit for?

I specifically called SQL "a DSL for data processing", I probably should have written "there are many domains that SQL is a very poor fit for" to be more clear.

>In my experience it's much more common to actually see the opposite - inefficient usage of ORMs, exporting datasets only to then perform pre/postprocessing, often requiring building programming "jigs" to circumvent bottlenecks, etc when it could've been done in a more streamlined manner in the DB/warehouse.

Totally agree. Lukas Eder has some nice presentations about it.


Got it, I see what you mean, all fair points.

>Totally agree. Lukas Eder has some nice presentations about it.

Thanks - I googled him and instantly recognized the JOOQ blog, quality stuff.


>Totally agree. Lukas Eder has some nice presentations about it.

Thanks for the shout out! For the record, that's probably the referenced talk: https://www.youtube.com/watch?v=wTPGW1PNy_Y


It's nonsense. First off, there is no such thing as "large scale programming". Maybe they mean big data? SQL is just a description of what data you want; it has nothing to do with how that query is implemented or optimized. Unless you think Spark or BigQuery aren't appropriate for large data, in which case I'd like to see how removing SQL helps.


Reusing SQL might be a smell?

I have seen people go crazy in stored procedures or via ORMs with super complex queries.

Avoid complex queries (with more foresight about how to structure, cache and query data) and you might not need the SQL code reuse.

There might be some less-than-ideal bits you would love to DRY up, and would if it were OO or functional programming, but perhaps let it be in SQL.

SQL is performance programming anyway so you are allowed!


> there are no SQL libraries worth talking about

I assume you mean libraries of SQL code (e.g. query fragments/templates) as opposed to libraries for working with SQL?


Libraries written in SQL. I'm sure one can search for let's say a snippet that will create a date dimension table for a data warehouse and copy it but it's pretty clunky compared to even Python modules.

I'm criticizing SQL as a language which is lacking in composability (as opposed to criticizing relational algebra or relational data model).

Libraries for working with SQL (full-fledged ORMs or simpler query builders) are themselves written in a different language so they don't necessarily prove anything about SQL itself, though one could argue that if people want to use them then they might not be satisfied with raw SQL.


A library written in SQL is called... a view. If you are repeating yourself in SQL to the extent that you're reaching for a library, you might be doing it wrong. Take advantage of your DB's facilities instead of fighting them.
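A tiny example of the "view as library" idea (SQLite, invented schema) - callers reuse the view instead of repeating the filter everywhere:

```python
import sqlite3

# The view encapsulates the "settled only" business rule once
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE payments (id INTEGER, amount REAL, status TEXT);
INSERT INTO payments VALUES
    (1, 50.0, 'settled'), (2, 20.0, 'pending'), (3, 30.0, 'settled');
CREATE VIEW settled_payments AS
    SELECT id, amount FROM payments WHERE status = 'settled';
""")

total = con.execute("SELECT SUM(amount) FROM settled_payments").fetchone()[0]
print(total)  # 80.0
```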

Or alternatively, if it's for things that really are mutating state and involving biz logic... a stored procedure.

Date & Darwen proposed the addition of "operators" to the relational algebra in their "Tutorial D" description of alternatives to SQL. That is, instead of "stored procedure", the addition of a type system matching on relations ("tables") and their tuples ("rows"), and then the ability to create user-defined programmatic "operators" (a bit like OO methods) for those types.


I think the point was that it's not really easy to broadly share useful libraries of views or stored procedures. At one point Oracle was working on an integration that would allow you to package stored procedures written in JavaScript as node modules and then just install them to any database. Not sure what came of that.


> it's difficult to reuse SQL code which is why there are no SQL libraries worth talking about

I feel that it isn't SQL that is actually difficult to re-use. It's what SQL was designed to describe that is difficult to re-use: business entities.


If you've ever tried debugging a broken 100 line SQL statement created by some business analyst for some niche use then you might agree.


There was a time that doing exactly this (fixing, updating, or modifying existing queries) was a significant chunk of my day-to-day work. It certainly can be painful and tedious, but that's quite universal - certainly not specific to SQL.

Unlike "true" programming languages (not to get into a debate about SQL's completeness, we all know it isn't practically a general-purpose language) this is usually best approached differently than debugging a program. Whenever I encountered a mess of a query, I'd ask myself two questions:

1. What is the purpose of the query/what is the desired output?

2. What are the source tables used in the query?

At this point I'd typically rewrite the query, taking inspiration from the non-broken parts of the original if possible, and otherwise rewriting from scratch. In most cases this was faster (and the rewritten result often more efficient) than trying to debug a truly busted statement.


Have you ever tried debugging a broken 100 line python script created by a business analyst? It's not any better.


It’s better by orders of magnitude: the Python script you can literally trace instruction by instruction, and process every which way you want.


But that's not a function of the language itself. If there were SQL debuggers you could step through a big query with, debugging SQL would be a lot easier. As it is, you often have to take the whole damn thing apart from the inside to see what's going on, then put it all back together again. That's... suboptimal, to say the least.


> But that's not a function of the language itself.

It is tho, SQL is a declarative language, so the execution model is largely opaque, by language design.

But even if it were solely an implementation detail, Closi would still be wrong: they literally stated that “it’s not any better” to try debugging a 100-line Python script than a 100-line query.

> As it is, you often have to take the whole damn thing apart from the inside to see what's going on, then put it all back together again. That's... suboptimal, to say the least.

Which is the point? Sounds like you agree that it is significantly easier to debug a 100-line Python script than a 100-line SQL query.


I often watch in horror at what some people present as 'good SQL'. You even find plenty of those humongous multi-hundred-line beasts on the official PostgreSQL wiki - a single statement going over several pages if you were to print it out. Sure, that's hard to debug. What I do in such cases is use many intermediate views, sometimes just named like `_foobar_010`, `_foobar_020` and so on, and a single `create (materialized) view foobar ... select from _foobar_120 ...`. That way I can rely on the DB caring for basic structural soundness, giving error messages that are better localized than "something is wrong near ','", and I can visually (or, indeed, programmatically) debug all the intermediate steps.
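A sketch of that numbered-intermediate-views approach (SQLite, invented events table; each `_slow_0NN` step can be queried and inspected on its own):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (id INTEGER, kind TEXT, ms INTEGER);
INSERT INTO events VALUES (1, 'click', 120), (2, 'click', 480), (3, 'view', 90);
-- Each numbered view is one debuggable step of the pipeline
CREATE VIEW _slow_010 AS SELECT * FROM events WHERE ms > 100;
CREATE VIEW _slow_020 AS SELECT kind, COUNT(*) AS n FROM _slow_010 GROUP BY kind;
CREATE VIEW slow_summary AS SELECT * FROM _slow_020;
""")

step1 = con.execute("SELECT * FROM _slow_010").fetchall()   # inspect one step
final = con.execute("SELECT * FROM slow_summary").fetchall()
print(step1)  # [(1, 'click', 120), (2, 'click', 480)]
print(final)  # [('click', 2)]
```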


This is just obviously selection bias, the stuff people put in the wiki is exactly the set of things that are so hard to do it's worth having a reference for other people. The stuff in the postgres wiki is in no way representative of sql in general.


That you don't know SQL very well isn't an argument against it? Give me a 100 line SQL statement and I'll have it figured out in 20 minutes, about the same as it would take to figure out bad C++ or Python or whatever else.


I have done that. This sounds more like a developer who is not truly skilled in SQL than an SQL issue.


I like to think that a few weeks with your head in the trenches of SQL and you come out a competent SQL developer on the other side.

I don't mind writing a query, or even a complex query at all. Sometimes it can be draining if you need to nest more queries than you'd like rather than have some joins, but I personally think SQL is a fine language.

Could just be Stockholm Syndrome.


The SQL language design is more contemporary with COBOL than with anything we'd choose to use today.

That's not implying that large, successful projects cannot be built upon SQL. We see them in the real world all the time.

We have awesome implementations of the language, but any productivity is in spite of a quirky language designed in the early 1970s, not because of it.


It's a spurious comparison. I wouldn't pick COBOL as part of my stack for a greenfield project. I would definitely 100% pick SQL as part of my stack for a greenfield project.


Because COBOL has been replaced with better languages for most purposes. SQL is entrenched. I'd also pick SQL for a greenfield project, because... what else?

This says nothing about the quality of SQL as a language.

What I think is more interesting is: if you were to greenfield a language for a relational database, how much would it look like SQL?


Considering (a) how big and important its domain is; (b) its utter lack of any credible competition whatsoever; (c) the wealth of attempts to displace it, yes, I'd say it says something about the quality of SQL as a language. As a language, i.e. as a vehicle to concisely, effectively, efficiently expressing ideas, it's unquestionably successful.

> if you were to greenfield a language for a relational database, how much would it look like SQL?

Save for secondary concerns like syntax, it would be exactly the same in all respects except easier composability.


utter lack of any credible competition whatsoever

There is no low-level API to SQL servers to create competition, so this is like saying CSS has no competition in browsers whatsoever. Yes, it does not, because it was never allowed. Let's go make a browser competition as part of our weekend, or at least RDBMS, sure /s. ORMs with their "relations" etc are basically better DSLs over SQL databases which try to hide rough edges from an innocent user.

concisely, effectively, efficiently expressing ideas, it's unquestionably successful

Yeah, "group by case when cond1 then v1 when cond2 then v2 else v3 end" instead of "group by alias1". Because, you know, it's declarative, but not really: https://stackoverflow.com/a/3841804/3125367

Or fetching thousands of rows out of the explosion of "select t.a, t.b, p.propname, p.propvalue from items t left join props p on t.id = p.item_id". Or the inability to join two different subtables like 'props' and 'per_city_prices' simultaneously without exploding into infinity. Of course you can always make 3 queries and join at your place, which is neither effective nor efficient nor concise. A business wants simple [{a, b, props:[{name, value}, ...], per_city_prices:[{city, price}, ...]}, ...]. Can SQL do that? It cannot.

Or getting corresponding non-aggregate values along with min/max results, which requires very funny self-joins if it's already e.g. a windowed aggregation.

SQL is cool for these relational... matrices(?), but sucks as a programming language in both syntactic and semantic parts. Some servers fixed few dumb parts of it, but not generally.


> There is no low-level API to SQL servers to create competition, so this is like saying CSS has no competition in browsers whatsoever.

Yes... but there is only one Web, and many databases. Some of them don't use SQL. They are, in my opinion, and in the opinion of many other developers, inferior. They lost in the market in a fair fight.

Also if SQL sucks so much you can compile to SQL, like TS->JS. Many such solutions exist. All worse than SQL.

> ORMs with their "relations" etc are basically better DSLs over SQL databases which try to hide rough edges from an innocent user.

ORMs are developed by software engineers for software engineers. It's kind of silly to picture them as "innocent" or incapable to understand the relational model.

> https://stackoverflow.com/a/3841804/3125367

Does having an operational semantics make a language not declarative? That's not how I use words. Is Haskell not functional because it runs on a GRS?

For the rest of your post you're arguing about either syntax, stuff you solve with CTEs, or stuff you would just hand off to the programming language, so I'll just leave it at that.


So the hard parts were left at that and easy parts were dismissed as usual because one can CTEs all the way down. I shut up and go write three queries, manual joins and funny five-story self-joins, cause there is no problem. Who said it’s a language problem when it’s mine. Thanks for the explanation!


> because... what else?

Perhaps SQL is so good that it couldn't be replaced all these years. But yes, debugging could have been better.


I have a better comparison.

It is as though we lived in a world where, like our own, Lisp invented garbage collection. But unlike our own world, in 2022, Lisp is the only garbage collected language anyone uses. Others were invented but that mostly stopped by the 80s.

Many people would like to replace Lisp, and keep in mind this is Lisp so lexical scope is only available in some implementations and people who want cross-platform compatibility don't use it. But without it, you have to manage your own memory.

That's SQL and relational databases. Relational databases with ACID guarantees aren't optional, they aren't a nice-to-have, they're foundational.

And we talk to them in SQL because... we talk to them in SQL.


If it has survived this long without being replaced by something better it must be really good.


Haha, my job makes me rewrite SQL that data analysts come up with into Spark jobs in Scala. Their main argument is that SQL isn't unit-testable like Scala.

This grunt work of SQL => Spark translation is easily the most boring part of my job.

Other than that, I do kind of agree that SQL isn't really good at large-scale programming. There is no IDE support while you are working; it's hard to extract common logic, hard to reuse, hard to compose, hard to unit test. All these things have been mainstays of conventional programming.


That's also my experience with current programmers: they don't know how powerful and fast the database system is or what it can do, so they implement the routine at the application level.


Why? I'd say it's true


Case in point, the generally-loved C# feature, LINQ, is modeled after SQL. Except for a nonexistent (or at least not standardized) type system, SQL's just fine for data modeling and often analysis.


As a data point from a LINQ fan - I love it precisely because it improves on SQL, in terms of composability, regularity and allowing some amount of meta-programming.


I took a DB implementation class my senior year of college (I still giggle at our team name: "YourSQL"). I wrote our parsers. First was the DDL, and I wrote the lexer and parser by hand (this is all in C). It's not necessarily a hard task, but dealing with where the hell that segfault is coming from takes a lot more time. I convinced the professor (after a short essay) to let us use lex and yacc since even PostgreSQL uses them. Made the job so much easier to write and modify.

I often see people saying parser generators are a waste of time for various reasons. There are a good few cases I'd agree in:

- Performance is critical

- You want lots of power for error reporting

- You really want to learn about recursive descent parsers by writing one by hand

Personally, after the first time, lexing and parsing are the least interesting parts of a compiler. Code gen, type checking, optimization, etc. are the real meat and potatoes for me, and I find parser generators to be a great time saver.

In general, I see generated code becoming less common (could just be me), which I don't think is a bad thing. I see a lot fewer CFG parsing tools these days too, aside from parser combinators. Could just be me not paying attention, but it's an interesting way we've progressed.

If I were working on a new language - hobby or professional - I'd start with parsing tools and swap them out when it became worthwhile.


Parsing SQL can solve many problems, but before writing/choosing a parser you need to know what your goal is. For example, at work we wanted to have statically-typed SQL statements in Python/mypy to statically identify syntax errors, misspelled column names and bad type mixes. We came up with a home-grown SQL parser [1] and type checker [2] in Rust and a mypy plugin [3]. If someone knows of other solutions for obtaining SQL/mypy integration I'd love to know about it, since parsing SQL is not exactly supposed to be our field of work.

[1] https://github.com/antialize/sql-parse/blob/7869595aa92aed0b... [2] https://github.com/antialize/sql-type [3] https://github.com/antialize/py-mysql-type-plugin


It's probably not advertised enough, but jOOQ also contains a parser that can translate between dialects, and let users work with the expression tree to achieve custom SQL translations. A demo here: https://www.jooq.org/translate


Yes, parsing SQL is not straightforward, as there are many dialects.

But the rest of the article is confusing. There are not very many instances where someone needs to parse SQL - there are ORM libraries that generate the relevant SQL dialect. This has been the case since the early 2000s; see Hibernate. And for C#, there's LINQ to SQL.

The last sentence then goes into the “sell”: if you need help parsing sql, contact us at xyz…


> there are ORM libraries that generate the relevant sql dialect. This has been the case since the early 2000s; see Hibernate. And for C#, there’s LINQ to SQL.

That's generating SQL. Parsing is going from text to the AST; the reverse of what you're saying. Maybe I'm confused.


yes, apologies, I skipped ahead.

Parsing: "There’s not very many instances where someone needs to parse sql" The only use cases would be building your own database/query optimizer and having SQL be the API.

There are plenty of other tools already to help programming languages interface w/SQL, give compile time checks, test, generate SQL, etc.

Build the parser for fun, sure, but don't then offer consulting services on an article that didn't really specify a business problem or outcome and just talked about tech stuff.


He actually lists alternatives to his consulting at the end:

Consider if you can use existing tools or libraries to process SQL code. Some of them even support multiple SQL-dialects or multiple programming languages. If you can use any of them for your needs they should be your first option

If you want to build a solution in-house you may consider starting from the few ANTLR grammars available for the major SQL databases. They can really help you get started in parsing SQL

> There are plenty of other tools already to help ... give compile time checks

Which are these?

> ..that didn't really specify a business problem or outcome and just talked about tech stuff.

The business problem is to build an SQL parser; the outcome is an SQL parser. It's clear enough, and it's something I had to do. Your criticisms are unreasonable.


I can think of one: deriving a type for a prepared statement.

To do that you have to parse the schema to get the table types and work backward. That lets the analyzer decide what sorts of data may be legally passed to a given prepared statement, and produce type failures at compile time.

There are more, that's the first which came to mind. Disclosure: I'm writing an SQLite PEG for a number of reasons which might hypothetically include that one.
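A toy version of the idea (asking SQLite for the schema rather than parsing the DDL text, and mapping declared column types to host-language types; a real checker would then walk a statement's AST back to these columns):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT, balance REAL)")

# Declared SQL type -> host-language type
SQL_TO_PY = {"INTEGER": int, "TEXT": str, "REAL": float}

# PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
schema = {row[1]: SQL_TO_PY[row[2]]
          for row in con.execute("PRAGMA table_info(users)")}
print(schema)  # {'id': <class 'int'>, 'name': <class 'str'>, 'balance': <class 'float'>}
```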


One parser I think deserves a mention is the one from Apache Calcite[0]. Calcite does more than parsing, there are a number of users who pick up Calcite just for the parser. While the default parser attempts to adhere strictly to the SQL standard, of interest is also the Babel parser, which aims to be as permissive as possible in accepting different dialects of SQL.

Disclaimer: I am on the PMC of Apache Calcite, but not particularly active at the moment.

[0] https://calcite.apache.org/


Although I truly appreciate and admire Calcite, I've had a lot of trouble trying to get it to do what I want. Although, in all fairness, it could be me and my not finding enough time to deep-think on it.

Specifically, customizing the DDL grammar was really hard to understand and play with. It's been a while and maybe it's better, but I remember there was some kind of freemarker compiler templating language that had to be modified and documentation and posts on different forums was lacking to say the least.

I was trying to do some very unorthodox things at the time, but Calcite is probably a good fit for 98% of the things one might need to do with an SQL compiler/parser.


I agree that customization can be complicated. Adding more tutorials and documentation is something the community is definitely interested in.


If you don't want to do it yourself, there's this:

https://github.com/google/zetasql

Parsing is huge but it's amazing how small a part of the job it is. This library isn't even the half of it.


"you cannot do everything related to data in SQL. In some cases, you need a traditional programming language to work with the data"

Them's fighting words :)

But point taken regarding the benefit of modern language tools applied to query-based work.


I inherited a GraphQL project that was a work of art in that respect. All access control etc. was done in the database, triggers were used effectively - the works.

The resulting application was a complete performance failure. Databases are slow and generally involve latency. They are a source of truth and an engine for querying.

They are not meant to be your applications working data model.


Especially since it's literally untrue. SQL, with the addition of recursive queries (supported by all since the late 90s AND used in an example within this very article), is Turing complete. There's literally no operation on data that can't in principle be expressed as an SQL query (albeit the construction might be painful and tedious).
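For flavor, here's a recursive CTE computing Fibonacci numbers with nothing but a query (run through SQLite from Python; WITH RECURSIVE is the SQL:1999 feature in question):

```python
import sqlite3

# Pure SQL iteration via a recursive CTE - no procedural code at all
con = sqlite3.connect(":memory:")
rows = con.execute("""
WITH RECURSIVE fib(n, a, b) AS (
    SELECT 1, 0, 1
    UNION ALL
    SELECT n + 1, b, a + b FROM fib WHERE n < 8
)
SELECT a FROM fib
""").fetchall()

fibs = [r[0] for r in rows]
print(fibs)  # [0, 1, 1, 2, 3, 5, 8, 13]
```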


Do a dynamic pivot (I'll wait.)

But in all seriousness, DataFrame-centric operations are a superset of SQL (you can always do a df.sql("...") if you want to) and have much more efficient implementations of both OLTP/ORM requirements and OLAP/DS/BI requirements.

They also encourage composability, modularization, reuse, unit and data testing ...

So it's ironic I feel like SQL's replacement is another declarative language - its original inspiration - plain English.

Just a natural language transpiler (like Palantir's Ontology plus Looker's Malloy (reverse disclaimer: I do not work for or enjoy either of these products, but the underlying concepts are correct)) with some fancy domain heuristics and light AI (I suspect a Pareto-like model that supports 80% of use cases only needs a semantic graph with a few thousand nodes and edges)


> There's literally no operation on data that can't in principle be expressed as an SQL query (albeit the construction might be painful and tedious).

There are some problems though:

- Single query can't write to multiple tables.

- Single query can't return multiple resultsets.

You can write a stored procedure, but this is no longer "an SQL query", strictly speaking. And once you start writing stored procedures, you are no longer using just SQL, but whatever Ada-inspired procedural extension to SQL was implemented by the database vendor. In other words, you are using "a traditional programming language to work with the data".


A single query can write to multiple tables, using CTEs in PostgreSQL for example.

You could compose a SQL query that allows you to map multiple resultsets to 1 resultset, although that feels a bit awkward.

    WITH a AS (
        insert into a (k, v) values ('a', 1.0) returning *
    ), b AS (
        insert into b (k, v) values ('b', 2.0) returning *
    )
    SELECT
        row_to_json(a)
    FROM
        a
    UNION ALL
    SELECT
        row_to_json(b)
    FROM
        b;
Returns:

        row_to_json        
    --------------------------
    {"a_id":1,"k":"a","v":1}
    {"b_id":1,"k":"b","v":2}
    (2 rows)
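A similar "collapse several result sets into one JSON-per-row result set" shape can be sketched with Python's sqlite3 (SQLite has no data-modifying CTEs, so the inserts here are separate statements, and `json_object` stands in for `row_to_json` -- an adaptation, not a port of the Postgres query above; it needs a SQLite build with the JSON1 functions, which modern Python ships by default):

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER PRIMARY KEY, k TEXT, v REAL);
    CREATE TABLE b (b_id INTEGER PRIMARY KEY, k TEXT, v REAL);
    INSERT INTO a (k, v) VALUES ('a', 1.0);
    INSERT INTO b (k, v) VALUES ('b', 2.0);
""")
# Collapse two differently-shaped result sets into one column of JSON documents.
rows = conn.execute("""
    SELECT json_object('a_id', a_id, 'k', k, 'v', v) FROM a
    UNION ALL
    SELECT json_object('b_id', b_id, 'k', k, 'v', v) FROM b
""").fetchall()
docs = [json.loads(doc) for (doc,) in rows]
print(docs)
```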


That is eye-opening but - if it actually works and I find it hard to believe - is no way ...

ah, the insert is a CTE because it produces a value ('returning' I guess). Hmm. This is very odd. Doesn't seem to work in mssql.

Well thanks for the can of worms...


mssql CTE support is very very basic, to the point of being not very useful.


What the hell are you talking about?


You can only select, you can't nest CTEs, the query planner has no understanding of joins that cross CTE boundaries so they are totally unoptimized, you can't use distinct or group by, etc. Really, the CTE implementation in SQL Server is basically a parser-level hack.


> You can only select

   with x as (...)
   update x set ...
> you can't nest cte

ok but you can linearise them

   with x as (...), y as (...)
> the query planner has no understanding of joins that cross cte boundaries so they are totally unoptimized

utter, reeking garbage.

> you can't use distinct or group

more garbage. I have. Show me an example of it not working.

(Edited for less rudeness)


so you can do this?

  WITH t AS (
      DELETE FROM foo 
  )
  DELETE FROM bar;
Or can you only use SELECT for WITH queries? Did you not realize that other databases and the SQL standard allow you to do this?

Yes on the final query of the CTE you can do all sorts of things, but that's way less useful if you can't do them in all the component queries.


Show me any sql dialect that will allow that CTE you give here (edit; and what on earth is that supposed to actually mean)

> Or can you only use SELECT for WITH queries

you can only use a select inside a CTE (or should be able to) because the 'e' stands for 'expression'. It seems Postgres does allow an insert with an output, which sort of makes sense, but I doubt it's in the standard.

Your last sentence makes no sense to me. Give an example.

Also you failed to give an example that group by/distinct weren't allowed in CTEs.


Good to know, I wasn't aware PostgreSQL supports this.

I'm currently on SQL Server and it doesn't support INSERT as a CTE (and I think most DBMSes out there still don't). It would definitely make my life easier if it did...


It's not the only DB or way of doing it either. Snowflake supports multi-table inserts for example:

https://docs.snowflake.com/en/sql-reference/sql/insert-multi...


> There's literally no operation on data that can't in principle be expressed as an SQL query

That may be technically true (or not, I don't know), but in many cases some complex data manipulation (especially when it's done in multiple passes) practically needs a lot of ram and a programming language to be more time efficient.



Technically true, but SQL has the ability to induce discipline - every single time I’ve seen folks given the opportunity to do the analysis off a database or warehouse and use pandas/spark, they almost always end up writing bad code, processing one row or cell at a time. Almost always, they could have done it with SQL itself and gotten much better performance, but this “extra flexibility” allowed them to be lazier.

SQL can also have the same issue with feature creep (Snowflake's array columns often invite people to write pathetic SQL trying to join on this col, for example), but with the right design you can force your users to interact with your data more efficiently.


This article is all over the place.

Anyway, the author would have been better off writing a query to generate the class definitions directly from the data dictionary (if available). INFORMATION_SCHEMA (and other proprietary implementations, like the DBA_ views in Oracle or sys views in SQL Server/Sybase) is just perfect for this. An order of magnitude easier than writing a parser.
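A minimal sketch of that approach, using SQLite's PRAGMA table_info as a stand-in for INFORMATION_SCHEMA (the table, the type map, and the class rendering are all made up for the example):

```python
import sqlite3

# Assumption: a tiny SQL-to-Python type mapping for illustration only.
TYPE_MAP = {"INTEGER": "int", "TEXT": "str", "REAL": "float"}

def table_to_class(conn, table):
    # PRAGMA table_info plays the role of INFORMATION_SCHEMA.COLUMNS here;
    # each row is (cid, name, type, notnull, dflt_value, pk).
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    lines = [f"class {table.capitalize()}:"]
    for _cid, name, coltype, _notnull, _default, _pk in cols:
        lines.append(f"    {name}: {TYPE_MAP.get(coltype.upper(), 'object')}")
    return "\n".join(lines)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, score REAL)")
print(table_to_class(conn, "users"))
```

No parsing involved: the database already knows the schema, so you just read it back out.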


Depending on your specific requirements, querying the data dictionary isn't always sufficient; in particular, when there is a chance for changes to the schema.

An example: in case of Debezium's MySQL connector (a change data capture platform), we need to know the table schema in order to interpret incoming change events from the MySQL binlog. Due to the async nature of this process, such event may adhere to an earlier schema version, which is different from how the table looks in the data dictionary now. For instance, a column's type may have changed between the point in time when that change event was created and now. That's why indeed we also implemented a full MySQL DDL parser for Debezium, as DDL events show up in-stream and thus allow us to react to schema changes at the right moment and update our model of the table structures accordingly.

Now, ideally, data dictionaries would be versioned, so that we could indeed query historic dictionary entries (with actual change events containing a schema revision id), but I'm not aware of any database which supports that. Touched on this briefly in a talk I did earlier this year for Andy Pavlo's database group at CMU [1].

[1] https://speakerdeck.com/gunnarmorling/open-source-change-dat...


Yes. Parsing SQL solves some rather specific problems, which the article doesn’t seem terribly clear about.

Interrogating the running database to establish schemas and types is exactly what my Postgres/TypeScript library[1] does, for example.

[1] https://jawj.github.io/zapatos


You've completely missed the point. Author's talking about parsing SQL. That's utterly different from using infoschema to get an existing table structure.


No, I have not. The article has no clear direction, hence your comment. The article talks about parsing SQL DDL with the goal of creating classes/structures that represent the tables. You do not need to parse SQL to achieve this. The premise at the start of the article is "parsing SQL so that we can achieve things in code that is not possible in SQL". If that were indeed the case, there would be no such SQL to parse in the first place!


Perhaps I missed the point too, but the way I understand the post is that it's demonstrating how to parse DDLs so that a model (AST) of the actions and structures can be extracted.

From there an implementation can be defined to create, alter, drop, etc the respective objects and then probably update some kind of manifest that represents the objects and potentially its relationship to others. In most databases that manifest is the INFORMATION_SCHEMA.
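To make the "extract a model from DDL" idea concrete, here's a deliberately naive sketch (a regex, not a real parser -- it breaks on quoting, table constraints, and types like DECIMAL(10,2), which is exactly why the article reaches for proper parsers):

```python
import re

def parse_create_table(ddl):
    """Toy sketch: pull a table name and columns out of a simple CREATE TABLE.
    Real SQL needs a proper grammar; this ignores quoting and constraints."""
    m = re.match(r"\s*CREATE\s+TABLE\s+(\w+)\s*\((.*)\)\s*;?\s*$", ddl, re.I | re.S)
    if not m:
        raise ValueError("unsupported statement")
    name, body = m.groups()
    columns = []
    for part in body.split(","):  # breaks on commas inside types, e.g. DECIMAL(10,2)
        toks = part.split()
        if toks:
            columns.append({"name": toks[0], "type": " ".join(toks[1:])})
    return {"node": "CreateTable", "table": name, "columns": columns}

ast = parse_create_table("CREATE TABLE foo (a INT, b VARCHAR(20) NOT NULL)")
print(ast)
```

From a model like this, an implementation could emit classes, migrations, or manifest updates as described above.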


> creating classes/structures that represent the tables

AFAICS he's talking about creating AST objects.

If the tables don't exist, you can't query the infoschema because the tables don't exist. Someone has to parse the table-definition source code to work out what the tables must eventually look like when turned into structures in the database.


On a related note, does any one know of a parser + planner library that allows you to spit out a query plan (either logical or physical) in a serializable form (e.g. JSON)? For example, say I want to parse a query on the client side into a basic plan (that might be further optimised on the server side).


It would always be nonsensical unless you were running on the database you were targeting (or one that contained similar data distributions).

However, with Postgres it's pretty trivial to get a JSON format of the output if you need a machine-parseable answer. I don't know of a library that does this for you, but it wouldn't surprise me if one existed.

    explain (format json, analyze, buffers) select 1
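Consuming that JSON is then straightforward; the sample document below is hand-written in the shape Postgres emits, not captured from a real server (node names and costs are invented for illustration):

```python
import json

# Hand-written sample shaped like EXPLAIN (FORMAT JSON) output --
# an assumption for illustration, not real server output.
plan_json = """
[{"Plan": {"Node Type": "Hash Join",
           "Total Cost": 120.5,
           "Plans": [{"Node Type": "Seq Scan", "Relation Name": "a", "Total Cost": 35.5},
                     {"Node Type": "Hash",
                      "Plans": [{"Node Type": "Seq Scan", "Relation Name": "b", "Total Cost": 35.5}]}]}}]
"""

def plan_nodes(node):
    # Collect node types depth-first; Postgres nests children under "Plans".
    out = [node["Node Type"]]
    for child in node.get("Plans", []):
        out.extend(plan_nodes(child))
    return out

print(plan_nodes(json.loads(plan_json)[0]["Plan"]))
# ['Hash Join', 'Seq Scan', 'Hash', 'Seq Scan']
```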


Great, thanks. How modular is the postgres codebase? Would it be relatively easy to factor out the part of the code that does the above from the rest of the database (e.g. from the execution or storage engines)?


The hasql-th[0] package implements an SQL parser to give better error messages; here's an example:

Consider the following broken statement:

    select 1 from a where b >= 3 && b < 4

It is incorrect, because it uses && instead of and. But here's what Postgres' original parser says about it:

    ERROR:  syntax error at or near "<"
    LINE 1: select 1 from a where b >= 3 && b < 4;
                                              ^

Here's what "hasql-th" says:

      |
    2 | select 1 from a where b >= 3 && b < 4;
      |                              ^
    unexpected '&'

0: https://github.com/nikita-volkov/hasql-th#error-example-1


Open Source SQL Parsers

https://news.ycombinator.com/item?id=31107231

2022-04-21 125 points, 102 comments


If you're looking for a good SQL parsing library, I did a review/survey of parsers in a bunch of major languages. Most of them are full Postgres or MySQL parsers.

https://datastation.multiprocess.io/blog/2022-04-11-sql-pars...


Fancy adding the jOOQ parser to your list of Java parsers? https://www.jooq.org/doc/latest/manual/sql-building/sql-pars...


Neat! I'll give it a shot.


Recently built a simple SQL parser for Sequelize:

https://github.com/wenerme/js-miniquery


On a slightly related note, does anyone have any good recommendations for a parsing library in nodejs or rust for parsing advanced search queries? Similar to twitter’s advanced search?


In PySpark (with `some_sql` as a placeholder for your query string):

    parser = self.spark._jsparkSession.sessionState().sqlParser()
    parser.parseExpression(some_sql)



