Dsq: Commandline tool for running SQL queries against JSON, CSV, Parquet, etc. (multiprocess.io)
309 points by eatonphil on Jan 11, 2022 | 49 comments



I shared the first version of this in a Show HN last month [0]. The big update since then is that it now supports loading multiple files and doing SQL joins on them. You can see examples of that in this post.

The repo is here [1] and the README has a comparison against some of the other great tools in the space like q, octosql, and textql.

[0] https://news.ycombinator.com/item?id=29643835

[1] https://github.com/multiprocessio/dsq
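
For a quick taste, a multi-file join looks roughly like this (a sketch; see the README for the exact placeholder syntax):

    # hypothetical files; {0} and {1} refer to the first and second file argument
    dsq users.csv ages.json \
      "select {0}.name, {1}.age from {0} join {1} on {0}.id = {1}.id"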


Can we get binary releases? There are some places I'd like to use this where I don't want to install Go just to build the binary, and would prefer to just `curl` one. I didn't see any GitHub releases.


Yup in fact there's an open PR for this right now! It should be available shortly.


Awesome! Looking forward to it!



Way cool, Phil! I've been struggling to grok the value prop of DataStation for a while now, but dsq crystallizes it for me!


Interesting, and thanks! It has definitely not been straightforward to explain. Mostly my fault for not working on tutorials, docs, and videos. Thankfully DataStation's performance, features, and install time are now in a pretty good spot, so I'm going to be focusing more on the education side.


Any thoughts on supporting Avro? I notice there's no issue for it in the GitHub repo; curious if there's a reason for that. We're pretty much replacing all CSV interchange with Avro (at least where the other side can handle it).


Planning to! Feel free to open an issue so you can follow along.


Neat! I have also built a similar project in Rust https://github.com/roapi/roapi/tree/main/columnq-cli :)


Nice project!


I really like this idea. jq is great too, but the jq language just seems too different from anything else for me, so it only works if I use it every day, which I don't. But I guess an issue is that I'll still need to use jq to get my data into the array-of-objects format that dsq expects?
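
For example, a rough sketch of the kind of pre-transform I mean (hypothetical file names; dsq's table placeholder as described in its README):

    # nested.json: {"users": {"1": {"name": "a"}, "2": {"name": "b"}}}
    jq '[.users[]]' nested.json > flat.json    # object-of-objects -> array of objects
    jq -s '.' records.ndjson > array.json      # newline-delimited JSON -> array of objects
    dsq flat.json "SELECT name FROM {}"        # then query the flattened file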


For now yeah. Eventually I'd like to support pre transforms or something within dsq itself.


There really ought to be a standard, language-agnostic way of querying nested config file formats with identifiers by now. For all the flak XML gets, at least it has XPath. YAML has YAML Path, but it's only supported by a single Python lib.
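
For example, pulling one value out of an XML config is a one-liner with any XPath tool (xmllint here; file and path made up):

    xmllint --xpath 'string(/config/server/port)' app-config.xml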


Augeas comes to mind [1], although it seems like a pretty niche project (since you mention a "standard" ...).

1: http://augeas.net


This is neat!

My version of this is `sqlite-utils memory`, but it only handles JSON, newline-delimited JSON and CSV/TSV at the moment: https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/
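
A quick example of the kind of invocation it supports (the in-memory table name is derived from the file name):

    sqlite-utils memory data.csv "select * from data limit 5"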


dsq references a benchmark done by q (https://github.com/harelba/q/blob/master/test/BENCHMARK.md) that indicates that octosql is significantly slower.

However, octosql's GH repo claims otherwise.

Does anyone have any real world experience that they can share on these tools?


Yeah, frankly the q benchmark isn't the best, even though dsq doesn't compare terribly in it. It isn't well documented, exercises a very limited amount of functionality, and isn't very rigorous from what I can see. That said, the caching q does is likely very solid (and not something dsq does).

The biggest risk with octosql, I think (both in terms of SQL compliance and performance; cube2222 is here somewhere to disagree with me if I'm wrong), is that it has its own entire SQL engine, whereas textql, q, and dsq use SQLite. But q is also in Python, whereas textql, octosql, and dsq are in Go.

In the next few weeks I'll be posting some benchmarks that I hope are a little fairer (or at least well-documented and reproducible). Though of course it would be appropriate to have independent benchmarks too since I now have a dog in the fight.

On a tangent, once the go-duckdb binding [0] matures I'd love to offer duckdb as an alternative engine flag within dsq (and DataStation). Would be neat to see.

[0] https://github.com/marcboeker/go-duckdb


Hey there! OctoSQL author here.

I've released OctoSQL v0.4.0 recently[0] which is a grand rewrite and is 2-3 orders of magnitude faster than OctoSQL was before. It's also much more robust overall with static typing and the plugin system for easy extension. The q benchmarks are over a year old and haven't been updated to reflect that yet.

Take a look at the README[1] for more details.

My benchmarks should be reproducible, you can find the script in the benchmarks/ repo directory.

Btw, since we're already talking: @eatonphil, I'd appreciate you updating your benchmarks to reflect these changes.

As far as the custom query engine goes - yes, there are both pros and cons to that. In the case of OctoSQL I'm building something that's much more dynamic - a full-blown dataflow engine - and can subscribe to multiple datasources to dynamically update queries as source data changes. This also means it can support streaming datasources. That is not possible with the other systems. It also means I don't have to load everything into a SQLite database before querying - I can optimize which columns I need to even read.

OctoSQL also lets you work with actual databases in your queries - like PostgreSQL or MySQL - and pushes predicates down to them, so it doesn't have to dump whole database tables. That's useful if you need to do cross-db joins, or JSON-file-with-database joins.

As far as SQL compliance goes it gets hairy in the details - as usual. The overall dialect is based on MySQL as I'm using a parser based on vitess's one, but I don't support some syntax, and add original syntax too (type assertions, temporal extensions, object and type notation).

Stay tuned for a lot more supported datasources, as the plugin system lets me work on that much quicker.

[0]: https://github.com/cube2222/octosql/releases/tag/v0.4.0

[1]: https://github.com/cube2222/octosql#readme


Neat! I'll give your script a shot.


Another interesting comparison would be clickhouse-local:

https://altinity.com/blog/2019/6/11/clickhouse-local-the-pow...
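
A sketch of the kind of invocation the post covers (flags from memory, so double-check against the docs; the stdin input table is called `table`):

    clickhouse-local \
      --input-format CSV \
      --structure 'id Int64, name String' \
      --query 'SELECT name, count(*) FROM table GROUP BY name' \
      < people.csv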


Awesome post, thanks for the link. I had no clue ClickHouse could do this. It shows off some more CLI tools in there too.

The big issue with ClickHouse is the incredibly non-standard SQL dialect and the random bugs that remain. It's an amazing project for analytics but you definitely have to be willing to hack around its SQL language (I say this as a massive fan of ClickHouse).

I wonder: does this mean I can embed ClickHouse in arbitrary software as a library? I'd be curious to provide that as an option in dsq.


> I wonder: does this mean I can embed ClickHouse in arbitrary software as a library? I'd be curious to provide that as an option in dsq.

Not sure, but it's Apache licensed, so you likely can make it work if you want to. But realize that clickhouse(-local) is much heavier than SQLite/DuckDB-based solutions: the compiled binary is around 200 MB IIRC.


Ah yeah that's pretty large.


TextQL is a similar thing for CSV/TSV:

https://github.com/dinedal/textql/

Also SQLite itself can query CSV files IIRC.
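
e.g. via the sqlite3 shell's .import (a rough sketch from memory; file and table names made up):

    sqlite3 -cmd '.mode csv' -cmd '.import people.csv people' :memory: \
      'SELECT name, count(*) FROM people GROUP BY name'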


If you’re curious how they’ve adapted SQL to query JSON, here is the trick: The JSON “Must be an array of objects. Nested object fields are ignored”.


Yep, Excel and ODS files are even more restricted for the moment. I'd like to lift these restrictions eventually!


Just for clarification: does one need DataStation in order for dsq to run? So dsq is not self-contained? (I'm not familiar with the Go ecosystem.)


Good question! No, you don't need to install DataStation to run dsq; dsq just imports Go libraries from DataStation at compile time.

dsq is a totally standalone binary.


Nice! I also built a similar app but it is an offline desktop GUI app, not CLI.

https://superintendent.app

It seems a few people here have built similar apps...


Yep! This CLI is actually based on libraries within a GUI app, as mentioned in the article.


See also "lnav" -- https://lnav.org -- for an awesome CLI powertool with embeded SQLite.


Thank you for this, it's a tool I didn't realize I was missing.


Wow this looks awesome, thanks!


I dream of a database which would use CSV (with metadata/indices in CSV comment lines) as its storage. You can even use commented-out padding to fit data to blocks, etc.

Imagine not having to dump/convert data because you can always open it with OpenOffice.


recfiles [1] are a sort of poor man's plaintext database that you can edit with any text editor. I found manually entering the data mildly annoying and repetitive, but visidata [2] supports the format, so I've been meaning to learn to use that for easier data entry and for tabular visualization.

[1] https://www.gnu.org/software/recutils/manual/recutils.html#I...
[2] https://www.visidata.org/

Bonus: A nice concise intro that someone wrote last week: https://chrismanbrown.gitlab.io/28.html
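
A tiny example of the format and recsel, the recutils query tool (a sketch; file and field names made up):

    printf 'Name: Alice\nAge: 42\n\nName: Bob\nAge: 30\n' > people.rec
    recsel -e "Age > 35" -p Name people.rec    # -e filters records, -p picks fields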


In PostgreSQL you can use file_fdw extension:

https://www.postgresql.org/docs/current/file-fdw.html
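
The setup looks roughly like this (database, server, table, and path names made up):

    psql mydb <<'SQL'
    CREATE EXTENSION IF NOT EXISTS file_fdw;
    CREATE SERVER csv_srv FOREIGN DATA WRAPPER file_fdw;
    CREATE FOREIGN TABLE people (id int, name text)
      SERVER csv_srv
      OPTIONS (filename '/data/people.csv', format 'csv', header 'true');
    SELECT name FROM people LIMIT 5;
    SQL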


I think the parent comment was about reading and writing csv files. The documentation you linked says "Access to data files is currently read-only."


I missed that. You are right.


What I've done in the past is:

Column 1 is the creation timestamp

Column 2 is the modification timestamp

On read, you update if the creation timestamp exists but the modification timestamp is later, otherwise you insert. Your app(s) can do a simple file append for writes. You even have the full version history at every point in time.

I did a lot of looking but didn't find a command-line tool that automated this process. It works fine for small projects of e.g. 100,000 records. Wouldn't work well for things like a notes app, because you'd be storing every modification as a new entry.
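
A rough sketch of the read-side dedup with coreutils (assuming ISO timestamps and no quoted commas; column 1 = created, column 2 = modified):

    # newest modification first within each creation key, then keep the first row per key
    sort -t, -k1,1 -k2,2r people.csv | awk -F, '!seen[$1]++'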


That's interesting. I'm trying to imagine your workflow, and thinking about what serverless SQL platforms like Amazon Athena let you do now - i.e., you can more or less dump CSV files in blob storage and query them. Is that what you meant?


AWS Athena can do this. Dump your CSVs in S3 and query them in Athena.

Personally I use sqlite for smaller datasets and wrap that around a CSV importer.


Why dump CSVs when you can outright store in them?


Sorry, I don’t understand your question. But just in case this answers it:

S3 is cloud storage on AWS. Athena can work directly off the CSVs stored on S3.

Where I said “dump” it was just a colourful way of saying “transfer your files to…”. I appreciate “dump” can also mean different things with databases, so maybe that wasn’t the best choice of word on my part. Sorry for any confusion there.


Dump is the proper term in this case, though, given S3’s limitations (i.e. no appending to a file), which mean you either create a new file for each insert (very expensive) or append by replacing the whole file on each update. So practically it’s a workable read-only replica fed by dumps of updates. For reasonably small datasets it can potentially work; otherwise you’re looking at rebalancing, partitioning, etc., and you’re probably better off with Parquet, Avro, etc., since by then you’re usually at the stage of introducing Spark.


I'm not sure I share that dream...

CSV is so horribly non-standardized and horrible to parse. JSON seems like a much more suitable candidate.


PrestoDB. Or one of the many SaaS offerings like Athena, BigQuery or USQL.


This looks incredible! Definitely adding it to my toolbelt. Thanks!


LINQ went the other way - a language for queries against (among other things) relational databases, different from SQL.

I wonder what, say, jq would look like working with tables in an RDBMS.



