Q – Run SQL Directly on CSV or TSV Files (harelba.github.io)
121 points by thunderbong on June 7, 2021 | 63 comments



Yesss. Glad to see this getting some love. I use this tool very frequently in my day-to-day and my work would be so much harder without it.

A lot of folks mentioning great projects/solutions that work the same but I love me some good unix piping action:

    cat data.csv | xsv select 1,3 | q 'select * from - where col1 != "foo"'
(xsv: also invaluable on the terminal: https://github.com/BurntSushi/xsv)


Is that any different than using sqlite3 directly like this?

    cat data.csv | sqlite3 -csv ':memory:' '.import /dev/stdin data' 'select col1, col3 from data where col1 != "foo"'
You can of course use the .import command directly on the file, but I'm keeping it there to show the piping.


sqlite has become my preferred way to work with csv files in an adhoc manner. I do it often enough that I wrote a small wrapper around sqlite to make it easier to work with multiple csv files with overlapping column names, and to normalize column names: https://github.com/psanford/csv2sqlite


Using sqlite3 for any task makes me so happy, I try to use it even when it's overkill


If you want the best of both worlds, you can also give bigbash [1] a try. From a SQL query it generates a bash statement, built from unix tools, that can be run on csv files.

[1] http://bigbash.it/


Is that better than:

  cut -d, -f1,3 data.csv | grep -v '^foo'
I'm sure there are some things where you need sql/q, but this isn't one of them, and there is a lot of utility in being productive with bare POSIX tools.


Expressing an 'or', or adding any logic to the condition, quickly becomes annoying with the usual *nix tools, and most people use SQL much more often than sed and awk, so they forget how to use the latter.
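
For instance (file and column names made up, just to show the shape of it), the SQL version stays readable as the condition grows:

    q -d ',' -H "select * from data.csv where col1 = 'foo' or (col2 > 10 and col3 like '%bar%')"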


egrep '(a|b)'


If always splitting on ',' is correct for your data, then your solution is probably fine. You probably want '^foo,' though; otherwise yours also filters out rows where 'foo' is just a prefix of the first column value.

But CSV permits quoted fields that contain literal ','. So in that case, splitting on ',' would be incorrect.


This breaks down when you have commas inside your CSV data, e.g. a location stored as the quoted string "Atlanta, GA".
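
A quick way to see it (sample data made up on the spot) - cut splits inside the quotes, while a csv-aware tool like xsv keeps the field intact:

    $ printf 'city,state\n"Atlanta, GA",GA\n' | cut -d, -f1
    city
    "Atlanta
    $ printf 'city,state\n"Atlanta, GA",GA\n' | xsv select city
    city
    "Atlanta, GA"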


And yet another set of tools is vnlog (https://github.com/dkogan/vnlog). It assumes you have whitespace-separated data, with a legend comment, then you can do the above thing with

  < data.vnl vnl-filter -p thiscolumn,thatcolumn 'thiscolumn != "foo"'
If you don't strictly need csv, those tools are quite nice, and have a really friendly learning curve. Disclaimer: I'm the author


Why use xsv in this case (as opposed to SELECT col1, col3)?


Mostly for just demonstrating how it combines with other tools in my workflow over pipes :D


Even shorter with R, and probably a good deal quicker:

r -e "data.table::fread(data.csv)[col1 != "foo", .(1,3)]"


I don't know anything about R and I couldn't get this to work. First, I had to install R. Then it told me that 'data.table' was unrecognized. So I searched around and installed that. Then, 'data.csv' is not quoted. So fixed that. So I tried running it on some data I have:

    > data.table::fread("version_downloads.csv")[date == "foo", .(1,3)]
    Error: character string is not in a standard unambiguous format
    Execution halted
Dunno how to fix this now.


Might be the datatypes. fread automagically tries to determine the datatypes of your columns by jumping around the file as it's being read.

https://www.rdocumentation.org/packages/data.table/versions/...

Or you can pass colClasses = 'character' as an argument to read everything as a string, but that will be much much slower.

fread also reads from stdin, so you can pass bash commands into the first parameter instead of using bash to run R via `R -e`.
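
For what it's worth, something along these lines should sidestep the type guessing (reusing the file and column from the comment above; untested sketch, so double-check the colClasses behaviour against the fread docs):

    # read every column as character so fread doesn't try to guess types
    R -e 'data.table::fread("version_downloads.csv", colClasses = "character")[date != "foo", c(1, 3)]'
    # or feed fread the output of a shell command instead of a file
    R -e 'data.table::fread(cmd = "xsv select 1,3 version_downloads.csv")'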


I guess that's nifty, but it makes it much less universal. It's also not clear to me why reading everything as a string would be slower. If anything, it seems to me like it would be faster, since you wouldn't need to do type guessing.

Full disclosure, I'm the author of 'xsv' and the thing I was most interested in here was performance.


Fix your strings. :)


That's not helpful. CSV data is what it is. If you're advertising an alternative for processing CSV data but it chokes on valid CSV, then that's kind of a show-stopper IMO.

And I have no idea which string to fix or how to even fix it. The failure mode here is not great.


If you like SQL interfaces then please also checkout Steampipe (https://steampipe.io).

Use SQL to instantly query cloud resources like AWS, GitHub, Yahoo Finance, etc (no CSV yet). It's written in Go and uses Postgres Foreign Data Wrappers (similar to SQLite virtual tables).

Disclaimer: It's open source. I'm a lead on the project.


This is really cool. I don't understand something about the architecture though -- you mention it relies on Postgres FDW. When you install Steampipe are you installing Postgres locally and using that, or are you using a remote postgres server?


Steampipe automatically installs and manages an embedded Postgres instance for you. It's shocking how lightweight that is - another way in which Postgres is awesome.

Running "steampipe query" will start and run embedded Postgres for the duration of the session. Using "steampipe service start" will run it in the background for access from your favorite 3rd party SQL tool.


This looks awesome. Thanks for sharing.


FWIW, csvkit has similar functionality in its csvsql tool:

https://csvkit.readthedocs.io/en/latest/scripts/csvsql.html

    csvsql --query  \
        "select avg(i.sepal_length) from iris as i join irismeta as m on (i.species = m.species)" \
         examples/iris.csv examples/irismeta.csv
I believe it creates an in-memory SQLite database of the CSV, then executes the query and produces the CSV result.

edit: Looking through q's docs, I see that it also uses Python and SQLite in-memory databases:

http://harelba.github.io/q/#implementation

> The current implementation is written in Python using an in-memory database, in order to prevent the need for external dependencies.

I'm confused, though, that according to its Limitations[0] section it doesn't support CTEs or `SELECT * FROM <subquery>`, when both of those things are supported by SQLite itself.

[0] http://harelba.github.io/q/#limitations


IMO it would be nice to say outright that this uses sqlite under the hood instead of saying it supports "sqlite's syntax" and so on.

It can be very useful, but this is a wrapper that constructs a sqlite db from CSV/TSV (similar to sqlite's '.import') and then runs SQL on it, right?


> [...] this is a wrapper that constructs a sqlite db from CSV/TSV (similar to sqlite's '.import') and then runs SQL on it, right?

It appears so: https://github.com/harelba/q/blob/master/bin/q.py#L267

Regardless, simple utilities that expose SQL interfaces over arbitrary data are still incredibly compelling & useful.


I think I've seen several projects similar to this one. It always makes me wonder: how are these different from sqlite virtual tables (https://www.sqlite.org/csv.html)? And why would I want to use this over sqlite? Perhaps for some API in python? But then again, I could just run a sqlite script, dump the results into a temp csv, and read them back in any other lang of my choice.
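
(For reference, the virtual-table route looks roughly like this, assuming you've compiled the csv.c extension from the SQLite source tree as ./csv.so; for real use you'd also want the header/schema options, which I'd check against the docs:)

    sqlite3 :memory: -cmd '.load ./csv' \
      "CREATE VIRTUAL TABLE t1 USING csv(filename='data.csv');" \
      "SELECT * FROM t1 LIMIT 5;"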


I think the value here probably depends a lot on what it's doing under the hood. Like, if this is just a wrapper on importing to sqlite, building up an index, and running a query, then maybe there aren't a lot of use-cases.

But maybe it does something interesting with the fact that it can receive data in a stream?


There actually is a Python API for doing this: https://pypi.org/project/dataframe-sql/

It doesn't expose a CLI, though.


Where would the world be if we just stopped innovating after the first idea that works?


Nicely thought out set of features!

If I understand it correctly, the main value is automating the SQLite import and datatype assignment (vs. affinity).

Combined with the ability to save the resulting db, this also makes the tool a useful import wizard. One just needs to pay attention to the csv headers and backtick-quote column names which contain spaces (something not very commonly done in csv).
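
For example (made-up file and column names), the MySQL-style backticks that SQLite understands do the trick:

    q -d ',' -H 'select `first name`, age from people.csv where age > 30'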

I agree: picking a single-letter name for the utility should rather be left to the user's own alias choice, if needed.

A more descriptive name would integrate better into the already populous namespace.


You probably can't be arsed to fiddle with it, but the Windows ODBC driver supports access to CSV files via SQL queries.


Since Access has existed, actually.


This is neat. Thanks for sharing.

Another great tool for working with csv files on the command line is the excellent VisiData - https://www.visidata.org/

It lets you explore/browse/search all your csv/json/etc files super quickly and easily in an interactive command line tool, sort of like the old Norton Commander.


That's excellent! Thanks for posting. I've seen most of the tools in this area, and visidata is a new one to me. And it actually does lots of things the others do not.


I've used the similar LogParser[1] on Windows here and there a bit over the years.

[1]: https://en.wikipedia.org/wiki/Logparser

On Windows, you can also just create an ODBC connection to any CSV file using the Text Driver, and use any ODBC-compatible toolchain you have available.


Yes, when using MS, logparser allowed me to magic up some ad hoc reports in minutes.


Not to diss the author or chime in with the umpteenth "this is easy why did you spend time on this"...

A postgres db can query from external files as well. See:

https://www.postgresql.org/docs/13/file-fdw.html
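
Roughly like this, for the record (column layout and paths are made up; file_fdw itself ships with Postgres):

    psql -c "CREATE EXTENSION IF NOT EXISTS file_fdw" \
         -c "CREATE SERVER IF NOT EXISTS csv_files FOREIGN DATA WRAPPER file_fdw" \
         -c "CREATE FOREIGN TABLE IF NOT EXISTS data (col1 text, col2 text, col3 text)
             SERVER csv_files OPTIONS (filename '/data/data.csv', format 'csv', header 'true')" \
         -c "SELECT col1, col3 FROM data WHERE col1 <> 'foo'"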


The major difference is seemingly that one is a database installation and the other is just a command line tool. What if I don’t have a Postgres DB, but I have some tabular data in a csv I want to query? Should I set up a database to query a file?

I’m not sure why I would set up a database to run a query on a file when I can do it from the command line.


Well, running postgres doesn't really need to be much more than `docker run --rm -v $PWD:/data -e POSTGRES_HOST_AUTH_METHOD=trust postgres`. While docker might have its downsides, it is convenient for this sort of ad-hoc stuff.


Looking at the comments I see there are more and more such projects, that's great! Declarative interfaces rule!

Anyways, I too felt there was a void to be filled with tools like these, so I (and a couple of other folks) developed OctoSQL[0] - check it out if you like this. It lets you query json, csv, and various databases using SQL, and also lets you join between them.

It differs from most of the other tools in that it also supports streaming data sources using temporal SQL extensions. (Inspired by the great paper, One SQL To Rule Them All[1])

[0]: https://github.com/cube2222/octosql

[1]: https://arxiv.org/abs/1905.12133


q is awesome. Initially I wasn't sure what the underlying engine was, so I was amazed at how good the SQL support was. Turns out it feeds all the data into sqlite under the covers, so you can do some pretty complex queries against arbitrary csv files in a one-liner.


Shameless plug, I also built a tool in Rust to provide SQL/GraphQL query access to CSV and many other tabular file formats: https://github.com/roapi/roapi.



For completeness' sake, I would like to mention that Apache Drill works with local files and can be used for exactly the same purpose.



ClickHouse includes a tool which allows you to run SQL queries over files/data in any supported format, including TSV [1].

E. g. you can use this command to see which unix users use most memory (RSS):

    ps aux | awk 'NR > 1 { print $1"\t"$6 }' | \
      clickhouse-local -S "user String, mem UInt64" \
        -q "SELECT user, formatReadableSize(sum(mem)*1024) as mem_total FROM table GROUP BY user ORDER BY sum(mem) DESC LIMIT 20 FORMAT PrettyCompact"

It is pretty fast - I generated 7.7G of test data for this query and it took 23 seconds to run the query above (using multiple threads). For comparison, wc -l on this file takes 9 seconds.

[1] https://clickhouse.tech/docs/en/operations/utilities/clickho...


Not to be confused with the also very SQL-like Q language[1] from Kx Systems/Arthur Whitney.

[1]: https://en.wikipedia.org/wiki/Q_(programming_language_from_K...


Indeed they're very similar (including the name and the documentation layout!) so I was very confused for a bit.


The implementation part is actually irrelevant.

The nice part is that it is a CLI, like jq(1).

I did something similar a long time ago with DBD::CSV but lost the code. So I'm glad the idea resurfaced.


When working with large csv files, you're better off importing the file into a sqlite database if you want to perform more than one query.

Knowing the import is happening behind the scenes is useful when you hit performance problems with the tool.
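
i.e. something along these lines (table and column names are placeholders):

    # import once into a persistent db and index the column you filter on
    sqlite3 -csv data.db '.import data.csv data'
    sqlite3 data.db "CREATE INDEX IF NOT EXISTS idx_col1 ON data(col1);"
    # later queries hit the db file, not the csv
    sqlite3 data.db "SELECT col1, col3 FROM data WHERE col1 <> 'foo';"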


Using a local cache might indeed improve things for subsequent queries.

But I managed to "stream" results - when it didn't involve any GROUP BY, of course.

I guess that would spawn a nice side project, to rewrite it in a language to learn it ;-)


Here's a similar project I've been working on in my spare time: https://github.com/mslusarz/csv-nix-tools. It supports both SQL syntax (SQLite and native) and Unix-like processing (grep, sort, uniq, etc). It also includes source (ls, ps, etc) and exporting (exec, to-json, plot) tools.


Many SQL tools that use JDBC have support for CSV, TSV and other table-in-a-file data stores (e.g. DBF).

I use DBeaver for nearly everything SQL. It is also open-source.

Create a New Database connection, choose CSV, choose the folder with the CSV files, and you will get a database connection to a "database" where each CSV file is a table.


I second DBeaver, fantastic tool. AFAIR it can't do joins though (on CSVs), whereas Q in the OP can.


In days gone by, MS Logparser did the trick: https://docs.microsoft.com/en-us/previous-versions/windows/i...
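
If memory serves, the incantation was roughly this (flags recalled from memory, so they may be slightly off):

    LogParser.exe "SELECT col1, col3 FROM data.csv WHERE col1 <> 'foo'" -i:CSV -o:CSV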


Another one: RBQL (https://rbql.org/) - It doesn't use sqlite (or any other database) and provides either SQL/Python syntax or SQL/JavaScript syntax.


Discussion from a couple years ago:

https://news.ycombinator.com/item?id=18453133


I use this to search for transactions in my bank transaction history for taxes. Very useful!


This sounds similar to Amazon Athena, but without the Amazon part. Interesting!


Worst name for a project in 2021.


Looks useful and well done, but the choice of name is unfortunate since it clashes hard with the Q language and the q-sql functions it implements (and also given that some Q users would definitely find this interesting).


Even worse, might confuse us Q followers. ;-)



