sqlite has become my preferred way to work with csv files in an ad hoc manner. I do it often enough that I wrote a small wrapper around sqlite to make it easier to work with multiple csv files with overlapping column names, and to normalize column names: https://github.com/psanford/csv2sqlite
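For context, the manual version of what the wrapper automates looks roughly like this in the plain sqlite3 shell (a sketch; people.csv and its columns are made up):

  sqlite3 -cmd '.mode csv' -cmd '.import people.csv people' :memory: \
    'SELECT city, count(*) FROM people GROUP BY city ORDER BY count(*) DESC'

The wrapper's value is doing that across several files at once and normalizing the column names for you.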
If you want the best of both worlds, you can also give bigbash [1] a try. It takes a SQL query and generates a bash statement, built from standard unix tools, that can be run on csv files.
Expressing an 'or', or adding any real logic to the condition, quickly becomes annoying with the usual *nix tools, and most people use SQL much more often than sed and awk, so they keep forgetting how to use the latter.
If always splitting on ',' is correct for your data, then your solution is probably fine. You probably want '^foo,' though, otherwise yours also matches rows where 'foo' is just a prefix of the first column value.
But CSV permits quoted fields that contain literal ','. So in that case, splitting on ',' would be incorrect.
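To make the failure mode concrete (hypothetical data): something like

  awk -F, '$1 == "foo" { print $2 }' data.csv

prints "Austin instead of Austin, TX for a row like

  foo,"Austin, TX",2019-01-02

because the quoted field contains the delimiter. A real csv parser (or SQL over an imported table) handles that case.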
And yet another set of tools is vnlog (https://github.com/dkogan/vnlog). It assumes you have whitespace-separated data with a legend comment; then you can do the above thing with
I don't know anything about R and I couldn't get this to work. First, I had to install R. Then it told me that 'data.table' was unrecognized. So I searched around and installed that. Then I noticed that 'data.csv' was not quoted, so I fixed that. Then I tried running it on some data I have:
> data.table::fread("version_downloads.csv")[date == "foo", .(1,3)]
Error: character string is not in a standard unambiguous format
Execution halted
I guess that's nifty, but it makes it much less universal. It's also not clear to me why reading everything as a string would be slower. If anything, it seems to me like it would be faster, since you wouldn't need to do type guessing.
Full disclosure, I'm the author of 'xsv' and the thing I was most interested in here was performance.
That's not helpful. CSV data is what it is. If you're advertising an alternative for processing CSV data but it chokes on valid CSV, then that's kind of a show-stopper IMO.
And I have no idea which string to fix or how to even fix it. The failure mode here is not great.
If you like SQL interfaces then please also check out Steampipe (https://steampipe.io).
Use SQL to instantly query cloud resources like AWS, GitHub, Yahoo Finance, etc (no CSV yet). It's written in Go and uses Postgres Foreign Data Wrappers (similar to SQLite virtual tables).
Disclaimer: It's open source. I'm a lead on the project.
This is really cool. I don't understand something about the architecture though -- you mention it relies on Postgres FDW. When you install Steampipe are you installing Postgres locally and using that, or are you using a remote postgres server?
Steampipe automatically installs and manages an embedded Postgres instance for you. It's shocking how lightweight that is - another way in which Postgres is awesome.
Running "steampipe query" will start and run embedded Postgres for the duration of the session. Using "steampipe service start" will run it in the background for access from your favorite 3rd party SQL tool.
> The current implementation is written in Python using an in-memory database, in order to prevent the need for external dependencies.
I'm confused though that, according to its Limitations[0] section, it doesn't support CTEs or `SELECT * FROM <subquery>`, when both of those things are supported by SQLite itself?
I think I've seen several similar projects to this one. It always makes me wonder: how are these different from sqlite virtual tables (https://www.sqlite.org/csv.html)? And why would I want to use this over sqlite? Perhaps some API in python? But then again, I could just run a sqlite script, dump the results into a temp csv, and read them back in any other language of my choice.
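For reference, the virtual-table route looks roughly like this inside the sqlite3 shell (a sketch that assumes you've built the csv extension from the csv.c on that page):

  .load ./csv
  CREATE VIRTUAL TABLE temp.t1 USING csv(filename='thefile.csv');
  SELECT * FROM t1;

which gives you basically the same "query a csv with SQL" experience, just without the convenience wrapper.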
I think the value here probably depends a lot on what it's doing under the hood. Like, if this is just a wrapper on importing to sqlite, building up an index, and running a query, then maybe there aren't a lot of use-cases.
But maybe it does something interesting with the fact that it can receive data in a stream?
If I understand it correctly, the main value is automating the SQLite import and datatype assignment (vs. type affinity).
Combined with the ability to save the resulting db, that also makes this tool a useful import wizard. You just need to pay attention to the csv headers and backtick-quote column names that contain spaces (something that is not very common in csv).
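E.g. (made-up column name):

  SELECT `first name`, count(*) FROM data GROUP BY `first name`;

(SQLite itself also accepts standard double-quoted identifiers.)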
I agree; picking a single-letter name for the utility should rather be left to the user's own alias, if needed.
A more descriptive name would fit better into the already crowded namespace.
Another great tool for working with csv files on the command line is the excellent VisiData - https://www.visidata.org/
It lets you explore/browse/search all your csv/json/etc files super quickly and easily in an interactive command line tool, sort of like the old Norton Commander.
That's excellent! Thanks for posting. I've seen most of the tools in this area, and visidata is a new one to me. And it actually does lots of things the others do not.
On Windows, you can also just create an ODBC connection to any CSV file using the Text Driver, and use any ODBC-compatible toolchain you have available.
The major difference is seemingly that one is a database installation and the other is just a command line tool. What if I don’t have a Postgres DB, but I have some tabular data in a csv I want to query? Should I set up a database to query a file?
I’m not sure why I would set up a database to run a query on a file when I can do it from the command line.
Well, running postgres doesn't really need to be much more than `docker run --rm -v $PWD:/data -e POSTGRES_HOST_AUTH_METHOD=trust postgres`. While docker might have its downsides, it is convenient for this sort of ad-hoc stuff.
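Once the container is up, the csv round trip is only a few commands (a sketch; it assumes the container was started in the background with --name pg and -d added, and the file/column names are made up):

  docker exec pg psql -U postgres -c "create table cities (city text, pop int)"
  docker exec pg psql -U postgres -c "\copy cities from '/data/cities.csv' with (format csv, header)"
  docker exec pg psql -U postgres -c "select city, pop from cities order by pop desc limit 10"

Still more ceremony than a single-binary CLI tool, but it cleans itself up thanks to --rm.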
Looking at the comments I see there are more and more such projects, that's great! Declarative interfaces rule!
Anyways, I too felt a void to be filled by tools like these, so I (and a couple of other folks) developed OctoSQL[0]; check it out if you like this. It lets you query json, csv, and various databases using SQL, and also lets you join between them.
It differs from most of the other tools in that it also supports streaming data sources using temporal SQL extensions. (Inspired by the great paper, One SQL To Rule Them All[1])
q is awesome. Initially I wasn't sure what the underlying engine was, so I was amazed at how good the SQL support was. Turns out it feeds all the data into sqlite under the covers, so you can do some pretty complex queries against arbitrary csv files in a one-liner.
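E.g. something like this (hypothetical file; -H says the input has a header row, -d sets the delimiter):

  q -H -d , "SELECT country, count(*) FROM ./cities.csv GROUP BY country ORDER BY count(*) DESC"

and since it's sqlite underneath, joins and the usual SQL functions come along for free.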
Shameless plug, I also built a tool in Rust to provide SQL/GraphQL query access to CSV and many other tabular file formats: https://github.com/roapi/roapi.
ClickHouse includes a tool (clickhouse-local) that lets you run SQL queries over files/data in any supported format, including TSV [1].
E.g., you can use this command to see which unix users use the most memory (RSS):
ps aux | awk 'NR > 1 { print $1"\t"$6 }' | clickhouse-local -S "user String, mem UInt64" -q "SELECT user, formatReadableSize(sum(mem)*1024) as mem_total FROM table GROUP BY user ORDER BY sum(mem) DESC LIMIT 20 FORMAT PrettyCompact"
It is pretty fast - I generated 7.7G of test data for this query, and it took 23 seconds to run the query above (using multiple threads). For comparison, wc -l on this file takes 9 seconds.
Here's a similar project I've been working on in my spare time: https://github.com/mslusarz/csv-nix-tools.
It supports both SQL syntax (SQLite and native) and Unix-like processing (grep, sort, uniq, etc).
It also includes source (ls, ps, etc) and exporting (exec, to-json, plot) tools.
Many SQL tools that use JDBC have support for CSV, TSV and other table-in-a-file data stores (e.g. DBF).
I use DBeaver for nearly everything SQL. It is also open-source.
Create a New Database connection, choose CSV, choose the folder with the CSV files, and you will get a database connection to a "database" where each CSV file is a table.
Looks useful and well done, but the choice of name is unfortunate, since it clashes hard with the Q language and the q-sql functions it implements (and also given that some Q users would definitely find this interesting).
A lot of folks are mentioning great projects/solutions that do the same thing, but I love me some good unix piping action:
(xsv: also invaluable on the terminal: https://github.com/BurntSushi/xsv)
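A typical xsv pipeline, for the record (a sketch with made-up file and column names):

  xsv search -s city Portland people.csv | xsv select name,age | xsv table

i.e. filter rows where the city column matches, project two columns, and pretty-print the result.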