List of command line tools for manipulating CSV, XML, HTML, JSON, INI, etc.

chrisweekly · on April 8, 2018

[lnav](https://www.lnav.org) is a terrific little tool, a "mini-ETL" of sorts with an embedded SQLite client and a clean, powerful interface. Its sweet spot is logfiles, but given regex-based custom formats, works great with any semi-structured input. Lnav easily handles a few million rows at a time. IME it pairs really really well with eg mitmproxy/mitmdump for client request logs, as well as webserver logs.

hoistbypetard · on April 8, 2018

Thanks for linking that. It's going to make my life easier this week, and I had not heard of it. I was weighing setting up something like Graylog for some troubleshooting and kind of dreading it. lnav looks like a perfect middle-ground between that and my wiki page full of grep commands.

dancek · on April 8, 2018

This looks like a great resource. The tools you'd like to have for a specific problem are often quite un-googlable. So you either need complex hacks to get inferior tools to work or you spend an hour googling the tools for a tiny problem.

Of course, it would be even better if you could easily tell which of the dozen JSON query tools is the best choice for the task at hand, or which you should code if you only want to ever use one of them.

In fact I'd love if someone would like to share their set of tried-and-true tools. Personally I mostly go with the POSIX tools, plus jq or gawk on occasion (but I have to read their docs every single time...).

tannhaeuser · on April 8, 2018

Nit: awk is a POSIX tool, and has multiple implementations that you've probably used (Debian/Ubuntu comes with mawk, and Mac OS with nawk).

[1]: http://pubs.opengroup.org/onlinepubs/009695399/utilities/awk...

dancek · on April 8, 2018

To nitpick even more, gawk (GNU Awk) is a superset of POSIX Awk. I'm not very familiar with the differences, but I always use specifically gawk---I got too annoyed with some of the BSD userland that ships with macOS, and learned to prefer GNU versions.

jwilk · on April 8, 2018

Besides mawk and gawk, Debian ships also the original awk:

https://www.cs.princeton.edu/~bwk/btl.mirror/

tannhaeuser · on April 8, 2018

Yes, and it works well on Debian. But I can't recommend switching to gawk on Ubuntu.

[1]: https://askubuntu.com/questions/1011414/gawk-is-crashing-for...

Lio · on April 8, 2018

This is great.

One thing I could suggest for the XML list is xmllint. It can be really useful for converting xml to canonical format so you can then use diff to compare it.

E.g. something like diff <(xmllint —c14 first.xml) <(xmllint —c14 second.xml)

I’d love to heat about more command line SOAP tools if anyone can recommend some.

arundelo · on April 8, 2018

I'll look into xmllint. I currently use HTML Tidy for this:

  tidy -xml -indent -wrap 0

or

  tidy -xml -indent -wrap 0 -quiet

IvarTJ · on April 8, 2018

xmllint also supports XPath queries.

geocar · on April 8, 2018

kdb+/q is another really good choice for dsv[1] and json[2]. You can certainly create single-file databases (if you really want to e.g. for exchange), but splayed table[3] is faster so you'd usually do that.

[1]: http://code.kx.com/q/ref/filenumbers/#load-csv

[2]: http://code.kx.com/q/ref/dotj/

[3]: http://code.kx.com/q/cookbook/splayed-tables/

dancek · on April 8, 2018

The problem with that might be the licensing costs once you use it commercially (eg. at work). IIRC the license prices aren't public, but you're looking at over $10k in any case.

I personally prefer J to K in the APL family of languages. They also have a relatively cheap database, Jd [1]. Individual licenses are $600. Still a bit too much for my data mangling needs. :)

[1] http://code.jsoftware.com/wiki/Jd/Index

geocar · on April 8, 2018

$10k isn't a lot (assuming that's right; it could be). I mean, it's a lot if you're used to something like MySQL or Postgres-levels of quality, but I've seen quotes for Oracle being almost $50k per core. MS-SQL is something like $7k per core, and kdb+ is definitely a lot more useful to me than MS-SQL.

There's also a per-core/minute pricing which might be useful.

dancek · on April 8, 2018

Sure, kdb+ would probably be worth every penny even at $100k/year when it's the right tool for the job. I gather it's genuinely the best in-memory database for computing arrays of varying rank.

But a lot of the use cases these other tools are good for are small tasks every now and then. I feel kdb+ is in a different category.

aplorbust · on April 8, 2018

Anecdote: I frequently use kdb+ for small tasks. For me, its in the "all-purpose" category. The limitations are only in the ability I have to use it.

For example, removing nonconsecutive, duplicate lines from a file, such as a CSV file:

   exec echo "k).Q.fs[l:0::\`:$1];l:?:l;\`:$1 0:l"|exec q >&2;

where Q.fs is a function in a script thats bundled with the interpreter; the chunk size for reading the file into memory is adjustable by editing the function.

geocar · on April 8, 2018

You can make it simpler:

    l:0;.Q.fs[{if[x~l;:];-1 l::x}each]`:input

or if you have memory:

    -1 distinct read0`:input

...or if you want to use k:

    -1@?0:`:input

aplorbust · on April 8, 2018

Stupid question: With -1, how would I suppress the return value? Use a function?

   k)a:{-1@?0:`:input;};a[]

geocar · on April 8, 2018

It's not a stupid question.

What this does is return generic-null :: which .Q.s doesn't print.

bjoli · on April 8, 2018

I took the time to learn recutils a long time ago, and it has been the gift that keeps on giving

Sure, it is not as fast as many other formats, but on the other hand it integrates very well into Emacs an org-mode. I manage a large part of my different collections using a combination of both, and the Emacs integration means it is all less than 2 seconds away.

jwilk · on April 8, 2018

If you're curious what's "tabular JSON":

https://johnkerl.org/miller/doc/file-formats.html#Tabular_JS...

netol · on April 8, 2018

I don't understand why csvkit is listed in the SQL-based utilities section. csvkit is a suite of multiple command-line tools, including csvcut, csvsort, csvgrep, csvjson, csvstat, csvstack, csvjoin, etc. and multiple converters, so is not only csvsql

netol · on April 8, 2018

Awesome. But a list like this could grow indefinitely. I wrote two CSV utilities a few years back; a data generator (https://github.com/pereorga/csvfaker) and a column randomizer (https://github.com/pereorga/csvshuf)

manaktir · on April 8, 2018

On macOS there's also textutil, a pre-installed utility for working with text in different formats. Manpage: https://developer.apple.com/legacy/library/documentation/Dar...

AdamJacobMuller · on April 8, 2018

I'm very glad to see the 'silly' tools there, cut/join/paste/sort/uniq. While I would never build anything 'important' with them, they're an extremely useful tool to have in your toolkit.

badsectoracula · on April 8, 2018

Why silly? I use those (especially sort and uniq) all the time, both in my scripts and in command line.

joakinen · on April 8, 2018

If it's important, then you should use those POSIX tools

stevoski · on April 8, 2018

Can anyone recommend a command line tool for manipulating Excel files, that runs on macOS?

Edit: I’m looking for a command line tool that allows me to open an Excel file, make a few simple changes, and then save again as an Excel file.

clarkdave · on April 8, 2018

If you don’t mind converting the Excel file to CSV, csvkit[0], which is mentioned in the list, has a tool to pipe Excel into CSV for further processing by its sibling tools.

It won’t help if you need to retain anything Excel specific, but I find it very useful to deal with any Excel files that come my way.

[0] https://csvkit.readthedocs.io/en/1.0.3/

gwn7 · on April 8, 2018

Try visidata.

https://github.com/saulpw/homebrew-vd

reificator · on April 8, 2018

What an unfortunate choice of repository name. I definitely do not want to get Homebrew VD.

geocar · on April 8, 2018

xlsx files are also zipfiles that contain xml, so you might get away with just unzipping them, then using some xml query, then zipping it back up.

chrisweekly · on April 11, 2018

https://github.com/SheetJS/js-xlsx is a solid choice for that kind of thing.

jwdunne · on April 8, 2018

Ruby has a great gem too. I used it to migrate a client's membership data from an excel spreadsheet to a SQL data model.

This sheet was formatted like:

MEMBERS ...rows...

ADMINS ...rows...

EXECUTIVE COMMITTEE ...rows...

You could whip up any command line tool you need with that.

paulsutter · on April 8, 2018

Python is one easy / flexible way to wrangle Excel files

vram22 · on April 8, 2018

Yes, there are multiple Python libraries for wrangling Excel files as well as good built-in CSV support via the stdlib's csv module - which, despite its name, can actually support DSV (Delimited Separated Values), which is a generalization of CSV. The csv module also has a dialects feature with attributes like settable delimiters (which is how you get the DSV support) and quoting. And since Python's built-in data structures like lists, dicts, tuples and sets are great for munging data, you can get a lot done with just that, plus the benefit of Python's readability and productivity.

nol13 · on April 8, 2018

csvfix, prob some overlap, but i've found this one invaluable.

http://csvfix.byethost5.com/csvfix15/csvfix.html

zabzonk · on April 8, 2018

CSVFix author here - please note a better link is https://neilb.bitbucket.io/csvfix/

nol13 · on April 8, 2018

Sweet, ty! Staring at what was starting to look like a much larger python script than I'd anticipated, then realizing I could do it in 16 lines of (very basic) bash with csvfix + csvcut/sed/iconv was big day for me! Some of my fav code never written I think. Actually had most those files copied locally because was afraid the bytehost link would disappear.

That said, link to the manual in the bitbucket link not working.

forinti · on April 8, 2018

Thanks for this link! I frequently have to load CSV files into a database and they are invariably full of errors. People think spitting out CSV is easy, but it's because they don't have to use their product. So every time I write a Perl script and go through various iterations before I find all that's wrong with the file.

zatkin · on April 8, 2018

This is missing comparison tables.

jwilk · on April 8, 2018

What do you want to be compared?