Hacker News new | past | comments | ask | show | jobs | submit login
XSV – A fast CSV toolkit in Rust (github.com/burntsushi)
101 points by mseri on Feb 22, 2015 | hide | past | favorite | 47 comments



Author here. I was really hoping to get binaries for Windows/Mac/Linux available before sharing it with others, but clearly I snoozed. I do have them available for Linux though, so you don't have to install Rust in order to try xsv: https://github.com/BurntSushi/xsv/releases

Otherwise, you could try using rustle[1], which should install `xsv` in one command (but it downloads Rust and compiles everything for you).

While I have your attention, if I had to pick one of the cooler features of xsv, I'd tell you about `xsv index`. Its a command that creates a very simple index that permits random access to your CSV data. This makes a lot of operations pretty fast. For example:

    xsv index worldcitiespop.csv  # ~1.5s for 145MB
    xsv slice -i 500000 worldcitiespop.csv | xsv table  # instant, plus elastic tab stops for good measure
That second command doesn't have to chug through the first 499,999 records to get the 500,000th record.

This can make other commands faster too, like random sampling and statistic gathering. (Parallelism is used when possible!)

Finally, have you ever seen a CLI app QuickCheck'd? Yes. It's awesome! :-) https://github.com/BurntSushi/xsv/blob/master/tests/test_sor...

[1] - https://github.com/brson/rustle


I'm looking forward to playing with cool languages like Rust, Nim, and Elm. But when I read stuff like this I remember why I love using Go every day. Generating binaries for multiple platforms is braindead easy, as is building from source on any system with Go installed.

That aside, really great work OP! I quite like the CSV format and had 2 ideas based on my experience with it that I'd love to get an opinion on:

1. markdown compiler plugin to expand ![title](filename.csv)

2. barebones, imgur-like website for quick CSV file[s] upload, maybe also a public gallery to showcase interesting data (obviously all uploads marked public/unlisted/private)


> But when I read stuff like this I remember why I love using Go every day.

Me too! I wrote a window manager in Go[1] that I've been using for years now. I love that it takes <30 seconds to download and compile the whole thing. No C dependencies (compile or runtime) at all.

With that said, doing it with Rust should be almost as easy. There's no `cargo install` command, but I think it's only a matter of time. :-)

Your ideas seem cool, by the way! Sharing CSV data would be especially nice.

[1] - https://github.com/BurntSushi/wingo


Thanks for your support, I'll start working on the idea then. I just checked your Gh repos, your productivity is astonishing. Chapeaux! Also, cute handle.

I suspect the problem with CSV will be working around the myriad of broken implementations and making sense of malformed data.


> I suspect the problem with CSV will be working around the myriad of broken implementations and making sense of malformed data.

Indeed. My CSV parser (and Python's) is pretty interesting in that regard. There are very few things that actually cause a parse error. You can see here[1] that the only two errors occur if there are unequal length records (which can be disabled by enabling the "flexible" option) and invalid UTF-8 data (which can be avoiding by reading everything into plain byte strings). That means that any arbitrary data gets parsed into something. There are various mechanisms in the CSV parser's state machine that make decisions for you. Mostly, I used the same types of decisions that Python makes. For example:

    >>> import csv
    >>> from StringIO import StringIO
    >>> list(csv.reader(StringIO('a, "b,c')))
    [['a', ' "b', 'c']]
    >>> list(csv.reader(StringIO('a,"b,c')))
    [['a', 'b,c']]
Whaaaa? Yeah, if our CSV parsers were conformant with the spec, then both of these examples should fail. But they succeed and result in slightly different interpretations based on whether a space character precedes the quote. Therefore, "good" CSV parsers tend to implement a superset of RFC 4180 when parsing, but usually implement it strictly when writing.

(My CSV parser ends up with the same parse as Python here, because it seemed like a good decision to follow its lead since it is used ubiquitously.)

[1] - http://burntsushi.net/rustdoc/csv/enum.ParseErrorKind.html


> There's no `cargo install` command, but I think it's only a matter of time. :-)

https://github.com/rust-lang/cargo/pull/1318


I have never used Rust before, but I was able to install it and compile XSV on OS X in about 5 minutes. So maybe cross-compilation isn't straightforward yet, but building from source definitely is.


FWIW one of the reasons that I have been heavily looking at Nim for the last few weeks is its ease of generating static binaries.


Sorry for not looking into it myself, but what is the format of your index file (I'm assuming the index is stored as a file somewhere). Could other tools read from that too?


It's extremely simple. Even if you don't know Rust, this code should be readable: https://github.com/BurntSushi/rust-csv/blob/master/src/index...

Basically, it writes a 64 bit byte offset (big endian) for the start of each record. The very last byte is then the total number of records. (I forget why I'm doing this, since the count should be computable from the size of the index file.) The "start of each record" means "if you point a CSV parser at this byte offset, then it will successfully read a record."

So when you want to read record `i`, seek the index to byte offset `i * 8`, read 8 bytes (big endian) into a 64 bit integer, then seek to that byte offset in your CSV data.

If the CSV data changes, then you're in trouble. One easy way to handle it (which isn't bulletproof) is to check the last modified times of the index and the CSV file and regenerate the index (or refuse to use it with an error).


For an append to a CSV file it shouldn't be too hard to update the index. Remove the last byte in your current format and just parse the additional lines and append to your index.

And I'm sure there's a technique out there to handle the update case. I'm not too familiar with hard disk operations so I wouldn't know it.

If you insert a byte at position 2 in a 1GB file will the whole file get rewritten? Or are there techniques out there to store updates to a char array without having to rewrite the whole thing?


I'm pretty sure you're barking up a long and complex tree. The first problem (for xsv) is that it doesn't control the csv data. So it doesn't know what changes were made. It could have even become an invalid csv file.

The second problem is that insertions in sorted sequences are hard and are one of the fundamental problems in a database implementation: http://en.wikipedia.org/wiki/B-tree#The_database_problem


Agreed. It's complex but a common situation so I knew there'd be a lot of well tested approaches.


Here's another suggestion for the criticism section (which is a good idea for any open-minded project to include):

Instead of using a separate set of tools to work with CSV data, use an adapter to allow existing tools to work around CSV's quirky quoting methods.

csvquote (https://github.com/dbro/csvquote) enables the regular UNIX command line text toolset (like cut, wc, awk, etc.) to work properly with CSV data.


That's a wicked cool tool! Thank you for sharing.

I do think there is room for both tools though. One of the cooler things I did with `xsv` was implement a very basic form of indexing. It's just a sequence of byte offsets where records start in some CSV data. Once you have that, you can do things like process the data in parallel or slice records in CSV instantly regardless of where those records occur.

It helps when the CSV parser has support for this: http://burntsushi.net/rustdoc/csv/struct.Reader.html#method....


From the "criticisms" section: You shouldn't be working with CSV data because CSV is a terrible format.

Er, what's wrong with it? Or is this a case of, people using it for things other than what it's meant for? Is there a better format for sending data between different companies using different enterprisey database systems?

My complaint about csv is that people frequently generate it manually and don't understand how to quote text fields, so they don't double any quote characters that are part of the data. Which means I have to spend time cleaning up malformed files.


> what's wrong with [CSV]?

The encoding, line endings, and the used separators are random. Encoding/decoding behavior may be locale-dependent. E.g. if you use Excel with a de-DE locale to create a CSV file, Excel with a en-US locale won't open it. The former uses ';' as separators and the latter expects ','.

Since the format is so (seemingly) simple, there are millions of ad hoc implementations out there. Naturally, most of them are horribly broken.

Fun fact: ASCII actually reserved control characters for this stuff. 1F is the "unit separator" and 1E is the "record separator". There is even a "group separator" (1D) and a "file separator" (1C).

http://en.wikipedia.org/wiki/ASCII#ASCII_control_code_chart


>Fun fact: ASCII actually reserved control characters for this stuff. 1F is the "unit separator" and 1E is the "record separator". There is even a "group separator" (1D) and a "file separator" (1C).

Which means you can't safely use it for arbitrary data (since your records themselves could contain these separators). Most of the time that doesn't matter; sometimes it does.


You can't use these control characters in CSV either. The grammar from RFC 4180 excludes the 0x00 to 0x1F range.


We need the equivalent of CommonMark. CommonCSV anyone?


We have that already, RFC 4180: https://tools.ietf.org/html/rfc4180

The problem is adoption. Most CSV parsers accept a superset of RFC 4180, because the only thing worse than not adhering to a good spec is tripping over CSV data in the wild.


RFC 4180 does not specify the character encoding.

This isn't a problem with JSON, XML, ASN.1, Protobufs, or Thrift. Anything other than CSV, really.


Indeed it does, although not directly. From the RFC:

> TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

So, it is a subset of ASCII whose hex codes must lie in the ranges given above. Basically, any printable ASCII character with the exception of 0x22 (= double quote) and 0x2C (=comma) which serve other purposes.


From "3. MIME Type Registration of text/csv":

  Common usage of CSV is US-ASCII, but other character sets defined
  by IANA for the "text" tree may be used in conjunction with the
  "charset" parameter.
Basically, you can specify that it's UTF-8 or whatever via the Content-Type response header. Setting this response header isn't something an encoder could do. A CSV parsing library also shouldn't be responsible for loading the CSV file via HTTP. In most cases, CSV files aren't directly fetched via HTTP anyways.

Anyhow, ASCII is utter nonsense. We aren't living in an ASCII world. Even the Americans don't do that. An ASCII-only format is completely useless.


I think csv is great. The problem is, as you said, that people don't use dedicated csv parsers, and create utter crap.

The other problem is Microsoft Excel. It is installed on almost any Windows machine, and automatically mapped to *.csv file endings. The first I do on Windows machines is to map _any_ text based file format to notepad++.

Excel destroys any csv file upon saving. It can't handle different decimal point formats. File separation characters are mappes to tabs. Multiline text is destroyed (not sure if true anymore, in Office 2007 it was). And don't even get me started about dates.


>Excel destroys any csv file upon saving.

This isn't exactly true, although I understand why people think it is. The problem is that when you "open" a csv file in Excel it guesses what type the fields are and sets that datatype for the column. This will frequently mangle the data when written back to a file as you noted.

However you can "import" the csv file using the wizard and that allows you to set all of the options, including datatype. Setting this to 'text' seems to prevent Excel from mangling the data.

I tend to use a combination of CSVKit, which seems plenty fast for my usage, and Excel when dealing with csv files.


Maybe there is some hidden wizard (although I've never seen it) but the fact remains that the default behaviour on a Windows machine with Office, is when you double click a CSV Excel opens it, and when you hit save Excel mangles it.

Sure, you can go through and fix every single column one by one to fix it, but defaults matter. And in this case the default Excel behaviour is horrifyingly bad.


I haven't used Excel for this in a while but as I recall the "hidden wizard" is basically file -> import from Excel itself. That said I agree that the double click behavior is pretty rubbish.


"CSV" is bad because it's not well-formed. There are tons of "CSV" parsers in the wild, and they all make reasonable, but different, choices when it comes to some behaviors. INI is the same way.


Exactly.

People keep linking to the spec like that matters. The reality on the ground is that CSV has tons of different incompatible implementations or just simply terrible parsers. Even if you produce spec-correct CSV it won't work with a lot of third party's libraries or parsers.

Just ask anyone who works with CSV and lots of companies about double quotes around fields, I assure you they'll have a flag in their system to: include them, exclude them, and include them in records which contain commas only.

Plus CSV and internationalisation is a giant cluster fuck in its own right. Ever seen a U+FF0C or other comma[0] in a CSV? Yeah that is fun.

[0] http://en.wikipedia.org/wiki/Comma#Comma_variants



Existence of an RFC doesn't mean that people or tooling conform to it.

The CSV RFC doesn't specify an IETF standard, by the way.


Doesn't matter. Nobody follows it :/


i think JSONH [1] is a great replacement for csv. it's not as human-readable, but since no editors support elastic tabstops [2] by default, most csvs aren't readable anyhow.

[1] https://github.com/WebReflection/JSONH

[2] http://nickgravgaard.com/elastictabstops/


Ironically enough, I've implemented elastic tabstops in Rust: https://github.com/BurntSushi/tabwriter and they are used in xsv! (See `xsv table`.)


JSONH couldn't replace CSV in the one place where it actually makes sense to use CSV: very large files.

It's possible to process a CSV file of unbounded size with a bounded amount of memory because you can just read it line-by-line. With JSON, you need to read the whole file into memory before you can get at the data.


I'm not familiar with JSONH, but with JSON, you can cheat by encoding a single "record" via JSON on a single line. :-) That way, you can read one line at a time:

    {"field": "some\\ntext with new lines"}
    {"field": "more text"}


well, there's nothing stopping you from writing a streaming json parser. that can read <x> fields at a time. you can format your input json to have whatever whitespace you need, including line breaks. the assumption is that the json will be encoded and not hand-written anyways, so the validity of the entire file isnt important since it will barf at record-level if malformed, similar to csv.

so rather than:

    [2,"col_1","col_2","a","b","c","d"]
you can write:

    [2,"col_1","col_2",
        "a","b",
        "c","d"
    ]
and it will still be valid JSON/H and line-by-line parsed. or as 'burntsushi suggested, you can wrap each record in an array to pass through a normal json parser, though this wouldn't be JSONH any longer:

    [
        ["col_1","col_2"],
        ["a","b"],
        ["c","d"]
    ]


Sure, you can write a custom parser to handle this, but then you're no longer working with JSON/JSONH. It's a subset that's nearly identical, but with an extra constraint imposed on the serialization structure. (We'll call it JSON`)

It sounds like I'm being pedantic, but if you're not your sole producer and consumer of the data you'll inevitably have to deal with someone giving you valid JSON that isn't valid JSON`.


all of the above are valid json. especially the last simple variation where you literally just have a 2-level nested array.


Err, I'm not sure if I understand [2] correctly but Emacs's csv-align-fields displays CSVs pretty nicely: https://cxwangyi.wordpress.com/2014/05/09/editing-csv-files-...


Did you try benchmarking against kdb+?

Seems like there are always HN commenters lambasting CSV. I am sure they have very good reasons.

But, as for me, CSV is one of my favorite formats. (Sort of like how people like XML or JSON I guess.) I like the limitations of CSV because I like simple, raw data.

I wish the de facto format that www servers delivered was CSV instead of HTML (for reason why, see below). Or at least I wish there was an option to receive pages in CSV in addition to HTML.

Users could create their own markup, client side. Users could effectively use their "spreadsheet software" to read the information on the www. Or they could create infinitely creative presentations of data for themselves or others using HTML5 or some other tool of personal expression.

It is easy to create HTML from CSV but I find it is a nuisance creating CSV from HTML.

Because I have a need for CSV I write scanners with flex to convert HTML to CSV.

I often wonder why I cannot access all the data I need from the www in CSV format. Many have agreed over the years that the www needs more structure to be more valuable as a data source. If data is first created in CSV, then you have some inherent structure to build on; you can _use it_ to create markup and add infinite creativity without destroying the underlying structure.

If data (cf. art or forms of personal expression) cannot be presented in CSV then is it really raw data or is it something else, more subjective and unwieldy?

Whatever. Back to reality. Pay no mind.


> Did you try benchmarking against kdb+?

xsv is never ever never going to compete with a real database. Full stop.

It's just a command line tool that tries to make some things faster when slicing and dicing CSV data.


If you need to do an indexing step anyways, why not simply import the data into a SQL database, or build this as a wrapper that introspects the CSV file, builds a database schema, and does the import for you? Is the issue limited scratch space?


See the "Motivation" section: https://github.com/BurntSushi/xsv#motivation

There's a line somewhere between "conveniently play with large CSV data on the CLI" and "the full power of a RDBMS." It's blurry and we won't all agree on where it lays, but it certainly exists for me. (And based on feedback, it exists for lots of others too.)

Also, there are already tools that look at a CSV file and figure out a schema. No need to replicate that.

Finally, the indexing step is blindingly fast and only uses `N * 8` bytes, where `N` is the number of records.


Looks like it's based on this CSV parser:

https://github.com/BurntSushi/rust-csv

and it claims to be RFC4180-compliant, which is a good thing.


This is one of the things I really love about PowerShell. Import, manipulation and export of formatted raw data like CSV is dead simple.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: