CSVjs: Basic CSV parsing and encoding in JavaScript (github.com/knrz)
66 points by knrz on May 24, 2014 | 47 comments



> text.split("\n")

I don't want to be negative, but this will not work on CSV files that contain newlines inside columns. You should also implement escaping quotes; this is very important to ensure that the data can round-trip! Good for a first attempt, but as simple as CSV may look, there are subtle points that you should be aware of if you want to handle most, if not all, CSV out there. I've worked with lots of other implementations that get these little things wrong too, and it is particularly aggravating when CSV is supposed to be a pretty standard interchange format for tabular data. Please refer to RFC 4180 for the details.
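
For example (a made-up record; the embedded newline is valid RFC 4180 because the field is quoted):

    var text = 'name,address\n"Ada","12 Foo St\nBar City"';
    text.split("\n");
    // => ['name,address', '"Ada","12 Foo St', 'Bar City"']
    // Two records become three bogus "rows".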


Yet, that is what most CSV code does. Heck, MS Excel did that until a few versions ago.


MS Excel has supported CSV with newlines in fields for at least a decade.


MS Excel also defaults to using \r as the newline char in the OS X version (despite \r being obsolete in OS X since v10.1). It's true that CSV parsers have been around in various forms for a long time, but 'complete' implementations in JavaScript haven't been available until recently.
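
For what it's worth, a split that at least tolerates all three endings (still naive about quoted fields, as the top comment notes):

    text.split(/\r\n|\r|\n/);  // handles \r\n, \r, and \n line endings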


>Yet, that is what most CSV code does.

Not an excuse, and not really true if we're talking about libraries. Naive one-off implementations of readers, maybe.

>Heck, MS Excel did that until a few versions ago

Nope. Excel, even 14 years ago, had an extensive panel for configuring how your CSV file is structured (encoding, terminator, etc.), and could also handle newlines inside items just fine.


This kind of thing is incredibly useful.

Large frameworks are great, but generally I only adopt them when starting a new project; or the occasional large refactoring project.

Whereas targeted problem-solving libraries like this one get adopted all day long. Next time I need to parse a CSV on the client, you can bet the first thing I'll do is Google to see if there's a library I can download and be done with it.

So, great job solving a targeted problem! Keep building things and keep contributing. People will find code like this endlessly useful.


(Sorry, accidentally hit the downvote button, so commenting to give you the karma back. Hmm, that used to work AFAIK; doesn't seem like it helped.)


I'll live. :)


First anything I've contributed to the community.

As a 16-year-old just getting into software development, it'd be amazing to get feedback from HN.


The fact that you're 16 and getting into software development is cool, but not particularly exceptional. I don't know how much more exceptional it is that you're already into OSS, as that was less of a concept when I was that age...though I'll say, as someone much older than you who is only now slowly contributing to OSS, that you're probably ahead of the curve.

But what impresses me most is that you're only 16 and you care enough about something as "boring", yet bread-and-butter, as CSV parsing. CSV parsing is not particularly sexy, and there are a lot of libs that do it...but tackling it as a problem to solve, never mind open-sourcing your solution, is really cool...even if you decide you don't want to be a full-time software developer, your experience with (and tolerance for) this kind of data-munging is going to make you a valuable person to work with in any field.

The fact that you're taking the time to write readable documentation for this puts you even further ahead.

In terms of criticism/feedback...what was your reasoning for returning an Object with certain domain-specific properties (i.e. `data` and `fields`) rather than just an Array of hashes? I think the CSV constructor call works fine; that being the case, maybe add an option in the options hash to return a CSV-object-with-metadata. But by default, `csv.parse()` should return a plain array of objects, so that other programs that use it don't have to remember that the data is actually in the Object's `data` property.
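
Concretely, something like this is what I mean (the `detailed` option name is made up, just to illustrate the shape):

    var rows = new CSV(text).parse();  // default: a plain array, easy to consume
    var result = new CSV(text, { detailed: true }).parse();
    result.data;    // the parsed rows
    result.fields;  // the header names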

Also, you've probably seen D3's implementation, which is designed with the mindset that the CSV-to-be-parsed is likely an external file: https://github.com/mbostock/d3/wiki/CSV


I wish there had been a GitHub when I was his age. I remember writing a .ini file parser in Java. I showed it to my friends, who couldn't care less, and... that was the end of it.


Before GitHub there was Google Code, before that SourceForge, and before that planet-source-code.com, etc. (All those sites are still up, just not cool anymore.)


I had some inspiration from PapaParse, which is why I went with a returned object.

However, after reading your comment, remembering how Ruby does it, and giving it some thought, I decided to default to returning an array but provide the option for a more 'detailed' response.


Some feedback:

- Vanilla JS would be cleaner

- Make it usable with Node. (Check if the "window" variable exists; if not, export CSV as a module. Should be that easy AFAIK; see the sketch at the end of this comment.)

- You have some typos in the README.md

- Put this on Bower and npm

Good work. I was writing software at 16, but I wasn't doing it open source; I wish I had.
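
For the Node point above, a minimal sketch of that window check (assuming the library exposes a single `CSV` object):

    if (typeof window !== "undefined") {
      window.CSV = CSV;      // browser: attach to the global object
    } else if (typeof module !== "undefined" && module.exports) {
      module.exports = CSV;  // Node: export as a CommonJS module
    }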


Vanilla JS is up :) It's now usable with both Node and AMD. Typos should all be fixed. I'll put it up on Bower and npm once it's RFC 4180 compliant.

Thanks for the feedback!


What's AMD? I keep seeing that acronym but searching it just returns "Advanced Micro Devices"



Vanilla JS would be cleaner

Depends on your definition of 'clean' - IMO CoffeeScript is much cleaner than JS. But yes, more difficult to work with.


Where did you get your inspiration?

This is pretty similar to a library I'm actively working on called Papa Parse: http://papaparse.com


It was actually your library that inspired me :).

My reason for making this was a mix of "hey, let's try making this", "let me make something even shorter and (hopefully) sweeter", and "no jQuery".


Okay, cool. I'm surprised--but pleased--since inspiring others to code is one of my ambitions.

Your parser is off to a good start: you nailed the general idea of good API design. In other words, it's easy to use, which is immediately appealing. For other things, I think there's been great feedback in the rest of these comments.

(By the way, Papa Parse only uses jQuery for convenience in interacting with the DOM for things like file input elements. The core parser, including the file streamer, is vanilla JavaScript.)


Note that it fails to parse CSV like:

    "x",""x"",,"x
    x","x"
    "y",,,,123


The previous commit (probably) failed at that. However, with the latest commit, parsing that malformed CSV returns:

    [
      ["x","x\"","","x"],
      ["x","x"],
      ["y","","","",123]
    ]

I can't see anything wrong with it. Am I missing something?


It should parse as:

    [
      ["x","\"x\"","","x\nx","x"],
      ["y","","","",123]
    ]

It's valid CSV, inasmuch as it's both accepted and generated by a lot of tools and matches the RFC 4180 ABNF:

    escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE


I might be mistaken, but the second field in the first record looks invalid in that CSV. You'd need """x""" for that to work (quoted field and escaped quotes inside).
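
In code form, the escaping rule is just "wrap in quotes, double any embedded quotes" (a sketch, not anyone's actual library code):

    function encodeField(value) {
      value = String(value);
      if (/[",\r\n]/.test(value)) {  // needs quoting?
        return '"' + value.replace(/"/g, '""') + '"';
      }
      return value;
    }
    encodeField('x');    // => x
    encodeField('"x"');  // => """x"""  (what the second field above should have been)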


You're right. Removing the claim of compliance until I've devised a solution.


(Are you talking about Papa, seeing that your comment is a response to mine?)

That CSV text is malformed. Arguably, the parsing succeeded, but the expected errors were generated and reported.


Don't cast by default. It unnecessarily slows down the parser when the input is purely string data.

Don't hesitate to look at how other CSV libraries work. They've likely solved most/all of the issues you're currently having.

Do add the ability to cast scalar values (e.g. ints, floats); there's a sketch at the end of this comment.

Do verify RFC compliance using tests.

Do use vanilla JavaScript. Niceties like CoffeeScript are good for one-off webapps but add unnecessary dependencies/problems if you're targeting a larger audience.

Do have a solid plan for versioning. Only add backwards-incompatible changes in major versions (i.e. 1.0, 2.0).

Do have your users try the code in different browsers/platforms. Differences in Regex implementations can cause problems and using streams in Node will be completely different than handling them in the browser.

You win the million dollar prize if you can manage to figure out how to write a CSV stream reader that works in the browser.
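
As promised above, a sketch of opt-in scalar casting (illustrative only; it ignores exponents and locale formats):

    function castValue(s) {
      if (/^-?\d+$/.test(s)) return parseInt(s, 10);     // integer
      if (/^-?\d*\.\d+$/.test(s)) return parseFloat(s);  // float
      return s;                                          // leave everything else alone
    }
    castValue("42");    // => 42
    castValue("3.14");  // => 3.14
    castValue("42nd");  // => "42nd"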


You should ditch CoffeeScript and do it in vanilla JS, if for no other reason than to avoid people like me telling you to ditch CoffeeScript and do it in vanilla JS. ;)

In all seriousness though, far beyond anything I was doing at 16. Kudos.


I second this. I use and like CoffeeScript, especially if I'm pretty sure nobody else will touch my code, and especially if I'm just tinkering or exploring.

But after doing a lot of mostly CoffeeScript, I had to do some 'plain' JS work, and it took me a while to get back into it. And that can be a problem if you just need to get stuff done sometimes.

(PS: kudos for submitting this! I'm more than a decade older than you are, and I get nervous just thinking of sharing my code with this crowd!)


Repo updated, pure JS now :D.



Mike Bostock pointed out that he had already extracted D3's DSV parsing into its own repo when I expressed my disappointment that I couldn't use D3's without rewriting parts of its DSV parser.

It lives here: https://github.com/mbostock/dsv

PS: it properly scans and tokenizes CSVs, and handles quoted fields fine as a consequence.


If you are doing large-scale CSV parsing or encoding in Node, you may also find these two packages useful:

https://www.npmjs.org/package/binary-csv

https://www.npmjs.org/package/csv-write-stream

They are both written with handling very large files in mind, so they use buffers and streams for I/O.


The CSV 'format' is hell. I just wrote a blog post highlighting some of the issues: http://tburette.github.io/blog/2014/05/25/so-you-want-to-wri...


It's not as complex as you make it sound. The parser should pass along any non-terminal characters without issue.

The rest of the edge cases (e.g. newlines in data) can be handled by using a proper DFSM (deterministic finite state machine), as sketched below. None of that String.split('\n').split(',') garbage.
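
For illustration, a minimal sketch of that approach: walk the input once, tracking whether you're inside quotes. (No error reporting, and it leaves a trailing empty row if the input ends with a newline.)

    function parseCSV(text) {
      var rows = [[]], field = "", inQuotes = false;
      for (var i = 0; i < text.length; i++) {
        var c = text.charAt(i);
        if (inQuotes) {
          if (c === '"' && text.charAt(i + 1) === '"') { field += '"'; i++; } // escaped quote
          else if (c === '"') { inQuotes = false; }  // closing quote
          else { field += c; }                       // any char, including newlines
        } else if (c === '"') {
          inQuotes = true;
        } else if (c === ',') {
          rows[rows.length - 1].push(field); field = "";  // end of field
        } else if (c === '\r' || c === '\n') {
          if (c === '\r' && text.charAt(i + 1) === '\n') { i++; }  // swallow CRLF
          rows[rows.length - 1].push(field); field = "";
          rows.push([]);                                  // end of record
        } else {
          field += c;
        }
      }
      rows[rows.length - 1].push(field);  // flush the last field
      return rows;
    }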

If you're processing text as binary without using a string reader that can differentiate between UTF-8 and ASCII then you're doing it wrong.

With that said, I agree completely that people should use an established library. Code that has been viewed, used, broken by thousands of users is infinitely better than any home grown variant.

Source: With lots of blood, sweat, and tears, I authored one of those 'solid' libraries.


Ever heard of jquery-csv?

I wrote jquery-csv over two years ago with the goal of it being the first completely RFC-compliant CSV parser for JavaScript.

It integrates with the jQuery namespace but doesn't depend on it, is written in pure vanilla JS, and works with Node.js.

At the very least, if you want to claim RFC compliance you should have a test runner that verifies that your code doesn't break on the edge cases.

https://code.google.com/p/jquery-csv/source/browse/test/test...

Once you have your state machine working, your best bet for optimizing speed is to limit string copy operations. I managed this by using a regex tokenizer that groups any non-terminals (e.g. data between quotes), sketched below.
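
To illustrate the idea (a simplified pattern, not jquery-csv's actual regex):

    // One match per field: group 1 grabs the whole field (quoted or bare),
    // group 2 captures the terminator that followed it.
    var FIELD = /("(?:[^"]|"")*"|[^,\r\n]*)(,|\r\n|\n|$)/g;
    var m;
    while ((m = FIELD.exec('a,"b,b",c')) !== null) {
      console.log(m[1]);       // a  then  "b,b"  then  c
      if (m[2] === "") break;  // zero-width match on $ means end of input
    }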

I wrote it with the intent of providing a lib that can effectively parse CSV data from the browser (loaded remotely via AJAX or locally via the HTML5 File API).

The biggest weakness of parsing CSV in the browser is the inability to process data streams. The 2GB memory limit on JS scripts in the browser becomes a fundamental weakness when you're trying to process large CSV files.

CSV in general is terrible for data storage unless it's only used for serialization, because reading from an arbitrary point in the input stream requires the parser to start from the beginning. You're basically screwed if you can't hold the whole dataset in memory as a 2D array.


Consider starting with the default behavior in the documentation. For example, in the parse section: if no header option is provided, the first row will be returned as an array, correct?

Start there: "The default behavior is to return each row of the CSV as an array." then "If the CSV file has a header, pass in {header: true} to get back an object using the header values as keys." then "If the file does not have a header, pass in ...".

Also, can a user override a header that is present in the data?
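
In code form, the progression might read like this (constructor style assumed from the README; exact option names may differ):

    new CSV(text).parse();                    // default: each row is an array
    new CSV(text, { header: true }).parse();  // rows as objects keyed by the header row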


Thanks for the feedback on the docs. Working on them now.

As for overriding the header, I think its possible utility offsets the ~3 LoC that it adds. Coming next commit.


So will this let me make a CSV file in JavaScript and then actually let the user download it as a file?

That would really simplify a lot of my workflows, instead of having to make separate CSV and HTML views in Django.


You could create a downloadable file, but only in newer browsers.

We have an internal application which requires CSV export, we use something like this:

        // Build the CSV text, then wrap it in a data: URI the browser can open
        csv = cleanTurkishChars(makeCSV(data));
        csv = 'data:application/csv;charset=utf-8,' + encodeURIComponent(csv);
        // Point an anchor at the data URI; clicking it opens/downloads the CSV
        $("#csvexport").attr({
            'href': csv,
            'target': '_blank'
        });
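
One refinement worth knowing about (support varies by browser): the HTML5 download attribute lets you suggest a filename for the link, e.g.

    $("#csvexport").attr("download", "export.csv");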


Thanks! Any idea where I can read more about it? What browsers support it? Are there file size limitations?


This'll return a CSV based on a JavaScript array of objects (or arrays). You'll have to find something else to let the user download a .csv, since that's outside the use-case this was designed for.


In case someone wants a more mature parser: https://github.com/koles/ya-csv


Hey, I needed a JS CSV parser a few days ago. Here is my parser; I bet it is 100 times faster than yours :)

    function parseCSV(str) {
      var obj = {};
      var lines = str.split("\n");
      var attrs = lines[0].split(",");
      for (var i = 0; i < attrs.length; i++) obj[attrs[i]] = [];
      for (var i = 1; i < lines.length - 1; i++) {
        var line = lines[i].split(",");
        for (var j = 0; j < line.length; j++) obj[attrs[j]].push(line[j]);
      }
      return obj;
    }


Congratulations, this code failed to properly parse my CSV at least 100 times faster than OP's :)


Looks like you've missed values enclosed in quotes that might contain newline characters or commas -- very important for things like address fields.



