I don't want to be negative, but this will not work on CSV files that contain newlines inside columns. You should also implement quote escaping; it's essential for the data to round-trip. Good for a first attempt, but as simple as CSV may look, there are subtle points you should be aware of if you want to handle most, if not all, of the CSV out there. I've worked with lots of other implementations that get these little things wrong too, and it's particularly aggravating when CSV is supposed to be a fairly standard interchange format for tabular data. Please refer to RFC 4180 for the details.
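To illustrate the quoting rule RFC 4180 describes, here's a minimal sketch (the function name is my own) of a field serializer that wraps a field in quotes when it contains a comma, quote, or newline, and doubles any embedded quotes so the data can round-trip:

```javascript
// Hypothetical helper illustrating the RFC 4180 quoting rule.
// Quote the field only when necessary; double embedded quotes.
function escapeField(field) {
  if (/[",\r\n]/.test(field)) {
    return '"' + field.replace(/"/g, '""') + '"';
  }
  return field;
}

escapeField('plain');     // → 'plain'
escapeField('say "hi"');  // → '"say ""hi"""'
escapeField('a,b');       // → '"a,b"'
```

A parser that reverses exactly this rule (un-double the quotes, allow newlines inside quoted fields) is what it takes to round-trip.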
MS Excel also defaults to using \r as the newline char in its OS X version (despite \r having been obsolete on OS X since v10.1). It's true that CSV parsers have been around in various forms for a long time, but 'complete' implementations in JavaScript haven't been available until recently.
Not an excuse, and not really true if we're talking about libraries. Naive one-off implementations of readers maybe.
>heck, MS Excel did that until a few versions ago
Nope. Excel, even from 14 years ago, has an extensive panel for configuring how your CSV file is laid out (encoding, terminator, etc.), and can also handle newlines inside items just fine.
Large frameworks are great, but generally I only adopt them when starting a new project; or the occasional large refactoring project.
Whereas targeted problem-solving libraries like this one get adopted all day long. Next time I need to parse a CSV on the client, you can bet the first thing I'll do is Google to see if there's a library I can download and be done with it.
So, great job solving a targeted problem! Keep building things and keep contributing. People will find code like this endlessly useful.
The fact that you're 16 and getting into software development is cool, but not particularly exceptional. I don't know how much more exceptional it is that you're already into OSS, since that was less of a concept when I was your age... though speaking as someone much older than you who is only now slowly contributing to OSS, you're probably ahead of the curve.
But what impresses me most is that you're only 16 and you care enough about something as "boring", yet bread-and-butter, as CSV parsing. CSV parsing is not particularly sexy, and there are a lot of libs that do it... but tackling it as a problem to solve, never mind open-sourcing your solution, is really cool. Even if you decide you don't want to be a full-time software developer, your experience with (and tolerance for) this kind of data munging will make you a valuable person to work with in any field.
The fact that you're taking the time to write readable documentation for this puts you even further ahead.
In terms of criticism/feedback: what was your reasoning for returning an Object with certain domain-specific properties (i.e. `data` and `fields`) rather than just an Array of hashes? I think the CSV constructor call works fine, and that being the case, maybe add an option in the options hash to return a CSV-object-with-metadata. But by default, `csv.parse()` should return a plain array of objects, so that other programs using it don't have to remember that the data is actually in the Object's `data` property.
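To sketch what I mean (hypothetical API; the comma splitting is deliberately naive, just to show the shape of the two return values):

```javascript
// Hypothetical API sketch: parse() returns a plain array of rows by
// default, and the richer { fields, data } object only when asked.
function parse(text, options) {
  options = options || {};
  var rows = text.trim().split('\n').map(function (line) {
    return line.split(','); // naive split, just to illustrate the API shape
  });
  if (options.meta) {
    return { fields: rows[0], data: rows.slice(1) };
  }
  return rows;
}

parse('a,b\n1,2');                 // → [['a', 'b'], ['1', '2']]
parse('a,b\n1,2', { meta: true }); // → { fields: ['a', 'b'], data: [['1', '2']] }
```

The default stays boring and interoperable; the metadata is there for callers who opt in.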
Also, you've probably seen D3's implementation, which is designed with the mindset that the CSV-to-be-parsed is likely an external file: https://github.com/mbostock/d3/wiki/CSV
I wish there had been a github when I was his age.
I remember writing a .ini file parser in Java. I showed it to my friends who couldn't care less and... that was the end of it.
Before GitHub there was Google Code, before that SourceForge, and before that planet-source-code.com, etc. (all of these sites are still up, just not cool anymore)
I had some inspiration from PapaParse, which is why I went with a returned object.
However, after reading your comment, remembering how Ruby does it, and some thought, I decided to default to a returned array but provide the option for a more 'detailed' response.
Okay, cool. I'm surprised, but pleased, since inspiring others to code is one of my ambitions.
Your parser is off to a good start: you nailed the general idea of good API design. In other words, it's easy to use, which is immediately appealing. For other things, I think there's been great feedback in the rest of these comments.
(By the way, Papa Parse only uses jQuery for convenience in interacting with the DOM for things like file input elements. The core parser, including the file streamer, is vanilla JavaScript.)
I might be mistaken, but the second field in the first record looks invalid in that CSV. You'd need `"""x"""` for that to work (a quoted field with escaped quotes inside).
Don't cast by default. It unnecessarily slows down the parser when the input is purely string data.
Don't hesitate to look at how other CSV libraries work. They've likely solved most/all of the issues you're currently having.
Do add the ability to cast scalar values (e.g. ints, floats).
Do verify RFC compliance using tests.
Do use vanilla JavaScript. Niceties like CoffeeScript are good for one-off web apps but add unnecessary dependencies/problems if you're targeting a larger audience.
Do have a solid plan for versioning. Only add backwards-incompatible changes in major versions (e.g. 1.0, 2.0).
Do have your users try the code in different browsers/platforms. Differences in regex implementations can cause problems, and using streams in Node will be completely different from handling them in the browser.
You win the million dollar prize if you can manage to figure out how to write a CSV stream reader that works in the browser.
You should ditch CoffeeScript and do it in vanilla JS, if for no other reason than to avoid people like me telling you to ditch CoffeeScript and do it in vanilla JS. ;)
In all seriousness though, far beyond anything I was doing at 16. Kudos.
I second this. I use and like CoffeeScript, especially if I'm pretty sure nobody else will touch my code, and especially if I'm just tinkering or exploring.
But after doing a lot of mostly CoffeeScript work, I had to do some 'plain' JS, and it took me a while to get back into it. That can be a problem when you just need to get stuff done.
(PS: kudos for submitting this! I'm more than a decade older than you are, and I get nervous just thinking of sharing my code with this crowd!)
Mike Bostock pointed out that he had already extracted D3's DSV parsing into its own repo when I expressed my disappointment that I couldn't use it without rewriting parts of D3's DSV parser.
It's not as complex as you make it sound. The parser should pass along any non-terminal characters without issue.
The rest of the edge cases (e.g. newlines in data) can be handled by using a proper DFA (deterministic finite automaton). None of that String.split('\n').split(',') garbage.
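A minimal sketch of that state-machine approach, handling quoted fields, doubled quotes, and newlines inside data. This is not a complete RFC 4180 parser (it doesn't normalize CRLF, for instance), just an illustration of the technique:

```javascript
// Character-by-character CSV parser driven by a single piece of state
// (inQuotes). Newlines and commas only act as terminals outside quotes.
function parseCSV(text) {
  var rows = [[]], field = '', inQuotes = false;
  for (var i = 0; i < text.length; i++) {
    var c = text[i];
    if (inQuotes) {
      if (c === '"' && text[i + 1] === '"') { field += '"'; i++; } // doubled quote
      else if (c === '"') { inQuotes = false; }                    // closing quote
      else { field += c; }                                         // data (incl. newlines)
    } else if (c === '"') {
      inQuotes = true;
    } else if (c === ',') {
      rows[rows.length - 1].push(field); field = '';
    } else if (c === '\n') {
      rows[rows.length - 1].push(field); field = ''; rows.push([]);
    } else {
      field += c;
    }
  }
  rows[rows.length - 1].push(field);
  return rows;
}

parseCSV('a,"b\nc",d\n"x""y",z');
// → [['a', 'b\nc', 'd'], ['x"y', 'z']]
```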
If you're processing text as binary without using a string reader that can differentiate between UTF-8 and ASCII then you're doing it wrong.
With that said, I agree completely that people should use an established library. Code that has been viewed, used, broken by thousands of users is infinitely better than any home grown variant.
Source: With lots of blood, sweat, and tears, I authored one of those 'solid' libraries.
Once you have your state machine working, your best bet for optimizing speed is limiting string copy operations. I managed this by using a regex tokenizer that groups any non-terminals (e.g. data between quotes).
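A rough sketch of that tokenizer idea: the regex below grabs a whole run of non-terminal characters in a single match, so most fields come out of one slice instead of many per-character appends. The exact terminal set is my assumption of what such a tokenizer would treat specially:

```javascript
// Hypothetical tokenizer: each match is either a terminal ("", ", comma,
// newline) or one full run of ordinary data characters.
var TOKEN = /("")|(")|(,)|(\r?\n)|([^",\r\n]+)/g;

function tokenize(text) {
  var tokens = [], m;
  while ((m = TOKEN.exec(text)) !== null) {
    tokens.push(m[0]);
  }
  return tokens;
}

tokenize('a,"b,c"\nd');
// → ['a', ',', '"', 'b', ',', 'c', '"', '\n', 'd']
```

The state machine then consumes tokens instead of characters, and a field like `hello world` arrives as one string rather than eleven concatenations.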
I wrote it with the intent of providing a lib that can effectively parse CSV data from the browser (loaded remotely via AJAX or locally via the HTML5 File API).
The biggest weakness of parsing CSV in the browser is the inability to process data streams. The 2GB memory limit on JS scripts in the browser becomes a fundamental weakness when you're trying to process large CSV files.
CSV in general is terrible for data storage (as opposed to pure serialization), because reading from an arbitrary point in the input stream requires the parser to start from the beginning. You're basically screwed if you can't hold the whole dataset in memory as a 2D array.
Consider starting with the default behavior in the documentation. For example, in the parse section: if no header option is provided, the first row will be returned as an array, correct?
Start there: "The default behavior is to return each row of the CSV as an array." then "If the CSV file has a header, pass in {header: true} to get back an object using the header values as keys." then "If the file does not have a header, pass in ...".
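Something like this hypothetical sketch of the two documented behaviors (naive splitting, just to illustrate the effect of the header option):

```javascript
// Hypothetical sketch: default returns rows as arrays; { header: true }
// uses the first row's values as object keys.
function parse(text, options) {
  var rows = text.trim().split('\n').map(function (l) { return l.split(','); });
  if (options && options.header) {
    var fields = rows[0];
    return rows.slice(1).map(function (row) {
      var obj = {};
      fields.forEach(function (f, i) { obj[f] = row[i]; });
      return obj;
    });
  }
  return rows;
}

parse('name,age\nAda,36');                   // → [['name', 'age'], ['Ada', '36']]
parse('name,age\nAda,36', { header: true }); // → [{ name: 'Ada', age: '36' }]
```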
Also, can a user override a header that is present in the data?
This'll return CSV based on a JavaScript array of objects (or arrays). You'll have to find something else to let the user download a .csv file, since that's outside the use case this was designed for.
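For the array-of-objects direction, a hedged sketch (the function name is mine, and it assumes the first record's keys define the columns) with RFC 4180 quoting applied to each value:

```javascript
// Hypothetical serializer: array of objects → CSV text with a header
// row, CRLF line endings, and quoting applied only where needed.
function toCSV(records) {
  var fields = Object.keys(records[0]); // assumes all records share these keys
  var escape = function (v) {
    v = String(v);
    return /[",\r\n]/.test(v) ? '"' + v.replace(/"/g, '""') + '"' : v;
  };
  var lines = [fields.map(escape).join(',')];
  records.forEach(function (rec) {
    lines.push(fields.map(function (f) { return escape(rec[f]); }).join(','));
  });
  return lines.join('\r\n');
}

toCSV([{ name: 'Ada', quote: 'say "hi"' }]);
// → 'name,quote\r\nAda,"say ""hi"""'
```

Triggering the actual download would then be a separate concern (e.g. a Blob plus an anchor element in the browser).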