Hacker News new | past | comments | ask | show | jobs | submit login
So you want to write your own CSV code (2014) (thomasburette.com)
303 points by ColinWright on May 15, 2020 | hide | past | favorite | 253 comments



I wrote my own CSV code. That's only about 10 issues, and I understand them all, and they're easy to handle, and to test (unlike, say, threading issues, or OS bugs that don't happen on my computer).

That wasn't even close to being one of the more difficult parts of my program to write. I've had to do zero work to maintain it. I don't seem to have ever had any bugs filed against it. Are these supposed to be difficult? Sure, don't write your own encryption algorithm, and avoid touching threads if at all possible, but CSV?

Sure, I wish I could have used a library. Unfortunately, once you start adding requirements, the options start to vanish. It needs to be a C (or C-callable) library. It needs to accept input as Unicode strings (lines), not just files on disk, or bytes. It needs to return results line-at-a-time. It needs to have a compatible license. It needs to support the same package manager as I'm using (or be a trivial to include, like a single file). It needs to be reliable (essentially bug-free, or at the very least currently maintained). It needs to be fairly efficient. What library supports all these?

> Writing CSV code that works with files out there in the real world is a difficult task. The rabbit hole goes deep. Ruby CSV library is 2321 lines.

I love Ruby, but that file looks a little nuts to me right now. The block comment at the top of csv.rb is over 500 lines long. The Ruby module tries to do everything. Just "def open(filename, mode="r", options)" and its docstring are 100 lines long! This class includes methods for auto-detecting the encoding, and auto-detecting common date formats, and some kind of system "for building Unix-like filters for CSV data".

I don't put all of that into my CSV parser, because I want my code to be modular. I have features like data detectors and filters and open-file-by-filename in my program, too, but they're fully independent, not tied to any one file format. When I come up with a better way to detect dates in text, I don't want to tweak a 6-line regex in every file format I've written.


1) You wrote a general CSV input parser, but what about someone writing CSV output? The obvious way (write a ',' between each cell and a '\n' between each row) works most of the time, but it doesn't always work. The output needs to be escaped correctly, and data escaping should be well tested. I think the issue is that CSV sounds very easy to implement, but most naive implementations will be wrong.

2) You took on exactly the responsibilities the article author said you'd have to take on. Based on the screenshots for your "Strukt" application, it looks like the user controls configuration/validation for parsing CSV and they will fix any issues with parsing the CSV file themselves.

> If a supplied CSV is arbitrary, the only real way to make sure the data is correct is for an user to check it and eventually specify the delimiter, quoting rule,… Barring that you may end up with a error or worse silently corrupted data.

3) I think the better question is, how do you handle malformed CSV? Is your error correction as robust as Ruby's? Do you handle malformed CSV the same way as other libraries? If it's handled differently by different parsers, you can run into issues like Apple recently did [0]

[0] https://siguza.github.io/psychicpaper/


> You took on exactly the responsibilities the article author said you'd have to take on.

Yep, and having been there, I’m reporting: it’s not nearly as bad as this post makes it sound.

I’ve written web apps, too, and (for example) CSS is 100 times worse. Nobody is suggesting every website should use only browser default styles, though.

> I think the better question is, how do you handle malformed CSV?

What does “malformed” mean? The good/bad thing about CSV is that virtually every text file is valid! The only malformation I can think of is an open quote with no matching close quote (so the entire rest of the document is one value). My implementation is streaming, so there’s no great way to flag it: any future data could have the other quote!

We show the data as it’s parsed, so it should be obvious to the user what is going on, and where.

There are areas of computing that are too complex. I’m usually the first to complain about such things. I really don’t think CSV parsing is one of them.


It's even worse than the author is suggesting. For most people, "RFC4180" is meaningless, all that matters is what Excel does. And that means you need to handle a bunch of cases if you are reading AND if you are writing files. A few cases not discussed in the blog post:

- if your file starts with \x49\x44 ("ID"), Excel will interpret the file as their symbolic link .SLK format. So if you're writing files, the ID should be wrapped in double quotes even if it isn't necessary according to RFC4180

- Excel will proactively try to "evaluate" fields that start with \x3d ("="). You can see this in action with the sample file

    1,2,3
    =SUM(A1:C1)
- Excel will aggressively interpret values as dates when possible. For example, SEPT1 issues https://genomebiology.biomedcentral.com/articles/10.1186/s13...

CSV parsing / writing certainly isn't going to be a value driver for most companies (if you're supporting user imports, you really care about XLSX/XLSB/XLS files and Google Sheets import), but it's not a trivial problem.


"What does “malformed” mean? The good/bad thing about CSV is that virtually every text file is valid! The only malformation I can think of is an open quote with no matching close quote (so the entire rest of the document is one value). My implementation is streaming, so there’s no great way to flag it: any future data could have the other quote!*

An example of improper csv code is this:

hello,world,"And this, contains "bad quotes"...",1,2,3

Lest you think this is made up, I ran across this when someone cut and paste Excel into a text field.

I also have seen batch processing of user files break hard when a quote issue like this caused a hand-rolled CSV parser to conclude that half the file was a single very long field.


I think you have one of the best use cases for rolling your own parser. Your tool's purpose is to read and parse arbitrary data (and then transform and display it). But, I think you're misinterpreting how much work the post is saying CSV will take to implement:

> Easy right? You can write the code yourself in just a few lines.

My take-away from the post is: if you are parsing arbitrary CSV files, you need to make parsing configurable because there's no one, true CSV format. If you are writing CSV files, you may need to escape your fields in a weird, outdated manner.

P.S. By "malformed", I meant whether the 2D matrix of byte arrays is read exactly as intended. It could be caused by an open quote, but it could be incorrect escaping or inconsistent delimiters. Since there's no inline schema saying which CSV parsing configuration is being used, you must ask the user to configure the CSV parser and validate the output.


> I don't put all of that into my CSV parser,

The art of not cluttering a library with too much stuff is a hard lesson to acquire. Most library authors try to throw in the kitchen sink. It's insane. I'm sure I was guilty of it in the past.

For parser take a string. let me do the I/O outside. If you want to offer streams then just offer an abstract interface stream and then a few examples of implementing a stream but don't include them in your library. It's just clutter.

This is one if largest problems with NPM. So many libraries try to do too much. Some come with a command line utility "just because"!??!?! It then adds more surface area, more dependencies, more churn as we have to update all those un-needed dependencies.


Isn't NPM also notorious for ultra minimalist libraries like leftpad as well, though?


Leftpad is more a symptom of JS lacking a real standard library than anything else.


Yes. One of the costs of having a low barrier to entry for publishing is that the quality is generally lower. NPM still hosts tons of very well written code, but you have to sift through quite a lot of noise to find it.


> Sure, I wish I could have used a library. Unfortunately, once you start adding requirements, the options start to vanish. It needs to be a C (or C-callable) library. It needs to accept input as Unicode strings (lines), not just files on disk, or bytes. It needs to return results line-at-a-time. It needs to have a compatible license. It needs to support the same package manager as I'm using (or be a trivial to include, like a single file). It needs to be reliable (essentially bug-free, or at the very least currently maintained). It needs to be fairly efficient. What library supports all these?

I always thought people just used libcsv in cases like this. If that's missing something though, then I believe my Rust csv library[1] would be good enough if I added a C API, which would be pretty easy and I would be happy to do it if it really filled a niche. I think the only hurdle at that point would be whether you could tolerate a Rust compiler in your build dependencies.

[1] - https://docs.rs/csv-core/0.1.10/csv_core/


> I love Ruby, but that file looks a little nuts to me right now. The block comment at the top of csv.rb is over 500 lines long. The Ruby module tries to do everything...

FYI: CsvReader [1] is a more modern alternative to the old standard csv.rb library in ruby.

[1]: https://github.com/csvreader/csvreader


I wish the JVM was less of a pariah. I love being able to just reach for https://commons.apache.org/proper/commons-csv/user-guide.htm...

Here is the entire logic of the CSV parser (test code is in a separate path) https://gitbox.apache.org/repos/asf?p=commons-csv.git;a=tree...


I'm not as anti-JVM as I once was, though the recent licensing changes have left me confused and nervous. I read their licensing FAQ (again) and I still can't tell if I'm allowed to include recent versions of their JVM in my application.

Were I to use the JVM for my background processing needs (the thought has occurred to me!), I'd definitely use Clojure -- 'data.csv' is less than 150 lines, including comments, for both the parser and formatter!


I suggest looking into very specifically AdoptOpenJDK.

https://adoptopenjdk.net/faq.html

A lot of people have been digging into the general insanity of Java licensing since Java 8's divergence. Here is a good overview https://medium.com/@javachampions/java-is-still-free-2-0-0-6...


In what sense is the JVM a pariah? It's one of the most successful software platforms.


- Oracle's licensing confusion (where does Java and Java(TM) begin/end).

- The JVM historically has included the kitchen sink (Java 9 started addressing this).

- People's impressions that it is slow or a memory hog. Edit: (Looking at you Eclipse/Intellij/Glassfish/JBoss/Atlassian)

- Experiences with bad code bases due to it's ubiquity and age. Old code bases tend to evolve in interesting ways.

- It's not the new shiny thing.

I have meet plenty of developers who dismiss Java out of hand. I've also seen DBA's dismiss FKs. I don't think how good/bad it is has anything to do with people's dislike of it, more that there is a certain vocal group that dislike it for being the "enterprise" solution.


But this does not correlate to how used the JVM actually is in the outside world. The JVM is huge. It's not a pariah. The perceptions of JVM slowness have been debunked again and again, and anyone who cares knows they are mostly a red herring and have been so for years.

Java and the JVM are mature, battle-tested, widely used technologies in the industry, easy to hire for and with multitude of libraries for almost whatever you want to do.

It may not be the latest shiny new thing, but that's different. Pariah means an outcast. Neither Java nor the JVM are outcasts.


having been bit by both reinventing the wheel and overly-relying on third party code, i find that my position on this varies from week to week.


As I find myself writing my own PHP quasi-framework from scratch in 2020, I am definitely feeling this. On the one hand, reinventing a wheel like this is utterly insane. On the other hand, for the simple project I'm doing this for, nearly all the other wheels I've found turn out to be monster truck tires, milspec tank treads, or waterskis.


I am betting that Ruby library still can’t handle a good 50% of the files that people throw at it because there’s not such thing as CSV standard.

On the other hand writing a custom parser using some off-the-shelf parser combinator library is easy and straightforward (ie I know how much time it will take to implement one and what exactly it does).

Sometimes having multiple implementations of the same is better than a single generalised implementation struggling from combinatorial explosion.


True, is not rocket science to write a CSV parser. The problem with CSV is it looks so deceptively simple so people write simplistic ad-hoc parsers/exporters all the time, so we end up with a bunch of incompatibilities.


There is a big difference between writing an application and being a library author.


One thing that has always puzzled me is why CSV exists, given the ASCII unit/record seperators appear to do exactly what the comma and newline characters are attempting to do with codes which could appear within records, and have been well defined since some point in the 60s.

And here we are, 60 years later still struggling to work out where a record ends...


Easy: CSVs are human readable and writeable and humans tend to have a comma sign on their keyboard, while they don't have the ASCII record seperator visible (teaching them to use eg Alt + 30 will not work).

And if they had that key on their keyboard then you'd have the comma problems all over again: What if a ASCII record seperator shows up in the field?


If the ASCII separators were supported regularly in text editors since the early days then it would remove 90% of the problems.

I prefer the | pipe character as a delimiter - easy to see, not part of common speech or titles, and enterable via keyboard. Yes, it can exist in the field but less likely.


But if it _might_ turn up, you have to assume it will and deal with it, and then you're into the exact same problem as before, but with something even less 'standard' than before.

I think it's like cryptography. Why bother to roll your own when there are people who are cleverer (certainly than me. I don't know about you) who've already put a lot of effort in to this, so just use one of the well tested standard libraries and don't mess with it


I'm not recommending "rolling your own". You still use CSV libraries, but with a pipe delimiter instead.


It looks a LOT like l and I.


Luckily code sees it as ASCII 124. If you're saying that for compliance, | can be converted to l/I then it's a feature, not a downside.


> What if a ASCII record seperator shows up

point is, being not human readable and having no keyboard key, you can reasonably expect those special separators not to.

I suppose I'd consider such special-separated files (26 to 29 from memory?) to be machine generated and machine readable only, not intended as human readable without a bit of extra software or eg. a special emacs mode


You're almost certainly describing a file format that already exists and that nobody uses because it is not human-interactable.

Anyone can edit a csv. I often have. An important feature.

The only real stumper in the CSV format is why they didn't use \ as the escape in strings like everyone else. Probably some good reason.


While backslash escaping is very common, double-quote (or double-special-anything) is fairly common as well. I use it all the time in SQL.


Interesting. I don't think I have ever hand edited a CSV (unless you mean created from scratch, well then, perhaps).

OTOH the ability to just look at it... that I've found very valuable and agree there.


When considering standard A that can only work with machine generated and machine readable files and standard B that can work with all those files plus human-readable/editable files, I don’t find it surprising at all that standard B wins in the market.


It happens. XPM/PBM/XBM lost to TIFF (and later PNG). Postscript lost to PDF. OpenOffice.org XML couldn't displace Excel's Binary File Format.

As much as I wish that beauty or usability was the primary indicator of market success, the simpler explanation that explains all of these (including CSV) is: Microsoft Office supported them.


I don't know about XPM/PBM/XBM lost but AIUI PDF is just a subset of postscript with compression. If you think MS's binary excel file format is all that secret, I suggest you try unzipping it - it is no more than compressed XML. I have rescued parts of corrupt excel and word files by doing just that.


MS completely redesigned their binary formats a few years ago. You're right about the current format, but the earlier one was completely undecipherable. Even the current XML has quirks in it to be compatible with the older format.


It was reversible enough to write the Apache POI library. I maintain code that uses the HSSF streaming API portion of the library, which mainly gives you low-level access to the structures in XLS files. The OO.org folks made some very nice documentation: http://www.openoffice.org/sc/excelfileformat.pdf


PDF diverged from Postscript years ago.

Microsoft’s binary Excel format is not XML -- you’re thinking of its replacement. It was COM/OLE or some such.


Binary formats win I think when the ASCII format is significantly greater in storage size and parsing speed. XPM/PBM/XBM are enormous compared to any binary image format. And remember these images used to get put on 1.4meg floppies and sent over 2400baud modems.

It's basically use the right tool for the job. For pixels that's a binary format. For mostly text, text itself works pretty well.


There is no standard but one for CSV files (the RFC 4180 mentioned).

The whole article pretty well rams home the fact there is no standard B because there is no standard which is adhered to.


Humans are not bound by the computer science definition of “standard”.

But you’re correct that I was using the word “standard” imprecisely to mean something more like “file extension” rather than an IETF RFC.


If someone copy-pastes a line from one CSV to a string within a field of another, the second will have record separators from the first within a single field.


> CSVs are human readable and writeable and humans tend to have a comma sign on their keyboard

Do people usually compose CSV data by hand? I thought they would use a program like Excel, enter or generate their data in separate cells and then save the file as a CSV. There's no reason why a program like Excel (or any other program for that matter) couldn't use the record separator instead of a comma as a delimiter when generating or consuming such files.


Compose, rarely. Read and edit with plain text editor, it's common.


I did it like three times already today.


What editor did you use?


vim / sublime text, depending on context.

I do a lot of troubleshooting using CLI tools like grep and wc.


I do quite a bit for GitHub markdown tables, for making small annotations for ML, for python parsing, etc. Very common and would be annoying too open up excel every time I needed some csv.


I thought markdown tables used the pipe character as the delimiter between table columns.

Also, is not necessary to use excel, and a lot of the issues brought up in the article wouldn't be the case with a different delimiter character. Editors could easily be updated to make entering that delimiter character easier to enter by mapping it to a key like tab whenever a file like that is opened.

It just seems like a problem that could be solved, but can't be due to inertia.


> I thought markdown tables used the pipe character as the delimiter between table columns.

Markdown (the original one by John Gruber) does not include any syntax for tables. Other implementations have included it, but there is no standard. As far as I know, using the pipe character to separate columns is indeed quite common, but it is not the only way.


In my experience, they'll have a database that they administrate through excel, then output it as CSV for whatever tool that needs to process it.


But humans rarely use notepad these days. They use code editors like Coda or VSCode at the very least which have all kinds of advanced features. Surely, those can include support for ASCII separators?


Let's say we have a product master. Something looks fishy and we want a brand person or product supply person to check... Push a CSV and it's trivially easy for them to open in XL and check, make corrections and send.

In a business context, this happens far more often than you may expect. Sure you can build a custom platform to validate and make people connect to a form connecting to a database - but Excel is a great user interface and everyone knows excel.

A funny incident - we were struggling to build a complex workflow in KNIME where at some points we need user input. Nothing out of the box was great - tools either assume a dashboard paradigm or a data flow paradigm - nothing In between.

One of our creative folks came up with the solution of writing to a CSV and getting KNIME to open it in excel. The user would make changes and save, close excel and continue the workflow in KNIME. Even completely non technical people got it.


We had to ban our non-technical users from opening CSVs in Excel because it would strip leading zeros and do other exciting things like convert large numbers to 1.23123E+20 :(

It does this without any notification or warning when you save.


Kill me...i've dealt with flat file ETL for ~9 years now, with the first 2 pretty much dedicated to it. I hate hate hate Excel for no reason other than that. It has wasted so many hours due to customers messing up their identifiers by opening up their feeds in Excel prior to sending, and accidentally transforming like above, then having to fix changing customer identifiers used throughout the system once they figure out what happened.


Excel can only display CSV files. I had to ban myself from editing files in Excel because it’s atrocious at it. It also likes to somewhat change decimal-seperators (again, without any warning).


Even technical users have trouble round tripping data to Excel.

https://genomebiology.biomedcentral.com/articles/10.1186/s13...:

“A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.”

That paper is from 2016, at least 12 years after the problem was first identified.


East coast zip codes with leading zeros are a real problem.


That behavior is technically correct because those are all valid representations of that value, e.g., 007449938 = 7449938 = 7.449938E6. Pretty much any plain text (or binary) data format will have this same issue, unless it supports additional metadata for display purposes.

If you intended the data to be a string value, then it should have been enclosed in quotes.


The problem is that when you have a number like 123451234512345, it may get rewritten as 1.234512E+14, and then you are loosing information.


Your assertion sounded incorrect so I tested it. Quoting large numbers does not preserve them on a save.

Adding a ' before it should work, but for data extracted from arbitrary systems you'll have few guarantees of the sort. Formatting the cells in Excel to display the full value instead of the truncated value also works from memory, but you won't always know ahead of time if this happened later in the file.

In our case it was often mailing identifier barcodes so any loss of precision made them entirely useless.


What if the value is an ID rather than an integer? CSV doesn't specify whether the data type is a string or a number. The schema needs to be inferred by the parser/user.


> If you intended the data to be a string value, then it should have been enclosed in quotes.

That's not part of any CSV specification I've seen, including RFC4180.


but if using the ASCII separators was common, excel would support that and it seems like this workflow would be unaffected?


People still need to open them in non-excel apps. And the format may predate widespread use of spreadsheet software.

Fun fact, Excel will truncate numbers beyond 15 digits unless prefixed with a single quote mark.


> But humans rarely use notepad these days. They use code editors like Coda or VSCode at the very least which have all kinds of advanced features. Surely, those can include support for ASCII separators?

I do a fair amount of work with companies that do "EDI" over CSV (or worse CSV-like - think 2 CSVs jammed together with different formats, no headers, no support for escaping or quoting) and fixed width documents. I can absolutely assure you that humans do open these files in notepad far more often than I'd like.

Often one of the main reasons they don't use things like X12, ASCII separators, etc. is because a "human needs to open it at some point" was a prevailing business decision some number of years ago (think "what happens if the IT system fails? how can we still ship stuff even in a complete emergency") and now it's baked into their documented process so deeply its like shouting into the wind to alter things. Third party warehouses are the worst at this.


Developers are by no means the only humans that use and/or edit CSV files.


Developers rarely use notepad. Non technical users still have to edit csvs by hand, all over the world.


The average CSV use is as an export/import from excel and most of those users don't have specialised developer editors.


I have not seen very many business analysts open an IDE. In my experience, most edit CSVs in Excel.


I don't think your first line is at all true.

Modifying Csvs with notepad is rife. I'd wager more than using any one particular code editor.


Or they use Excel, which can introduce its own entertaining errors.

A few years back while working on something that unavoidably used large quantities of CSV data we would sternly warn people not to use Excel, but people still did it ("But I just looked at it!").


The TSV (tab-separated values) format is supported by a lot of software (Google Sheets, for instance) and solves the "comma in the field" problem quite nicely. Tabs can easily be typed by a human, but they don't usually occur within a one-line string, so they don't cause the same problems as CSVs do.


TSV, CSV, PSV etc. all share the same class of problem. The important thing is that you are quoting your fields and escaping your quotes, regardless of the symbols used.


Importantly, tabs aren’t usually inserted into fields by Excel users. With Power BI, exporting TSV is a little easier for non-dev Excel power users, too.


Nailed it. When Excel users press tab, they mean "go to the next cell." Conveniently, this is also what the format means to the computer when it encounters the tab separator. It's not foolproof, but it's much better. The human interpretation of the character matches the machine interpretation.


“Don’t usually occur”

They do. People write code to create all sorts of bastardized abominations of “csv” or tab delim or whatever. It’s why the featured article gets reposted every few years. You can define a standard for csv files, but then Excel does it’s own thing and here we are.


I think that stems from the fact that no software ever properly supported them, they are non-printable characters so you can't see them easily. Whereas with CSV you can fire up notepad.exe and explore the data.

Given the inability to standardise on line endings "\n" "\r\n" or "\r" I don't think we can standardise on using unit or record separators


No, we certainly can’t expect people to use the ASCII ‘Record separator’ character as a record separator, that would be madness!


Have you ever seen the ASCII separator characters used as they were intended? I don't think I have. It's obvious the problem they were trying to solve, but it was too little too late. It doesn't help that they're control characters that aren't meant to be displayed so they're practically invisible.


CSV is human readable and writable, but that is only part of the advantage of CSV. Its simple format is great as an archival format. It's the plain text of data formats. Decades pass and you can still read it without much of an issue, though people are rightly pointing out that the format is ambiguous is nature and subject to a lot of interpretation. I've recently been compiling a lot of decades old experimental datasets, and it's honestly great when I find plain text files from the 1980s, since we can still read them! It's the more exotic data formats that came about later that are often more difficult to read properly.

The Library of Congress lists CSV files as one of its preferred formats for archival datasets: https://www.loc.gov/preservation/resources/rfs/data.html


Bacuse csv is human readable and human writeable


So is tsv, but for some reason csv is more popular despite being the worst character to use.


For a lot of people, meaningful space is a hard concept to understand.

It's made worse by the number of applications where they will see spaces reformatted willy nilly (for instance web forms eating their line breaks.

From there explaining there are spaces and tabs, and that both can look the same on screen but they are different is just asking for trouble outside of our circles.


> is just asking for trouble outside of our circles

it's asking for trouble in our circles too. Imagine getting a tab-sep file, opening it up in an editor and having it automatically convert to 4 spaces "because", then sending it back without checking something that doesn't look wrong.


Or an input/editor where the `tab` key moves your focus from the text field to the next button. Like Slack. You can't just hit `tab` in Slack when typing a code snippet for something that needs them like a Makefile.


What is human readable about a tab?


Open up a csv in notepad vs opening up a tsv.

One actually looks like a table the other is a garbled mess.


I think we have different definitions of human readable then. To me, human readable means if I print it in a printer, it loses no fidelity. What you describe is what I would call machine-readable. It is able to be imported into a program. The same that a binary separator would also let it be imported.


Even in your example TSV wins.


If those characters had a visual representation and were easily typeable and we settled on using them as separators then we would struggle escaping those special characters within records. The problem is lack of standardization, not lack of special characters for record separators.


ASCII FS/RS/GS/US do not have a visual representation, and users do not know how to enter them. That means they're essentially for binary-like encodings -- they're not really text if users can't see or enter them easily.

But users can see the commas in CSV, and they can trivially enter them. Yeah, the result is messy.

The lesson here is that the separator control characters should have been visual and had a visual indicator on keyboard keycaps to indicate how to enter them. Because they aren't and didn't, they are essentially useless for text.

EDIT: I do happen to know how to enter these on *nix systems. The ascii(7) man page tells you how:

       034   28    1C    FS  (file separator)        134   92    5C    \  '\\'
       035   29    1D    GS  (group separator)       135   93    5D    ]
       036   30    1E    RS  (record separator)      136   94    5E    ^
       037   31    1F    US  (unit separator)        137   95    5F    _
so FS is ^\ (which you have to be careful does not cause the tty to generate SIGQUIT -- you have to ^V^\), ^], ^^, and ^_. That is, <control>\, <control>], <control>^, and <control><shift>-. On U.S. keyboards anyways.


Simple Google explains why:

- non printable character - no single keyboard key - intended for systems, not people - etc


Yeah, we've never had similar issues with line-endings ;)


I think CSV is more popular than one would think because Excel was awful back in the old days.

A lot of the crufty edge cases are artifacts of whatever people had to deal with in 1995 to get data into spreadsheets. We all got used to pushing data around with CSV because the people consuming the data needed it.


That's funny, when i was twelve i need to store some data and i didn't really know database or csv, so i actually used the ascii control character to implement tables and record. I later discovered sqlite, soooo much simpler...


This, a thousand times.

As a fun exercise write a CSV-like spec but using those ASCII chars.

    table = [row.split(unit_sep) for row in data.split(rec_sep)]
There's your ASCIISV decoder.


Based on my understanding of CSV, unit separators are valid column delimiters. I doubt you could use record separators between as row delimiters though.


Wow never heard of it before. The biggest problem is Excel and text editors rule the small data world.


This scary stuff only applies if you need to read garbage CSVs from a million different sources. Even then, like ken says here, that's only like 10 issues to figure out.

Exporting is likely the more common case and is pretty simple. Quote fields if they contain \r\n",

I'd say relax, focus on your particular problem - if a simple 2-3 line export loop fits your use case and is simpler than a csv library (e.g. in C/C++ perhaps where dependencies are a pain) then why not.

This fear mongering about "you can't possibly do this, the library writers are much smarter than you" can lead to it's own ridiculousness of hundreds of dependencies and ungodly waste of processor time, memory, build time, containerization hassle, etc. to solve what should be a simple problem.


> This scary stuff only applies if you need to read garbage CSVs from a million different sources. Even then, like ken says here, that's only like 10 issues to figure out.

Well, garbage CSV's are garbage CSV's. They can be inconsistent, and if you're reading millions of them (OK, even hundreds) it can be a real pain-in-the-ass data janitorial task-- the equivalent of cleaning a public bathroom after homeless people have sprayed diarrhea all over the floors, walls, and ceiling.

The hard part of dealing with the "10 issues" is you have to figure out which ones apply to which files (or even lines). A CSV linter is often helpful to diagnose, that would be the janitorial equivalent of rubber gloves and a bucket of soap, but it's still a lot of work.

If you truly have millions of inconsistent files to deal with, the only way to keep sane is to categorize them by some parameters that make sense... size, date, keywords, even word-stats. Then, you can programmatically tackle each category. It's a messy problem no matter what and gets more ugly as the scale increases. And all of this is often before you even start what you actually need to do.


I agree. Sometimes the business case is needing to read .CSV files that have some pretty bizarre properties and you need a massive library that takes care of all of that.

In my case (I do a LOT of .CSV work) though none of these problems exist. I haven't reached for Python's .CSV library in years and neither have my co-workers. We simply loop through the file, split the strings on commas, and have a few if statements to parse on a situation by situation basis. Extremely old school, but it works very well and is easier than dealing with the objects that are generated from using the library. I realize this probably doesn't work with all use cases on HN though, or even most.

My ideal language has direct support for making file I/O, dictionaries, sets, whatever all in core and easy to use. This keeps me from having to write my own helper classes and modules or bring in 3rd party libraries. Other languages like to have a community package for everything to where you have 9 options for everything. This seems to be common in JS and Perl and is certainly valid, but I'm not a big fan. Python hits a sweet spot for me.


There's a balance. Massaging wacky csv feeds to actually be parsable because someone wrote their own incorrectly is a real problem. Seen it dozens of times.

I think the real problem is programming has become too easy. Almost nobody really has to worry about data structures, memory management, any of the hard problems that filter for competency.

Some wacky insane stuff runs just fine on these 8-core multi-gigabyte ram computers we literally have in our pockets. A lot of it is on a VM inside of yet another VM, even that's fine. Just about every bad idea seems to work.

These kinds of issues are to be expected.


This is my take as well. Most of our CSV concerns center around generating them, not consuming them.

Generating a valid CSV string that Excel wont complain about is approximately 2 lines of LINQ if you have a clean starting point (i.e. a list of class instances with strongly-typed members).

For cases where I have to read garbage CSV, I first go to the source and determine if we can't get a better export or interface into the data. Most shit CSV I see is a consequence of someone hacking together a SQL query to produce the output. I'd be inclined to just request a database export at that point. Obviously not always an option, but you should always insist on having the best representation of the business data before you start banging your head against the wall on parsing it.


I have a saas for non-devs to open big CSVs, so code I've written has parsed thousands of garbage CSVs (and failed to parse thousands others).

Lots of pitfalls are easy to avoid if you already know them, which it sounds like you do. The author is saying that most engineers aren't aware of these issues, and the author is correct.

Writing a for loop is easy, but so is using a language's built in CSV writer.


Learned helplessness. You will never git gud if you're just copying and pasting libraries. I blame npm.


The Pandas CSV reader (Python) is one of the best and most versatile implementations out there. Also very fast.

https://pandas.pydata.org/pandas-docs/stable/reference/api/p...


Also xsv [0] and the related Rust csv crate [1]. xsv in particular is the best CLI csv tool I've ever used.

And I've used them all.

[0] https://github.com/BurntSushi/xsv

[1] https://github.com/BurntSushi/rust-csv


Wow! That's a nifty little library. Just looking at the amount of options going into the main parser routine is mind blowing [1].

And the figuring out of encoding is also complex [2].

Once again, this is just to demonstrate that writing your own CSV parser from scratch is a total waste of time. Just use tools provided by your language. Many languages provide native support for CSV parsing. For example see docs for Microsoft C# [3].

----------

[1] https://github.com/pandas-dev/pandas/blob/v1.0.3/pandas/io/p...

[2] https://github.com/pandas-dev/pandas/blob/v1.0.3/pandas/io/p...

[3] https://docs.microsoft.com/en-us/dotnet/csharp/programming-g...


string.Split is far from native support.

If you don’t want to use third-party libraries then TextFieldParser[0] is part of the .NET framework.

I use CsvHelper[1] in my projects. It can do pretty much anything although the newer versions have some dependencies.

[0] https://docs.microsoft.com/en-us/dotnet/api/microsoft.visual...

[1] https://joshclose.github.io/CsvHelper/


Disagree.

Data.table’s fread is leagues ahead of pandas.

https://www.rdocumentation.org/packages/data.table/versions/...

Fread has automatic footer detection and has automatic skip logic to help parse out mangled headers in some csvs.


Additionally, RStudio has a built in GUI widget for importing .csv (and other) files: https://support.rstudio.com/hc/en-us/articles/218611977-Impo.... I find it quite useful for breezing through the common gotchas with importing data, and it’s probably the easiest way for inexperienced programmers (e.g. business analyst types used to working in Excel) to work with .csv.


fread can also accept a shell command which it will run and use the result as input. I use that to grep the lines I want from files too big for RAM.


Same here!

I also use fwrite to send it back to the shell to continue my processing pipeline.

https://www.rdocumentation.org/packages/data.table/versions/...

It's mind boggling how little fanfare data.table has despite it being the best way to handle tabular data in 2020.


fread is amazing. Whenever I look at the docs and see all the options for dealing with nested lists (like the within-column-seperator) I always pray never to have need of such a thing.


In what ways is it better than the csv module in the Python Standard Library? Or in other words, in which cases is it worth the effort to install pandas just for reading CSV files if your project otherwise doesn't use pandas?


I wrote a Python utility to help our folks at our analytics company load/unload/move data among CSV files and SQL databases.

Getting the parts that work with CSVs to "just work" when presented with an arbitrary CSV file has been super interesting - even with Pandas at our disposal.

https://github.com/bluelabsio/records-mover/blob/d18ec02fdf5...

I wrote up some of the experience here:

https://github.com/bluelabsio/records-mover/blob/master/docs...


Are we talking "fast", or merely "fast for Python"?


It sounds like you may not consider anything python can do to be fast, so I'm thinking the latter.


The built-in

    import csv
never failed to do everything I needed, what does this Panda thing add?


In Rubyland there's the versatile CsvReader library / gem [1] that supports HXL, ARFF, CVY and more "exotic" or modern flavors / dialects among many others.

[1]: https://github.com/csvreader/csvreader


N.B. Floats won't rountdrip by default (you need to call `read_csv(..., float_format="round_trip")`)


Oh .csv, how I love to loath thee.

I have a funny story about .csv. Back in the day I was working on an integration with Fishbowl. Looking through their docs, the data format was your typical XML type stuff, until you got to the import requests. And there, lo and behold, was XML wrapping, you guessed it, .csv. I literally laughed out loud.

https://www.fishbowlinventory.com/wiki/Fishbowl_Legacy_API#I...

Years later they released an updated API. Upon checking it I discovered it's now JSON wrapping .csv, so, you know, progress.

https://www.fishbowlinventory.com/wiki/Fishbowl_API#Import


Paths in SVG are another example where the designers couldn't bring themselves to do everything strictly in XML and decided to use (space) delimited values:

   <path d="M 100 100 L 300 100 L 200 300 z"
        fill="red" stroke="blue" stroke-width="3" />


This is sort of a quasi standard way of describing paths as drawing operations. You'll find this in working with fonts as well and old plotters sort of.


> Do you really still want to roll your own code to handle CSV?

No one really does, but we get backed into it due to these various issues.

The general issue is that there is no real standard for csv, yet it's often used as the touch-point for integration of disparate systems!

The integration goes something like this:

1. A and B decide to integrate! A will accept data from B.

2. A and B want to use a good interchange format, but after weeks of intense negotiations B can only get her IT department to agree to deliver csv (and feels lucky to get that much).

3. B uses excel to provide sample and test files, which are used by A for development or even in early stages of the live integration. Things seem to be going smoothly.

4. At some the export process at B changes, a new csv generator is used and now things break on A's side. Or something on the data input side changes at B so that new forms of data are now present that their csv generator does handle the way A expects, etc.


Eh—I had to do it recently. There was a file format that I was dealing with that was basically a combination of CSV for data and Yaml for headers (and some other edge case stuff) and the Python CSV reader couldn't handle some edge case that we needed handling, mostly around reporting back to the file provider where they messed up one thing or another in the file.

It didn't really take me that long. Lots of easy to write tests and the thing is processing a ton of data now for a well known stock exchange. Far easier than, say, the machine learning stuff I've written.


The breakage can be subtle and unnoticed for a while, and it can be in the future. It sucks.


Funny story on this subject. When I was transitioning from academia to industry, I was afraid (having browsed hackernews so much) that I would be harshly judged for reinventing the wheel. So I removed the CSV library I had written from Github. Shortly thereafter, I was emailed by a well-known programmer (the creator of a fairly major language) asking why I had removed my CSV library, which he was using in one of his projects!

Here's the library, btw: https://github.com/semitrivial/csv_parser


I think the most fun I had parsing a CSV file was when I found out it switched from UTF8 to UTF16 after several thousand lines, as differently encoded CSV files had been concatenated together directly into a single file...


When you hit that kind of data which isn't quite a corner case anymore but already in the realm of subtly broken, you'll actually be happy to have written your own ad-hoc CSV implementation. Figuring out that problem will be much harder if, because programming is scary, you are using a black box library.

Proper CSV handling is simple, just don't try to take shortcuts to make it even simpler than it is. There just happens to be a lot of CSV around that is too broken to be read without a bit of handholding, but no gold standard library will be able take that problem out of your hands (an argument might be made for libraries on the producer side of files). That little list of pitfalls in the article would be a good guideline for "things you should know before" but it doesn't make a particularly compelling point against rolling your own.

Personally I'd even argue that CSV is the only format where, on the consumer side, rolling your own is actually advisable. Because every beyond-corner-case is different and the format is such fertile soil for imperfect as-hoc processing that happens outside of your code.


I have a similar story: parsing a CSV file that had an unexpected 0x1A byte in the middle. Turns out the file was concatenated together in multiple passes from multiple files using the COPY DOS command, which by default adds ASCII SUB to the end as an EOF marker: https://en.wikipedia.org/wiki/Substitute_character


I have done this to myself in the past. Those were....dark days.


The sad thing is that this happened last year.


You don't say, someone sent a CSV like that to me yesterday...


I've seen some terrible CSV files where people tried to squeeze hierarchical data into a flat file format. Usually the customer can't provide any sort of schema or documentation because if they could they would have recognised CSV isn't appropriate. This means you are in an endless cycle of fixing the code as unexpected items keep appearing in the data.


I tend to think of dataframes (R, Python pandas) as the standard way for representing and working with the data contained in a .csv. “List columns” (https://jennybc.github.io/purrr-tutorial/ls13_list-columns.h...) inside dataframes can actually be a convenient way to represent data that is mostly flat, but has some hierarchical data associates it (recent example: dataframe representing a corpus of documents, with a column containing a list of citations). So while .csv probably isn’t the best format to store hierarchical data, it’s fairly trivial to work with if encountered.


Try a UTF8 CSV with Windows-1252 free-form text in some fields. I believe dealing with character encoding is the most frustrating thing a developer can do.


> I believe dealing with character encoding is the most frustrating thing a developer can do.

I'll take your encoding work if you take my timezone work.


Well, considerations like these are typical for more or less any kind of software you will ever write. Because of this all but the very simplest software needs automated tests.

But one may also note that your particular usage of CSV may be a bit simpler than supporting all of the possible complications. Especially if one is not reading CSVs that could come from anywhere.

It depends, having dependencies is always a risk and a nuisance as well. The library you are using may have bugs, acquire bugs in the future, change in incompatible ways and so on.


At a former job we processed files, many CSV, from hundreds of different sources and I was in charge of cleaning up the code that did this once. There were a few tiny binaries without source called csv2tsv and csv2tsv2. No documentation of any kind of course.

csv2tsv just handled quoted fields and I was able to replace it with a few lines of Python without issue.

The csv2tsv2 program was used for CSV files from exactly one company. We couldn't ask them for technical assistance - it's likely their systems had been written years ago and running without updates since then - so I tried to figure out what the binary was doing. The input file had some null characters in it, but they seemed to be used inconsistently, and I was never able to figure out what that binary was supposed to do.

I left that binary alone and kept using it for a few years before a new guy joined and took over the system from me. I mentioned this weird old binary to him and over the course of a week he poked at it now and again before figuring out what it was doing. He used to work at a bank and realized it was using the same quoting method that some old data format they'd used there did - something about doubling characters and a few other tricks.

There's nothing simple someone won't make complicated.


You are thinking of these: https://www.w3.org/Tools/csvtotab-vv

Agreed, they are excellent.


CSV is awesome :-). Leon Bambrick curates an Awesome CSV Page [1] and another Awesome CSV Page [2] by my humble self.

[1]: https://github.com/secretGeek/AwesomeCSV [2]: https://github.com/csvspecs/awesome-csv


> When the final cockroach breathes her last breath, her dying act will be to scratch her date of death in a CSV file for posterity.

CSV is the Keith Richards of file formats.


Thanks Gerald ;)


I can relate. I have been one of those who started writing a CSV library and went down the rabbit hole of edge cases and configurability. Looking back (and only thinking on productivity) my time would have been better spent doing something else. I did, however, learn quite a lot along the way:

- how text is represented in a computer (and all its caveats),

- how to identify and tune slow code,

- how to vectorize code (thanks to Daniel Lemire's https://www.youtube.com/watch?v=wlvKAT7SZIQ ),

- how to encapsulate code better,

- how to use some great CSV tools out there (thanks to Leon's https://github.com/secretGeek/awesomeCSV )

- how to better manage and communicate an open source project.

All in all, I am still not sure it was a mistake. Moreover when Swift didn't really had, at the time, a good Codable interface to CSV. Next time I encounter a similar problem I will probably just wrap a C library and create a good interface, though.

In any case, if you are coding in Swift and find yourself in need of a CSV encoder/decoder, you might want to check https://github.com/dehesa/CodableCSV


My initial comment was going to be that "The Python CSV library is only 448 lines (378 LOC) [1]".

Only later did I realize that the bulk of the work is done by the C library, which is ~1500 LOC [2].

I guess parsing CSVs reliably is a reasonably difficult undertaking.

[1] https://github.com/python/cpython/blob/3.8/Lib/csv.py [2] https://github.com/python/cpython/blob/4a21e57fe55076c77b0ee...


I have a base “library” I’ve used for several years that wraps the csv module (specifically DictReader/Writer) to handle encoding issues as well as data parsing. Have run into minor issues, but think that module alone is one of Python’s best use-cases.


And this is just the tip of the iceberg for data importation.

Managers and customer facing folks want to say Can't you just use off the shelf software for this? If your data importation needs are at all non-trivial, the problems tend to come in 2 categories.

1) Error reporting from off the shelf software is inadequate. If you are lucky, it will consistently point you to the correct row number. It's usually lacking any additional info.

2) If your application is at all non-trivial, you have specific importation needs, not easily covered by 3rd party generic solutions. For instance you want to import well formed tables with primary keys. Lots of solutions claim to handle this case, but you usually run into problems under category (1).

A programmer can overcome these difficulties and muddle through 3rd party solutions, but you probably don't want to hand this over to users. It just won't be a good experience for them.

I completed an importation library tailored for our specific needs and handling the real problems we have experience over many years. The front-end project to make this available to users started and stopped and changed hands over months. We are finally nearing deployment, and so far the users we have shown it to like it.


An additional problem is that Excel, the tool many people use to open CSV files, uses the computers regional settings for decimal separator and thousands separator. CSV files are used as an interchange format, so the format of the numbers in it does not depend on the region of that specific computer.

You can tell Excel to always use '.' as a decimal separator, but then it also uses it for presenting numbers to the user. It boggles my mind that software like Excel doesn't understand the difference between reading/formatting numbers for human use and reading/formatting numbers primarily meant for talking to other software.


Excel, while amazing on its own, is horrible for exchanging data in so many ways. I've maintained a policy at my company (life sciences) not to accept Excel files in any automated process; while parsing csv/tsv is a pain, xlsx/xlsb/etc. would amplify that pain in so many ways both predictable and surprising.

https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-al...


Parsing proper Excel files is fine - you just have to use a proper Excel parser such as EPPlus or Aspose.Cells or similar. In lots of ways this is actually easier than handling CSV files from random sources.


I've done it with Apache POI, and while that 'worked' it doesn't wash away the fundamental problems with Excel datatype guessing. CSV is not great either (no fun explaining encoding or invisible control characters and unusual whitespace to the average IT person) but it is nevertheless less bad.


Excel will also take a number with a leading zero and drop it while it converts it from csv, for ex 01234 -> 1234, which is SUPER bad if you are dealing with any sort of IDs to invoices or work orders from outside the company


That is a good occasion to curse Excel (how does it know the number isn't octal, too?) and also curse the person or group that designed a textual coding system consisting of digits with leading zeros. Looking at you, National Drug Code.


FWIW, the new importer for Excel lets you easily preview and adjust this behavior before importing the entire file.


Well sure, if you're writing a generic CSV parser, it's very complicated. But if you know what your input looks like, CSV can be one of the simplest format to parse.


I think you need to be careful with choosing to use CSV format.

CSV works fine as long as you are not handling character stings. Character strings get messy as soon as they might introduce commas or characters outside of ASCII.

For tabular data which has character strings I prefer to used the ODF format http://opendocumentformat.org/developers/ (.ods file extension) which has good import and export capabilities from Excel, or Google sheets. If the user needs the data in CSV for entry into another application then they can convert within Excel and handle any conversion problems themselves.


Do you have a preferred library for python for example)?

Debian packages python3-odf and https://github.com/eea/odfpy seems to still have some activity (vs the other listed).


I don't. In the past I have just used an XML and a zip library like this article explains https://www.codeguru.com/csharp/csharp/cs_data/xml/article.p...



I am the lead developer behind the Node.js CSV parser [0]. The project started in 2010, an hour before me leaving work for the weekend. It ended up being a long, very long journey that isn't over yet. The project grew incorporating many features I would never have expected, fixing corner cases the several hundred tests were not covering and teaching me a lot about coding and project handling. The bottom line, writing a correct CSV parser takes a little bit more than implementing an advanced string spliting function.

[0] https://csv.js.org/


There's this system, which has become extremely popular for processing data (Apache Spark). And its default configuration cannot parse CSVs properly and can lead to pretty severe data loss. I wrote about this a while ago and it's still an issue. (Yes, we did lose data his way, it's not a theoretical exercise.)

https://kokes.github.io/blog/2019/07/09/losing-data-apache-s...


I wrote my own CSV code rarher thsn using a library because the input CSVs were produced by guys who didn't use a library. The columns had a fixed "real width" but also had a column that contained an indeterminate number of comma-separated, values. One can escape commas in CSV for just such a purpose, but these guys didn't. So, to deal with this issue quickly, I had to basically parse the line forwards until I hit the variable-length field, then backwards until I hit it as well. Then I could parse the field to determine how many values it had and extract them.


My CSV story from last night:

I had these 15MB+ excel sheets and was trying to open them with Apache POI. I gave that code a generous 2GB memory: GC overhead limit reached.

Then I opened them in Excel, saved them as CSV, reached for a CSV-library and was done within seconds.

Ok, well, the story is more about how parsing XML can ruin your day ;)


Are you using POI’s eventmodel APIs to handle such files? The SAX model isn’t too bad for reading files.


It was not emphasized in the article, but the quotes can be both single and double.

That said, it's futile to try to write universal CSV parser. If you need to write your own, just make it possible to choose delimiter and type of quotes and call it a day. LibreOffice Calc does same.


I coded a few robots.txt parsers i.e.: (these are just the ones i cared to publish) https://github.com/franzenzenhofer/simple-functional-robots-... https://github.com/franzenzenhofer/robotstxt

and even though robots.txt seems like a very, very simple text based protocoll there are unanswered mysteries

the biggest mystery is user agent groups and comments

i.e.:

  User-agent: googlebot
  User-agent: bing
  User-agent: yandex
  Disallow: /
so i am disallowing everything for google, bing, yandex; easy enough, but:

  User-agent: googlebot

  User-agent: yandex
  Disallow: / 
means that googlebot has no instructions, but yandex is disallowed all.

if a whole line is commented out

  User-agent: googlebot
  #User-agent: bing
  User-agent: yandex
  Disallow: / 
is a commented out line a blank line or a non existing line?

if it is interpreted as a blank linke, the disallow only counts for yandex, if the line is non existant, it counts for googlebot and yandex.

i like simple things, but sometimes the complexity is between the lines.


  User-agent: googlebot
  #User-agent: bing
  User-agent: yandex
  Disallow: /
why would it be interpreted as a blank line? If you remove everything after the #, that includes the new line characters at the end of the line. Leaving:

  User-agent: googlebot
  User-agent: yandex
  Disallow: /


here the problem comes in, as people write

  User-agent: googlebot
  User-agent: bing    #behaved-badly
  User-agent: yandex
  Disallow: /
if # removes everythin afer the #, then it would be

  User-agent: googlebot
  User-agent: bing    User-agent: yandex
  Disallow: /
which results into the whole line " User-agent: bing User-agent: yandex" beeing thrown out as malformed, so only googlebot would be disallowed.


It's ambiguous, and this is why significant whitespace can be so frustrating. Unless it's specified, some people will interpret a full-line comment as a blank line with a comment ending it (`^#.$`), and others will interpret it as you have (`^#.?\n`). Neither is obviously correct (even if it's obvious to you).

Edit: I don't know how to escape in HN formatting. Obviously there are italics where literal asterisks should be.


  *** You can just use three asterisks. ***
* You can just use three asterisks. *

   Unfortunately you need something after them though. ***
Unfortunately you need something after them though.


Thanks, that's good to know!


If you indent it two spaces, I think that should solve that problem:

   *(`^#.$`), and others will interpret it as you have (`^#.?\n`)*

(Not sure where you intended those asterisks. I made my best inference.)


Yeah I was hoping to keep them inline, but indenting surely would have helped. I intended them after the dots (`.*`, dot asterisk in case I get the escaping wrong, is 0 or more characters in regex).


What's the solution? Semantically it makes sense if it counts for both googlebot and yandex.

Have you observed one/both behaviours?


i am consulting a lot of aggregators, and we coose to interpret sometimes as this, sometimes in the other scenario. the way i interpret the spec, it is not defined behaviour.


> What if there is non ASCII data?

> Just use utf8 right? But wait…

> What if the program reading the CSV use an encoding depending on the locale?

> A program can’t magically know what encoding a file is using. Some will use an encoding depending on the locale of the machine.

Excel (for Mac at least) is a fucking pain in this regard. Just try this minimal UTF-8 example:

  tmpfile=$(mktemp /tmp/XXXXX.csv); echo '“,”' > $tmpfile; open -a 'Microsoft Excel' $tmpfile
Hooray, you successfully opened

  “,”
I'm not even sure in which encoding e280 9c2c e280 9d corresponds to that (not the usual suspect cp1252, nor any code page in the cp1250 to cp1258 range; easy to confirm with iconv).

One remedy is to add a BOM (U+FEFF) to the beginning of the file, but of course no one other than Microsoft (at least in my experience) uses this weird UTF-8 with BOM encoding (which the Unicode standard recommends against), so it breaks other programs correctly decoding UTF-8.

This means I can never share a non-ASCII CSV file with non-technical people. Always have to convert to .xlsx although it's usually easier for me to generate CSV. Then .xlsx opens me up to formatting problems, like phone numbers being treated as natural numbers and automatically displayed in scientific notation... Which means ssconvert or other naive conversion tools aren't enough, I need to use a library like xlsxwriter.

I'm not sure why it's so hard to just fucking ask when you don't know which encoding to use. (Plus it's not super hard to detect UTF-8. uchardet works just fine. Plus my locale is en_US.UTF-8, maybe that's a hint.)


> Hooray, you successfully opened

> “,”

> I'm not even sure in which encoding e280 9c2c e280 9d corresponds to that

That'll be MacRoman (not very surprisingly).

https://en.wikipedia.org/wiki/Mac_OS_Roman


Ah, I thought it has to be something Microsoft.

We can place the last modernization effort to this piece of code, then.


Yes, it’s a pain. However there is a solution to open an UTF-8 CSV file (with Data > From Text and a few clicks), but it’s true that googling something like that is out of the mind of most people.


I know. But imagine sending a file with patronizing instructions on how to open it, which will still be ignored — the other end will double click on the file, see garbage, and get back to you.


What about Libre Office ?


The problem is person at receiving end won't follow instructions for opening in Office, software they're moderately familiar with, but not enough to navigate a special open file process even with hand-holding.

I sincerely do not think the solution is instruct receiving person to install an even less familiar Office alternative.


Typically the biggest problem is saving from Excel to UTF-8 since it really really wants to just use its locale-dependent default charset. Opening is trouble too, but people are occasionally good about noticing that and figuring out how to get to the import options... the re-conversion to a different charset on save happens transparently so people don't notice.

In recent versions there finally seems to just be a new "UTF-8 CSV" option in the Save As dialog.


Implementing your own CSV parser is one of the first lessons a green software engineer will learn in how to NOT try to re-invent the wheel.

Still, please don't import packages with just one goddamned 3-line function in it.


It's what happens when DRY becomes dogma instead of a guideline...

This leads to debacles like the npm left-pad & kik affairs and other scary shit like https://github.com/parro-it/awesome-micro-npm-packages

Yeah - have fun maintaining dependencies when every other module depends on one-liners that can be pulled or broken at random...


One small tip if you receive a CSV with a non-default separator (like ;):

* Open it in a text exitor

* Add "sep=;" on the first line (without the quotes)

* Now at least Excel will open it as intended


Calc just asks upon opening, no hidden tricks needed. Whenever I see people struggling with Excel and CSV, also when pasting CSV data into Excel, I can't stop wondering why people still bother with the Office suite. It has zero advantages aside from the cost of change (time spent on re-learning where things are), yet we teach it to kids in school and send them reminders that Microsoft benevolently gives them very steep discounts to reduce piracy (never mentioning that you could also just use this other free thing...), never mind the legislation saying governments and schools should use free software unless there is some strong and documented reason to divert from the rule.


IIRC Excel will ask for the separator anyway when you open the file?


It does not and actually assumes locale-dependent separators (usually semicolons if your locale uses a comma as a decimal separator and comma otherwise).

If you want to specify separators you have to go via Text Import.


It does not, at least not anymore.


To be fair, sometimes calling some giant java library or installing tons of dependencies is just not realistic. Use the library for the mission critical stuff or the automated reporting without a doubt, but it's important to be practical and thinking on your feet as well.

In the meantime for quick analysis and testing 90% + can be accomplished with one line of (g)awk.

  awk -v FPAT='"[^"]*"|[^,]*' -v OFS='\t' '{$1=$1; print $0}' "$filename" 
More on FPAT [here](https://www.gnu.org/software/gawk/manual/gawk.html#Splitting...)


You can also use xsv for this kind of analysis of CSV files. Which is a unix-style tool designed for CSVs and with a proper CSV parser.

It's written in Rust so it's one binary - no runtime dependencies, and will happily chunk through multi-gigabyte files.

https://github.com/BurntSushi/xsv



It applies to everything, dates, floating point numbers, strings, etc. What we would consider the basics always turn out to be much deeper than they appear on the surface.


Dating advice: if you habitually, unironically, and without good cause use any dating format but that which is mandated by ISO 8601 we cannot be friends, let alone more.

https://twitter.com/jwiechers/status/1205515440543424513 + the thread


After working int he world of CAD tools for decades, I cannot tell you the number of times I've seen a younger engineer try to write a CSV parser only to fall into a circular hacking deathspiral.

You can't just use regex/split to handle CSV, unless you have significant field cleaning BEFORE converting to csv.

In reality you need lexical analysis and grammatical rules to parse any string of symbols. This is often always overlooked by naive implementations.

I take issue with OP's claim that RFC4180 is not well-defined, but almost all of the cases the OP listed are literally in the spec.


Whilst we're talking about libraries, had good experience with CsvHelper for C#/.NET.

https://joshclose.github.io/CsvHelper/

Documentation isn't 100% but it's a good tool. Been using it to work on a production 15+ column SQL->CSV mapping job with tens of thousands of records, working great.


I wrote a CSV-reading library 16 years ago that I'm pleasantly surprised handled most of the points of trickiness the article mentions.

https://www.neilvandyke.org/racket/csv-reading/

The documentation section "Reader Specs" gives a good idea of the different variations it intends to handle. I was working from a sense of variations I'd seen myself, and extrapolating.

One thing I also did, which I didn't see in the article, was to support comments.

Unfortunately, at the time I wrote it, there was no standard Unicode support for Scheme, so I didn't get into that, but just used the character abstraction (which might or might not involve parsing of a particular non-ASCII encoding).

Fortunately, AFAIK, it's worked for people, since fixing/extending it at this point would mean relearning the code. :)


Those are the simple things to worry about, it can even get more complicated in the real world. So complicated that writing our own parser was the only way to go.

So what are some of the more advanced challenges:

* Banking systems that treat CSV export just as some kind of line based dump file with no regard of consistent formatting. If the banking backend was updated, the format might change within a file from one line to the next

* Some misconfigured data dumping pipeline parses CSV the wrong way from another system and emits it in escaped form again in a different format. For instance putting a complete line escaped into a single field. Your parser has to detect that there is a CSV embedded within a CSV.

* Dumping Pipeline treats \r\n as all kinds of silly lines and re-embedding those lines with quotes around them to look like real data

* Inventing completely new specs of what CSV could mean

* Use mainframe character-sets from the last millennium


> Some countries use a comma as decimal separator instead of a colon.

And some companies... like Klarna, fixes this wrongly. At least they did a few years back.

We'd get a CSV file from them, containing payment information, but they ran into the issue that they really wanted to use comma for field separation, but also for krone/øre (decimal) separation. They ingenious solution: Fields are now separated by ", " that's a comma and a space.

Pretty much no CSV parser will accept a multi character field separator. So many fields would just have a space prepended to whatever value was in that field and you'd get a new field, when the parser split the field containing the amount. So now you have say 5 headers but each row would have 6 fields, because the money amount became two fields.

Being on the receiving end you now have to choose, do you want to strip spaces and reconstruct the amount field manually, or parse the CSV file yourself.


If you use .NET then CsvHelper[0] supports multi character separators.

[0] https://joshclose.github.io/CsvHelper/


I totally agree with this. Writing your own csv code is the archetypal example, in computer programming, of reinventing the flat tire.

I've found that every major language has at least one reusable component / module / library / dll to to this. They're maintained by people who have lots more patience than I do.


FYI: I've put together a collection of CSV specifications [1]. CSV Dialects / Flavors include CSV v1.0, CSV v1.1, CSV Strict, CSV <3 Numerics, CSV<3 JSON, CSV <3 YAML and some more.

[1]: https://github.com/csvspecs


Fantastic article!

It's missing: - "what if the quotes are smart quotes?" - This applies to both quotes inside a field and quotes wrapping a field - "what if the file doesn't have a header?" - "what if a logical record is spread over multiple lines?" - Yeah... so, you have line 23 that has the majority of the data in the expected columns, line 24 has additional data in columns 23, 25, and 28 that's supposed to be appended onto the previous line's data that also appears in columns 23, 25, and 28. Line 25, same thing, but line 26 starts a new logical record.

Seen all three above in the wild and Python's (pretty awesome) CSV code didn't handle them. Queue the custom code that wraps Python's core CSV parsing.


Would've been a lot nicer if they used the record and unit separator instead of commas and newlines. There are characters made specifically for storing tables like this and they instead chose to reuse characters which might appear in regular text.


If you can't write a correct (according to RFC 4180) CSV parser in 15 lines of Perl then you're not a Real Programmer!

(Yes, I understand that not all CSV files in the wild are correct according to RFC 4180. Yes, I also understand that only grey-beard loons still use Perl.)

The strangest thing about CSV, I think, is that if there's only one field per record and the file ends with a CRLF then you can't tell whether there's a final record containing an empty field following that CRLF. It's probably best to assume there isn't.


15? damn, that's 10 lines less than my awk csv parser. oh well...


I prefer to use a SQLite database. And if a customer insists on CSV then I still use a SQLite database to import and/or export the CSV without them realising it. SQLite brings sanity to CSVs.


You still have to import the CSV though, so the problems are still there, you just don't have to deal with them after a successful import.


You misunderstand how enterprise integrations work. It's easy to tell a customer: "your CSV is not compatible with standard off the shelf tooling, indeed the most popular database engine in the world" than "your CSV doesn't seem to work with our software/library/<proprietary ETL framework>". The former makes it their problem, the latter will almost undoubtably make it your problem. In case you don't know: SQLite has a really well designed and very performant CSV import function. BTW did you really downvote me for that? Wow.


Just today I needed to write a .csv file in Python on Windows and got a file with CR CR LF line endings by default (\r\r\n). That was using the standard 'import csv' in Python 3.


yeah you gotta pass

  with open(...,newline='') as f:


I think the docs now include this.


yes, they have for some time now i think. I wish it was better known, though.



Sometimes the best thing you do is learn at least one parsing framework so that you can write your own parsers. The advantage of this is you know exactly what the behavior will be.

Here's a CSV parser in ~40 lines of C++ using Boost Spirit that handles all kinds of weird cases:

https://gcc.godbolt.org/z/KGUHXo

It's easy to customize, declarative, and compiles down to about ~15KB.


I wrote my own CSV parser for https://www.easydatatransform.com. I ran into some cases where it wasn't clear what to do. For example where you had some whitespace outside of quotes. So I checked what Excel and Numbers (the Mac spreadsheet) do. They parse the same string differently...


doesn't seem insanely difficult

dont be scared of coding guys, just differentiate between environments: do it for your 4fun project, not prod.


The problem isn't really that it's difficult to solve. It's because it's impossible. There is no real formal spec. The RFC does not specify edge cases and nobody respects it anyways. "Generic" CSV parser is a pipe dream, don't try to do it.

Now, if you define your own set of rules on what is and is not allowed then it's fine.


Indeed it's a perfect example to illustrate the difference and the importance of an under-defined spec and a well defined one.

Whatever you don't define, will be undefined. But you still somehow have to do something definite without a definition.

The customer expects predictable output at the end, and until we have not only clairvoyance, but clairvoyance that can be built in a machine, you can't have predictable output without either predictable input, or a spec that actually provides an answer for any input.


Even when using libraries to read CSV data the need for sanitization isn't uncommon. Sure there's RFC4180, but variations are common. Mysterious records that are out of sync are typical.

For my own sanity I like to verify that columns are of the right data type where possible.

That said, I still prefer CSV over heavier formats (XML,JSON) where the data conforms.


Can we somehow start to use the ascii 30 character aka the Record Separator (SP)? This would solve a lot of the problems!


Not as fun as you might think. I had to parse and write inkjet control files for commercial printing. It used record separators, group separators, and unit separators.

The problem? Records began WITH the RS character and there were NO line breaks.

Ended up writing my own library https://github.com/Wombatpm/VIPNT


A good one I saw recently was a text file that, even though it had a .csv file extension was actually using another character as a delimiter (in this case colons) but as it contained large amounts of textual data actually had a enough commas to be parsed as a CSV file.

VS Code and a suitable extension handled this rather better than Excel.


Sounds like CSV in this case stood for Colon Separate Data


Further generalization: Character Separated Values


I write CSV code all the time in different languages. It's a fun challenge that incorporates several nuances. It has been a form of code kata for me over the last 20 years. Biggest lesson, one does not simply split a string. You have to scan through and tokenize following the state of a quoted field.


ndjson(new line delimited json) is what we should use as a human friendly and machine friendly alternative to csv.

json is a universal format. Works in every major language. It's spec is a state machine defining what to do with every possible encountered byte. https://www.json.org/json-en.html

SIMDJSON parses json at RAM speeds (~3Ghz).

CSV should die. It's a terrible format. Just send arrays of ndjson instead. You'll probably have a few more " and [, but hey, you can deeply nest arrays inside arrays of objects with arrays.

https://github.com/simdjson/simdjson


JSON doesn’t handle int64 or explicit date-time data types, which makes it nearly as painful as CSV in practice when used for data integration. Both of those are explicit types present in nearly every real-world database schema.

So you end up with many of the same problems as CSV. You basically get an application-specific JSON dialect to interpret.

Header row or not? Is a long series of digits in quotes an int64? What happens when someone puts an ISO-formatted date string into a field that is supposed to be a simple string?


Some fast implementations in C++?

In particular, I tried in the last weeks to work on a OpenMP / MPI driven CSV parser, but surely accounting for new lines inside quotes can be a pain the ass, I'm wondering if there's a go-to implementation or model to implement that.


I did this just for fun to learn a parsing library I was using. It didn't deal with utf-8 though, but even that is manageable. The complexity with csv is that it is a family of formats with small changes. But if you make your parser configurable, you are set.


If you want to do something really hard, try to parse PDF, that is sheer insanity.


CSV is not a portable format across applications. That's it. You should use it only in your own application or in connection with applications that you know. The big problem is to think that you can handle a generic CSV file, because there is no such thing.


Just because all kinds of inputs exist, doesn't mean you should support handling them all.


What about implementing something that works for your environment and making it fail-fast so that any unexpected input will just abort+log the current operation? You could then expand on that on-demand.


This assumes you can recognize unexpected input.


I assume unexpected input results in either an error (or exception, whatever your language calls it), a crash or a user complaint. I also assume it's unlikely for an input parser to introduce a security issue into your app if it's written in a safe language and your app has proper validation set up (i.e. don't trust input as soon as it passes the parser).


I don't want to but there is nothing that I've found in c# other than a helper tool that will only translate objects to CSV rows.


I'm currently implementing a REST API that returns CSV files.


Me too. Everything was great until Mr O’Connor came along.


>Some countries use a comma as decimal separator instead of a colon.

I kind of wish we could all just pick something and stick with it.


Yes, and we picked a comma. You?


That sentence must have a typo in it, right?

Wikipedia [1] doesn't list any country that does use the colon (:) as the decimal separator, so possibly the period/dot (.) as used in most English-speaking countries, was meant?

1: https://en.wikipedia.org/wiki/Decimal_separator


Yeah, I think what author meant "some countries don't use the period but a comma, thus hilarity ensues" (especially when your code respects the locale and the csv file is written in append mode)


Like the metric system?


The metric system, being French in origin, would certainly use the comma!


i wanted to... but not anymore :/


import csv


I disagree with the article, I'd say writing your own CSV code is the second best option, behind not using CSV at all.

The reasons are:

- CSV is really simple. A complete RFC4180 parser shouldn't take more than 100 lines of readable code to implement. Even less so for the writer. So much that most of the code is likely to be interface code (file I/O, matching your internal data structures, etc...) rather than parsing. And chances are that a library won't even make your code shorter.

- As mentioned in the article, CSV is not well defined. I've seen all sorts of weirdness: semicolons vs colons, LF vs CRLF, simple vs double quotes, backslash escaping, comments, floating point numbers with comas instead of points. What your library understands as a CSV may not be the files you have to work with. I'd rather have a simple piece of code that correspond to what I am given rather than a big library that uses AI techniques to try to guess the intent of whatever wrote this file.

There are cases where it would be foolish not to use a library. I will never write a XML parser for instance, it is a well defined and complex format. JSON is borderline but I still won't write my own parser. CSV is essentially a proprietary format, so it is fitting to use a proprietary parser/writer.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: