So you want to write your own CSV code (2014)

ken · on May 15, 2020

I wrote my own CSV code. That's only about 10 issues, and I understand them all, and they're easy to handle, and to test (unlike, say, threading issues, or OS bugs that don't happen on my computer).

That wasn't even close to being one of the more difficult parts of my program to write. I've had to do zero work to maintain it. I don't seem to have ever had any bugs filed against it. Are these supposed to be difficult? Sure, don't write your own encryption algorithm, and avoid touching threads if at all possible, but CSV?

Sure, I wish I could have used a library. Unfortunately, once you start adding requirements, the options start to vanish. It needs to be a C (or C-callable) library. It needs to accept input as Unicode strings (lines), not just files on disk, or bytes. It needs to return results line-at-a-time. It needs to have a compatible license. It needs to support the same package manager as I'm using (or be a trivial to include, like a single file). It needs to be reliable (essentially bug-free, or at the very least currently maintained). It needs to be fairly efficient. What library supports all these?

> Writing CSV code that works with files out there in the real world is a difficult task. The rabbit hole goes deep. Ruby CSV library is 2321 lines.

I love Ruby, but that file looks a little nuts to me right now. The block comment at the top of csv.rb is over 500 lines long. The Ruby module tries to do everything. Just "def open(filename, mode="r", options)" and its docstring are 100 lines long! This class includes methods for auto-detecting the encoding, and auto-detecting common date formats, and some kind of system "for building Unix-like filters for CSV data".

I don't put all of that into my CSV parser, because I want my code to be modular. I have features like data detectors and filters and open-file-by-filename in my program, too, but they're fully independent, not tied to any one file format. When I come up with a better way to detect dates in text, I don't want to tweak a 6-line regex in every file format I've written.

rovolo · on May 15, 2020

1) You wrote a general CSV input parser, but what about someone writing CSV output? The obvious way (write a ',' between each cell and a '\n' between each row) works most of the time, but it doesn't always work. The output needs to be escaped correctly, and data escaping should be well tested. I think the issue is that CSV sounds very easy to implement, but most naive implementations will be wrong.

2) You took on exactly the responsibilities the article author said you'd have to take on. Based on the screenshots for your "Strukt" application, it looks like the user controls configuration/validation for parsing CSV and they will fix any issues with parsing the CSV file themselves.

> If a supplied CSV is arbitrary, the only real way to make sure the data is correct is for an user to check it and eventually specify the delimiter, quoting rule,… Barring that you may end up with a error or worse silently corrupted data.

3) I think the better question is, how do you handle malformed CSV? Is your error correction as robust as Ruby's? Do you handle malformed CSV the same way as other libraries? If it's handled differently by different parsers, you can run into issues like Apple recently did [0]

[0] https://siguza.github.io/psychicpaper/

ken · on May 15, 2020

> You took on exactly the responsibilities the article author said you'd have to take on.

Yep, and having been there, I’m reporting: it’s not nearly as bad as this post makes it sound.

I’ve written web apps, too, and (for example) CSS is 100 times worse. Nobody is suggesting every website should use only browser default styles, though.

> I think the better question is, how do you handle malformed CSV?

What does “malformed” mean? The good/bad thing about CSV is that virtually every text file is valid! The only malformation I can think of is an open quote with no matching close quote (so the entire rest of the document is one value). My implementation is streaming, so there’s no great way to flag it: any future data could have the other quote!

We show the data as it’s parsed, so it should be obvious to the user what is going on, and where.

There are areas of computing that are too complex. I’m usually the first to complain about such things. I really don’t think CSV parsing is one of them.

sheetjs · on May 15, 2020

It's even worse than the author is suggesting. For most people, "RFC4180" is meaningless, all that matters is what Excel does. And that means you need to handle a bunch of cases if you are reading AND if you are writing files. A few cases not discussed in the blog post:

- if your file starts with \x49\x44 ("ID"), Excel will interpret the file as their symbolic link .SLK format. So if you're writing files, the ID should be wrapped in double quotes even if it isn't necessary according to RFC4180

- Excel will proactively try to "evaluate" fields that start with \x3d ("="). You can see this in action with the sample file

    1,2,3
    =SUM(A1:C1)

- Excel will aggressively interpret values as dates when possible. For example, SEPT1 issues https://genomebiology.biomedcentral.com/articles/10.1186/s13...

CSV parsing / writing certainly isn't going to be a value driver for most companies (if you're supporting user imports, you really care about XLSX/XLSB/XLS files and Google Sheets import), but it's not a trivial problem.

btilly · on May 15, 2020

"What does “malformed” mean? The good/bad thing about CSV is that virtually every text file is valid! The only malformation I can think of is an open quote with no matching close quote (so the entire rest of the document is one value). My implementation is streaming, so there’s no great way to flag it: any future data could have the other quote!*

An example of improper csv code is this:

hello,world,"And this, contains "bad quotes"...",1,2,3

Lest you think this is made up, I ran across this when someone cut and paste Excel into a text field.

I also have seen batch processing of user files break hard when a quote issue like this caused a hand-rolled CSV parser to conclude that half the file was a single very long field.

rovolo · on May 15, 2020

I think you have one of the best use cases for rolling your own parser. Your tool's purpose is to read and parse arbitrary data (and then transform and display it). But, I think you're misinterpreting how much work the post is saying CSV will take to implement:

> Easy right? You can write the code yourself in just a few lines.

My take-away from the post is: if you are parsing arbitrary CSV files, you need to make parsing configurable because there's no one, true CSV format. If you are writing CSV files, you may need to escape your fields in a weird, outdated manner.

P.S. By "malformed", I meant whether the 2D matrix of byte arrays is read exactly as intended. It could be caused by an open quote, but it could be incorrect escaping or inconsistent delimiters. Since there's no inline schema saying which CSV parsing configuration is being used, you must ask the user to configure the CSV parser and validate the output.

greggman3 · on May 15, 2020

> I don't put all of that into my CSV parser,

The art of not cluttering a library with too much stuff is a hard lesson to acquire. Most library authors try to throw in the kitchen sink. It's insane. I'm sure I was guilty of it in the past.

For parser take a string. let me do the I/O outside. If you want to offer streams then just offer an abstract interface stream and then a few examples of implementing a stream but don't include them in your library. It's just clutter.

This is one if largest problems with NPM. So many libraries try to do too much. Some come with a command line utility "just because"!??!?! It then adds more surface area, more dependencies, more churn as we have to update all those un-needed dependencies.

mikepurvis · on May 15, 2020

Isn't NPM also notorious for ultra minimalist libraries like leftpad as well, though?

tehbeard · on May 15, 2020

Leftpad is more a symptom of JS lacking a real standard library than anything else.

seabass · on May 15, 2020

Yes. One of the costs of having a low barrier to entry for publishing is that the quality is generally lower. NPM still hosts tons of very well written code, but you have to sift through quite a lot of noise to find it.

burntsushi · on May 15, 2020

> Sure, I wish I could have used a library. Unfortunately, once you start adding requirements, the options start to vanish. It needs to be a C (or C-callable) library. It needs to accept input as Unicode strings (lines), not just files on disk, or bytes. It needs to return results line-at-a-time. It needs to have a compatible license. It needs to support the same package manager as I'm using (or be a trivial to include, like a single file). It needs to be reliable (essentially bug-free, or at the very least currently maintained). It needs to be fairly efficient. What library supports all these?

I always thought people just used libcsv in cases like this. If that's missing something though, then I believe my Rust csv library[1] would be good enough if I added a C API, which would be pretty easy and I would be happy to do it if it really filled a niche. I think the only hurdle at that point would be whether you could tolerate a Rust compiler in your build dependencies.

[1] - https://docs.rs/csv-core/0.1.10/csv_core/

geraldbauer · on May 15, 2020

> I love Ruby, but that file looks a little nuts to me right now. The block comment at the top of csv.rb is over 500 lines long. The Ruby module tries to do everything...

FYI: CsvReader [1] is a more modern alternative to the old standard csv.rb library in ruby.

[1]: https://github.com/csvreader/csvreader

mey · on May 15, 2020

I wish the JVM was less of a pariah. I love being able to just reach for https://commons.apache.org/proper/commons-csv/user-guide.htm...

Here is the entire logic of the CSV parser (test code is in a separate path) https://gitbox.apache.org/repos/asf?p=commons-csv.git;a=tree...

ken · on May 15, 2020

I'm not as anti-JVM as I once was, though the recent licensing changes have left me confused and nervous. I read their licensing FAQ (again) and I still can't tell if I'm allowed to include recent versions of their JVM in my application.

Were I to use the JVM for my background processing needs (the thought has occurred to me!), I'd definitely use Clojure -- 'data.csv' is less than 150 lines, including comments, for both the parser and formatter!

mey · on May 15, 2020

I suggest looking into very specifically AdoptOpenJDK.

https://adoptopenjdk.net/faq.html

A lot of people have been digging into the general insanity of Java licensing since Java 8's divergence. Here is a good overview https://medium.com/@javachampions/java-is-still-free-2-0-0-6...

the_af · on May 15, 2020

In what sense is the JVM a pariah? It's one of the most successful software platforms.

mey · on May 15, 2020

- Oracle's licensing confusion (where does Java and Java(TM) begin/end).

- The JVM historically has included the kitchen sink (Java 9 started addressing this).

- People's impressions that it is slow or a memory hog. Edit: (Looking at you Eclipse/Intellij/Glassfish/JBoss/Atlassian)

- Experiences with bad code bases due to it's ubiquity and age. Old code bases tend to evolve in interesting ways.

- It's not the new shiny thing.

I have meet plenty of developers who dismiss Java out of hand. I've also seen DBA's dismiss FKs. I don't think how good/bad it is has anything to do with people's dislike of it, more that there is a certain vocal group that dislike it for being the "enterprise" solution.

the_af · on May 15, 2020

But this does not correlate to how used the JVM actually is in the outside world. The JVM is huge. It's not a pariah. The perceptions of JVM slowness have been debunked again and again, and anyone who cares knows they are mostly a red herring and have been so for years.

Java and the JVM are mature, battle-tested, widely used technologies in the industry, easy to hire for and with multitude of libraries for almost whatever you want to do.

It may not be the latest shiny new thing, but that's different. Pariah means an outcast. Neither Java nor the JVM are outcasts.

rmrfrmrf · on May 15, 2020

having been bit by both reinventing the wheel and overly-relying on third party code, i find that my position on this varies from week to week.

chipotle_coyote · on May 15, 2020

As I find myself writing my own PHP quasi-framework from scratch in 2020, I am definitely feeling this. On the one hand, reinventing a wheel like this is utterly insane. On the other hand, for the simple project I'm doing this for, nearly all the other wheels I've found turn out to be monster truck tires, milspec tank treads, or waterskis.

heavenlyblue · on May 15, 2020

I am betting that Ruby library still can’t handle a good 50% of the files that people throw at it because there’s not such thing as CSV standard.

On the other hand writing a custom parser using some off-the-shelf parser combinator library is easy and straightforward (ie I know how much time it will take to implement one and what exactly it does).

Sometimes having multiple implementations of the same is better than a single generalised implementation struggling from combinatorial explosion.

goto11 · on May 16, 2020

True, is not rocket science to write a CSV parser. The problem with CSV is it looks so deceptively simple so people write simplistic ad-hoc parsers/exporters all the time, so we end up with a bunch of incompatibilities.

platz · on May 15, 2020

There is a big difference between writing an application and being a library author.

cesaref · on May 15, 2020

One thing that has always puzzled me is why CSV exists, given the ASCII unit/record seperators appear to do exactly what the comma and newline characters are attempting to do with codes which could appear within records, and have been well defined since some point in the 60s.

And here we are, 60 years later still struggling to work out where a record ends...

atoav · on May 15, 2020

Easy: CSVs are human readable and writeable and humans tend to have a comma sign on their keyboard, while they don't have the ASCII record seperator visible (teaching them to use eg Alt + 30 will not work).

And if they had that key on their keyboard then you'd have the comma problems all over again: What if a ASCII record seperator shows up in the field?

r00fus · on May 15, 2020

If the ASCII separators were supported regularly in text editors since the early days then it would remove 90% of the problems.

I prefer the | pipe character as a delimiter - easy to see, not part of common speech or titles, and enterable via keyboard. Yes, it can exist in the field but less likely.

niccl · on May 15, 2020

But if it _might_ turn up, you have to assume it will and deal with it, and then you're into the exact same problem as before, but with something even less 'standard' than before.

I think it's like cryptography. Why bother to roll your own when there are people who are cleverer (certainly than me. I don't know about you) who've already put a lot of effort in to this, so just use one of the well tested standard libraries and don't mess with it

r00fus · on May 15, 2020

I'm not recommending "rolling your own". You still use CSV libraries, but with a pipe delimiter instead.

chapium · on May 15, 2020

It looks a LOT like l and I.

r00fus · on May 15, 2020

Luckily code sees it as ASCII 124. If you're saying that for compliance, | can be converted to l/I then it's a feature, not a downside.

throwaway_pdp09 · on May 15, 2020

> What if a ASCII record seperator shows up

point is, being not human readable and having no keyboard key, you can reasonably expect those special separators not to.

I suppose I'd consider such special-separated files (26 to 29 from memory?) to be machine generated and machine readable only, not intended as human readable without a bit of extra software or eg. a special emacs mode

roenxi · on May 15, 2020

You're almost certainly describing a file format that already exists and that nobody uses because it is not human-interactable.

Anyone can edit a csv. I often have. An important feature.

The only real stumper in the CSV format is why they didn't use \ as the escape in strings like everyone else. Probably some good reason.

eyelidlessness · on May 15, 2020

While backslash escaping is very common, double-quote (or double-special-anything) is fairly common as well. I use it all the time in SQL.

throwaway_pdp09 · on May 15, 2020

Interesting. I don't think I have ever hand edited a CSV (unless you mean created from scratch, well then, perhaps).

OTOH the ability to just look at it... that I've found very valuable and agree there.

sokoloff · on May 15, 2020

When considering standard A that can only work with machine generated and machine readable files and standard B that can work with all those files plus human-readable/editable files, I don’t find it surprising at all that standard B wins in the market.

ken · on May 15, 2020

It happens. XPM/PBM/XBM lost to TIFF (and later PNG). Postscript lost to PDF. OpenOffice.org XML couldn't displace Excel's Binary File Format.

As much as I wish that beauty or usability was the primary indicator of market success, the simpler explanation that explains all of these (including CSV) is: Microsoft Office supported them.

throwaway_pdp09 · on May 15, 2020

I don't know about XPM/PBM/XBM lost but AIUI PDF is just a subset of postscript with compression. If you think MS's binary excel file format is all that secret, I suggest you try unzipping it - it is no more than compressed XML. I have rescued parts of corrupt excel and word files by doing just that.

mark-r · on May 15, 2020

MS completely redesigned their binary formats a few years ago. You're right about the current format, but the earlier one was completely undecipherable. Even the current XML has quirks in it to be compatible with the older format.

mike22 · on May 17, 2020

It was reversible enough to write the Apache POI library. I maintain code that uses the HSSF streaming API portion of the library, which mainly gives you low-level access to the structures in XLS files. The OO.org folks made some very nice documentation: http://www.openoffice.org/sc/excelfileformat.pdf

ken · on May 15, 2020

PDF diverged from Postscript years ago.

Microsoft’s binary Excel format is not XML -- you’re thinking of its replacement. It was COM/OLE or some such.

greggman3 · on May 15, 2020

Binary formats win I think when the ASCII format is significantly greater in storage size and parsing speed. XPM/PBM/XBM are enormous compared to any binary image format. And remember these images used to get put on 1.4meg floppies and sent over 2400baud modems.

It's basically use the right tool for the job. For pixels that's a binary format. For mostly text, text itself works pretty well.

throwaway_pdp09 · on May 15, 2020

There is no standard but one for CSV files (the RFC 4180 mentioned).

The whole article pretty well rams home the fact there is no standard B because there is no standard which is adhered to.

sokoloff · on May 15, 2020

Humans are not bound by the computer science definition of “standard”.

But you’re correct that I was using the word “standard” imprecisely to mean something more like “file extension” rather than an IETF RFC.

ISL · on May 15, 2020

If someone copy-pastes a line from one CSV to a string within a field of another, the second will have record separators from the first within a single field.

u801e · on May 15, 2020

> CSVs are human readable and writeable and humans tend to have a comma sign on their keyboard

Do people usually compose CSV data by hand? I thought they would use a program like Excel, enter or generate their data in separate cells and then save the file as a CSV. There's no reason why a program like Excel (or any other program for that matter) couldn't use the record separator instead of a comma as a delimiter when generating or consuming such files.

BiteCode_dev · on May 15, 2020

Compose, rarely. Read and edit with plain text editor, it's common.

mywittyname · on May 15, 2020

I did it like three times already today.

u801e · on May 15, 2020

What editor did you use?

mywittyname · on May 15, 2020

vim / sublime text, depending on context.

I do a lot of troubleshooting using CLI tools like grep and wc.

ganstyles · on May 15, 2020

I do quite a bit for GitHub markdown tables, for making small annotations for ML, for python parsing, etc. Very common and would be annoying too open up excel every time I needed some csv.

u801e · on May 15, 2020

I thought markdown tables used the pipe character as the delimiter between table columns.

Also, is not necessary to use excel, and a lot of the issues brought up in the article wouldn't be the case with a different delimiter character. Editors could easily be updated to make entering that delimiter character easier to enter by mapping it to a key like tab whenever a file like that is opened.

It just seems like a problem that could be solved, but can't be due to inertia.

yiyus · on May 15, 2020

> I thought markdown tables used the pipe character as the delimiter between table columns.

Markdown (the original one by John Gruber) does not include any syntax for tables. Other implementations have included it, but there is no standard. As far as I know, using the pipe character to separate columns is indeed quite common, but it is not the only way.

Kluny · on May 15, 2020

In my experience, they'll have a database that they administrate through excel, then output it as CSV for whatever tool that needs to process it.

louis_pasteur · on May 15, 2020

But humans rarely use notepad these days. They use code editors like Coda or VSCode at the very least which have all kinds of advanced features. Surely, those can include support for ASCII separators?

sfifs · on May 15, 2020

Let's say we have a product master. Something looks fishy and we want a brand person or product supply person to check... Push a CSV and it's trivially easy for them to open in XL and check, make corrections and send.

In a business context, this happens far more often than you may expect. Sure you can build a custom platform to validate and make people connect to a form connecting to a database - but Excel is a great user interface and everyone knows excel.

A funny incident - we were struggling to build a complex workflow in KNIME where at some points we need user input. Nothing out of the box was great - tools either assume a dashboard paradigm or a data flow paradigm - nothing In between.

One of our creative folks came up with the solution of writing to a CSV and getting KNIME to open it in excel. The user would make changes and save, close excel and continue the workflow in KNIME. Even completely non technical people got it.

hnick · on May 15, 2020

We had to ban our non-technical users from opening CSVs in Excel because it would strip leading zeros and do other exciting things like convert large numbers to 1.23123E+20 :(

It does this without any notification or warning when you save.

Tostino · on May 15, 2020

Kill me...i've dealt with flat file ETL for ~9 years now, with the first 2 pretty much dedicated to it. I hate hate hate Excel for no reason other than that. It has wasted so many hours due to customers messing up their identifiers by opening up their feeds in Excel prior to sending, and accidentally transforming like above, then having to fix changing customer identifiers used throughout the system once they figure out what happened.

Semaphor · on May 15, 2020

Excel can only display CSV files. I had to ban myself from editing files in Excel because it’s atrocious at it. It also likes to somewhat change decimal-seperators (again, without any warning).

Someone · on May 15, 2020

Even technical users have trouble round tripping data to Excel.

https://genomebiology.biomedcentral.com/articles/10.1186/s13...:

“A programmatic scan of leading genomics journals reveals that approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions.”

That paper is from 2016, at least 12 years after the problem was first identified.

mark-r · on May 15, 2020

East coast zip codes with leading zeros are a real problem.

mywittyname · on May 15, 2020

That behavior is technically correct because those are all valid representations of that value, e.g., 007449938 = 7449938 = 7.449938E6. Pretty much any plain text (or binary) data format will have this same issue, unless it supports additional metadata for display purposes.

If you intended the data to be a string value, then it should have been enclosed in quotes.

yiyus · on May 15, 2020

The problem is that when you have a number like 123451234512345, it may get rewritten as 1.234512E+14, and then you are loosing information.

hnick · on May 16, 2020

Your assertion sounded incorrect so I tested it. Quoting large numbers does not preserve them on a save.

Adding a ' before it should work, but for data extracted from arbitrary systems you'll have few guarantees of the sort. Formatting the cells in Excel to display the full value instead of the truncated value also works from memory, but you won't always know ahead of time if this happened later in the file.

In our case it was often mailing identifier barcodes so any loss of precision made them entirely useless.

rovolo · on May 15, 2020

What if the value is an ID rather than an integer? CSV doesn't specify whether the data type is a string or a number. The schema needs to be inferred by the parser/user.

recursive · on May 15, 2020

> If you intended the data to be a string value, then it should have been enclosed in quotes.

That's not part of any CSV specification I've seen, including RFC4180.

SifJar · on May 15, 2020

but if using the ASCII separators was common, excel would support that and it seems like this workflow would be unaffected?

paulryanrogers · on May 15, 2020

People still need to open them in non-excel apps. And the format may predate widespread use of spreadsheet software.

Fun fact, Excel will truncate numbers beyond 15 digits unless prefixed with a single quote mark.

the_angry_angel · on May 15, 2020

> But humans rarely use notepad these days. They use code editors like Coda or VSCode at the very least which have all kinds of advanced features. Surely, those can include support for ASCII separators?

I do a fair amount of work with companies that do "EDI" over CSV (or worse CSV-like - think 2 CSVs jammed together with different formats, no headers, no support for escaping or quoting) and fixed width documents. I can absolutely assure you that humans do open these files in notepad far more often than I'd like.

Often one of the main reasons they don't use things like X12, ASCII separators, etc. is because a "human needs to open it at some point" was a prevailing business decision some number of years ago (think "what happens if the IT system fails? how can we still ship stuff even in a complete emergency") and now it's baked into their documented process so deeply its like shouting into the wind to alter things. Third party warehouses are the worst at this.

roto · on May 15, 2020

Developers are by no means the only humans that use and/or edit CSV files.

beaconstudios · on May 15, 2020

Developers rarely use notepad. Non technical users still have to edit csvs by hand, all over the world.

C1sc0cat · on May 15, 2020

The average CSV use is as an export/import from excel and most of those users don't have specialised developer editors.

kube-system · on May 15, 2020

I have not seen very many business analysts open an IDE. In my experience, most edit CSVs in Excel.

JauntyHatAngle · on May 15, 2020

I don't think your first line is at all true.

Modifying Csvs with notepad is rife. I'd wager more than using any one particular code editor.

arethuza · on May 15, 2020

Or they use Excel, which can introduce its own entertaining errors.

A few years back while working on something that unavoidably used large quantities of CSV data we would sternly warn people not to use Excel, but people still did it ("But I just looked at it!").

snazz · on May 15, 2020

The TSV (tab-separated values) format is supported by a lot of software (Google Sheets, for instance) and solves the "comma in the field" problem quite nicely. Tabs can easily be typed by a human, but they don't usually occur within a one-line string, so they don't cause the same problems as CSVs do.

hulahoof · on May 15, 2020

TSV, CSV, PSV etc. all share the same class of problem. The important thing is that you are quoting your fields and escaping your quotes, regardless of the symbols used.

JoBrad · on May 15, 2020

Importantly, tabs aren’t usually inserted into fields by Excel users. With Power BI, exporting TSV is a little easier for non-dev Excel power users, too.

IggleSniggle · on May 15, 2020

Nailed it. When Excel users press tab, they mean "go to the next cell." Conveniently, this is also what the format means to the computer when it encounters the tab separator. It's not foolproof, but it's much better. The human interpretation of the character matches the machine interpretation.

DataWorker · on May 15, 2020

“Don’t usually occur”

They do. People write code to create all sorts of bastardized abominations of “csv” or tab delim or whatever. It’s why the featured article gets reposted every few years. You can define a standard for csv files, but then Excel does it’s own thing and here we are.

barbegal · on May 15, 2020

I think that stems from the fact that no software ever properly supported them, they are non-printable characters so you can't see them easily. Whereas with CSV you can fire up notepad.exe and explore the data.

Given the inability to standardise on line endings "\n" "\r\n" or "\r" I don't think we can standardise on using unit or record separators

tinus_hn · on May 15, 2020

No, we certainly can’t expect people to use the ASCII ‘Record separator’ character as a record separator, that would be madness!

mark-r · on May 15, 2020

Have you ever seen the ASCII separator characters used as they were intended? I don't think I have. It's obvious the problem they were trying to solve, but it was too little too late. It doesn't help that they're control characters that aren't meant to be displayed so they're practically invisible.

atrettel · on May 15, 2020

CSV is human readable and writable, but that is only part of the advantage of CSV. Its simple format is great as an archival format. It's the plain text of data formats. Decades pass and you can still read it without much of an issue, though people are rightly pointing out that the format is ambiguous is nature and subject to a lot of interpretation. I've recently been compiling a lot of decades old experimental datasets, and it's honestly great when I find plain text files from the 1980s, since we can still read them! It's the more exotic data formats that came about later that are often more difficult to read properly.

The Library of Congress lists CSV files as one of its preferred formats for archival datasets: https://www.loc.gov/preservation/resources/rfs/data.html

skrebbel · on May 15, 2020

Bacuse csv is human readable and human writeable

nojito · on May 15, 2020

So is tsv, but for some reason csv is more popular despite being the worst character to use.

hrktb · on May 15, 2020

For a lot of people, meaningful space is a hard concept to understand.

It's made worse by the number of applications where they will see spaces reformatted willy nilly (for instance web forms eating their line breaks.

From there explaining there are spaces and tabs, and that both can look the same on screen but they are different is just asking for trouble outside of our circles.

mgkimsal · on May 15, 2020

> is just asking for trouble outside of our circles

it's asking for trouble in our circles too. Imagine getting a tab-sep file, opening it up in an editor and having it automatically convert to 4 spaces "because", then sending it back without checking something that doesn't look wrong.

SAI_Peregrinus · on May 15, 2020

Or an input/editor where the `tab` key moves your focus from the text field to the next button. Like Slack. You can't just hit `tab` in Slack when typing a code snippet for something that needs them like a Makefile.

JamesSwift · on May 15, 2020

What is human readable about a tab?

nojito · on May 15, 2020

Open up a csv in notepad vs opening up a tsv.

One actually looks like a table the other is a garbled mess.

JamesSwift · on May 15, 2020

I think we have different definitions of human readable then. To me, human readable means if I print it in a printer, it loses no fidelity. What you describe is what I would call machine-readable. It is able to be imported into a program. The same that a binary separator would also let it be imported.

nojito · on May 15, 2020

Even in your example TSV wins.

steerablesafe · on May 15, 2020

If those characters had a visual representation and were easily typeable and we settled on using them as separators then we would struggle escaping those special characters within records. The problem is lack of standardization, not lack of special characters for record separators.

cryptonector · on May 15, 2020

ASCII FS/RS/GS/US do not have a visual representation, and users do not know how to enter them. That means they're essentially for binary-like encodings -- they're not really text if users can't see or enter them easily.

But users can see the commas in CSV, and they can trivially enter them. Yeah, the result is messy.

The lesson here is that the separator control characters should have been visual and had a visual indicator on keyboard keycaps to indicate how to enter them. Because they aren't and didn't, they are essentially useless for text.

EDIT: I do happen to know how to enter these on *nix systems. The ascii(7) man page tells you how:

       034   28    1C    FS  (file separator)        134   92    5C    \  '\\'
       035   29    1D    GS  (group separator)       135   93    5D    ]
       036   30    1E    RS  (record separator)      136   94    5E    ^
       037   31    1F    US  (unit separator)        137   95    5F    _

so FS is ^\ (which you have to be careful does not cause the tty to generate SIGQUIT -- you have to ^V^\), ^], ^^, and ^_. That is, <control>\, <control>], <control>^, and <control><shift>-. On U.S. keyboards anyways.

billme · on May 15, 2020

Simple Google explains why:

- non printable character - no single keyboard key - intended for systems, not people - etc

luckylion · on May 15, 2020

Yeah, we've never had similar issues with line-endings ;)

Spooky23 · on May 15, 2020

I think CSV is more popular than one would think because Excel was awful back in the old days.

A lot of the crufty edge cases are artifacts of whatever people had to deal with in 1995 to get data into spreadsheets. We all got used to pushing data around with CSV because the people consuming the data needed it.

hexmiles · on May 15, 2020

That's funny, when i was twelve i need to store some data and i didn't really know database or csv, so i actually used the ascii control character to implement tables and record. I later discovered sqlite, soooo much simpler...

carapace · on May 15, 2020

This, a thousand times.

As a fun exercise write a CSV-like spec but using those ASCII chars.

    table = [row.split(unit_sep) for row in data.split(rec_sep)]

There's your ASCIISV decoder.

eyelidlessness · on May 15, 2020

Based on my understanding of CSV, unit separators are valid column delimiters. I doubt you could use record separators between as row delimiters though.

rb808 · on May 15, 2020

Wow never heard of it before. The biggest problem is Excel and text editors rule the small data world.

QuadrupleA · on May 15, 2020

This scary stuff only applies if you need to read garbage CSVs from a million different sources. Even then, like ken says here, that's only like 10 issues to figure out.

Exporting is likely the more common case and is pretty simple. Quote fields if they contain \r\n",

I'd say relax, focus on your particular problem - if a simple 2-3 line export loop fits your use case and is simpler than a csv library (e.g. in C/C++ perhaps where dependencies are a pain) then why not.

This fear mongering about "you can't possibly do this, the library writers are much smarter than you" can lead to it's own ridiculousness of hundreds of dependencies and ungodly waste of processor time, memory, build time, containerization hassle, etc. to solve what should be a simple problem.

crispyambulance · on May 15, 2020

> This scary stuff only applies if you need to read garbage CSVs from a million different sources. Even then, like ken says here, that's only like 10 issues to figure out.

Well, garbage CSV's are garbage CSV's. They can be inconsistent, and if you're reading millions of them (OK, even hundreds) it can be a real pain-in-the-ass data janitorial task-- the equivalent of cleaning a public bathroom after homeless people have sprayed diarrhea all over the floors, walls, and ceiling.

The hard part of dealing with the "10 issues" is you have to figure out which ones apply to which files (or even lines). A CSV linter is often helpful to diagnose, that would be the janitorial equivalent of rubber gloves and a bucket of soap, but it's still a lot of work.

If you truly have millions of inconsistent files to deal with, the only way to keep sane is to categorize them by some parameters that make sense... size, date, keywords, even word-stats. Then, you can programmatically tackle each category. It's a messy problem no matter what and gets more ugly as the scale increases. And all of this is often before you even start what you actually need to do.

7thaccount · on May 15, 2020

I agree. Sometimes the business case is needing to read .CSV files that have some pretty bizarre properties and you need a massive library that takes care of all of that.

In my case (I do a LOT of .CSV work) though none of these problems exist. I haven't reached for Python's .CSV library in years and neither have my co-workers. We simply loop through the file, split the strings on commas, and have a few if statements to parse on a situation by situation basis. Extremely old school, but it works very well and is easier than dealing with the objects that are generated from using the library. I realize this probably doesn't work with all use cases on HN though, or even most.

My ideal language has direct support for making file I/O, dictionaries, sets, whatever all in core and easy to use. This keeps me from having to write my own helper classes and modules or bring in 3rd party libraries. Other languages like to have a community package for everything to where you have 9 options for everything. This seems to be common in JS and Perl and is certainly valid, but I'm not a big fan. Python hits a sweet spot for me.

kristopolous · on May 15, 2020

There's a balance. Massaging wacky csv feeds to actually be parsable because someone wrote their own incorrectly is a real problem. Seen it dozens of times.

I think the real problem is programming has become too easy. Almost nobody really has to worry about data structures, memory management, any of the hard problems that filter for competency.

Some wacky insane stuff runs just fine on these 8-core multi-gigabyte ram computers we literally have in our pockets. A lot of it is on a VM inside of yet another VM, even that's fine. Just about every bad idea seems to work.

These kinds of issues are to be expected.

bob1029 · on May 15, 2020

This is my take as well. Most of our CSV concerns center around generating them, not consuming them.

Generating a valid CSV string that Excel wont complain about is approximately 2 lines of LINQ if you have a clean starting point (i.e. a list of class instances with strongly-typed members).

For cases where I have to read garbage CSV, I first go to the source and determine if we can't get a better export or interface into the data. Most shit CSV I see is a consequence of someone hacking together a SQL query to produce the output. I'd be inclined to just request a database export at that point. Obviously not always an option, but you should always insist on having the best representation of the business data before you start banging your head against the wall on parsing it.

jastr · on May 15, 2020

I have a saas for non-devs to open big CSVs, so code I've written has parsed thousands of garbage CSVs (and failed to parse thousands others).

Lots of pitfalls are easy to avoid if you already know them, which it sounds like you do. The author is saying that most engineers aren't aware of these issues, and the author is correct.

Writing a for loop is easy, but so is using a language's built in CSV writer.

zelly · on May 15, 2020

Learned helplessness. You will never git gud if you're just copying and pasting libraries. I blame npm.

joshlk · on May 15, 2020

The Pandas CSV reader (Python) is one of the best and most versatile implementations out there. Also very fast.

https://pandas.pydata.org/pandas-docs/stable/reference/api/p...

VWWHFSfQ · on May 15, 2020

Also xsv [0] and the related Rust csv crate [1]. xsv in particular is the best CLI csv tool I've ever used.

And I've used them all.

[0] https://github.com/BurntSushi/xsv

[1] https://github.com/BurntSushi/rust-csv

valera_rozuvan · on May 15, 2020

Wow! That's a nifty little library. Just looking at the amount of options going into the main parser routine is mind blowing [1].

And the figuring out of encoding is also complex [2].

Once again, this is just to demonstrate that writing your own CSV parser from scratch is a total waste of time. Just use tools provided by your language. Many languages provide native support for CSV parsing. For example see docs for Microsoft C# [3].

----------

[1] https://github.com/pandas-dev/pandas/blob/v1.0.3/pandas/io/p...

[2] https://github.com/pandas-dev/pandas/blob/v1.0.3/pandas/io/p...

[3] https://docs.microsoft.com/en-us/dotnet/csharp/programming-g...

tjalfi · on May 15, 2020

string.Split is far from native support.

If you don’t want to use third-party libraries then TextFieldParser[0] is part of the .NET framework.

I use CsvHelper[1] in my projects. It can do pretty much anything although the newer versions have some dependencies.

[0] https://docs.microsoft.com/en-us/dotnet/api/microsoft.visual...

[1] https://joshclose.github.io/CsvHelper/

nojito · on May 15, 2020

Disagree.

Data.table’s fread is leagues ahead of pandas.

https://www.rdocumentation.org/packages/data.table/versions/...

Fread has automatic footer detection and has automatic skip logic to help parse out mangled headers in some csvs.

jointpdf · on May 15, 2020

Additionally, RStudio has a built in GUI widget for importing .csv (and other) files: https://support.rstudio.com/hc/en-us/articles/218611977-Impo.... I find it quite useful for breezing through the common gotchas with importing data, and it’s probably the easiest way for inexperienced programmers (e.g. business analyst types used to working in Excel) to work with .csv.

vharuck · on May 15, 2020

fread can also accept a shell command which it will run and use the result as input. I use that to grep the lines I want from files too big for RAM.

nojito · on May 15, 2020

Same here!

I also use fwrite to send it back to the shell to continue my processing pipeline.

https://www.rdocumentation.org/packages/data.table/versions/...

It's mind boggling how little fanfare data.table has despite it being the best way to handle tabular data in 2020.

jsmith99 · on May 15, 2020

fread is amazing. Whenever I look at the docs and see all the options for dealing with nested lists (like the within-column-seperator) I always pray never to have need of such a thing.

roelschroeven · on May 15, 2020

In what ways is it better than the csv module in the Python Standard Library? Or in other words, in which cases is it worth the effort to install pandas just for reading CSV files if your project otherwise doesn't use pandas?

apiology · on May 15, 2020

I wrote a Python utility to help our folks at our analytics company load/unload/move data among CSV files and SQL databases.

Getting the parts that work with CSVs to "just work" when presented with an arbitrary CSV file has been super interesting - even with Pandas at our disposal.

https://github.com/bluelabsio/records-mover/blob/d18ec02fdf5...

I wrote up some of the experience here:

https://github.com/bluelabsio/records-mover/blob/master/docs...

ken · on May 15, 2020

Are we talking "fast", or merely "fast for Python"?

ganstyles · on May 15, 2020

It sounds like you may not consider anything python can do to be fast, so I'm thinking the latter.

lucb1e · on May 15, 2020

The built-in

    import csv

never failed to do everything I needed, what does this Panda thing add?

geraldbauer · on May 15, 2020

In Rubyland there's the versatile CsvReader library / gem [1] that supports HXL, ARFF, CVY and more "exotic" or modern flavors / dialects among many others.

[1]: https://github.com/csvreader/csvreader

andreareina · on May 15, 2020

N.B. Floats won't rountdrip by default (you need to call `read_csv(..., float_format="round_trip")`)

redleggedfrog · on May 15, 2020

Oh .csv, how I love to loath thee.

I have a funny story about .csv. Back in the day I was working on an integration with Fishbowl. Looking through their docs, the data format was your typical XML type stuff, until you got to the import requests. And there, lo and behold, was XML wrapping, you guessed it, .csv. I literally laughed out loud.

https://www.fishbowlinventory.com/wiki/Fishbowl_Legacy_API#I...

Years later they released an updated API. Upon checking it I discovered it's now JSON wrapping .csv, so, you know, progress.

https://www.fishbowlinventory.com/wiki/Fishbowl_API#Import

arethuza · on May 15, 2020

Paths in SVG are another example where the designers couldn't bring themselves to do everything strictly in XML and decided to use (space) delimited values:

   <path d="M 100 100 L 300 100 L 200 300 z"
        fill="red" stroke="blue" stroke-width="3" />

th0ma5 · on May 15, 2020

This is sort of a quasi standard way of describing paths as drawing operations. You'll find this in working with fonts as well and old plotters sort of.

jmull · on May 15, 2020

> Do you really still want to roll your own code to handle CSV?

No one really does, but we get backed into it due to these various issues.

The general issue is that there is no real standard for csv, yet it's often used as the touch-point for integration of disparate systems!

The integration goes something like this:

1. A and B decide to integrate! A will accept data from B.

2. A and B want to use a good interchange format, but after weeks of intense negotiations B can only get her IT department to agree to deliver csv (and feels lucky to get that much).

3. B uses excel to provide sample and test files, which are used by A for development or even in early stages of the live integration. Things seem to be going smoothly.

4. At some the export process at B changes, a new csv generator is used and now things break on A's side. Or something on the data input side changes at B so that new forms of data are now present that their csv generator does handle the way A expects, etc.

3pt14159 · on May 15, 2020

Eh—I had to do it recently. There was a file format that I was dealing with that was basically a combination of CSV for data and Yaml for headers (and some other edge case stuff) and the Python CSV reader couldn't handle some edge case that we needed handling, mostly around reporting back to the file provider where they messed up one thing or another in the file.

It didn't really take me that long. Lots of easy to write tests and the thing is processing a ton of data now for a well known stock exchange. Far easier than, say, the machine learning stuff I've written.

cryptonector · on May 15, 2020

The breakage can be subtle and unnoticed for a while, and it can be in the future. It sucks.

xamuel · on May 15, 2020

Funny story on this subject. When I was transitioning from academia to industry, I was afraid (having browsed hackernews so much) that I would be harshly judged for reinventing the wheel. So I removed the CSV library I had written from Github. Shortly thereafter, I was emailed by a well-known programmer (the creator of a fairly major language) asking why I had removed my CSV library, which he was using in one of his projects!

Here's the library, btw: https://github.com/semitrivial/csv_parser

DougBTX · on May 15, 2020

I think the most fun I had parsing a CSV file was when I found out it switched from UTF8 to UTF16 after several thousand lines, as differently encoded CSV files had been concatenated together directly into a single file...

usrusr · on May 15, 2020

When you hit that kind of data which isn't quite a corner case anymore but already in the realm of subtly broken, you'll actually be happy to have written your own ad-hoc CSV implementation. Figuring out that problem will be much harder if, because programming is scary, you are using a black box library.

Proper CSV handling is simple, just don't try to take shortcuts to make it even simpler than it is. There just happens to be a lot of CSV around that is too broken to be read without a bit of handholding, but no gold standard library will be able take that problem out of your hands (an argument might be made for libraries on the producer side of files). That little list of pitfalls in the article would be a good guideline for "things you should know before" but it doesn't make a particularly compelling point against rolling your own.

Personally I'd even argue that CSV is the only format where, on the consumer side, rolling your own is actually advisable. Because every beyond-corner-case is different and the format is such fertile soil for imperfect as-hoc processing that happens outside of your code.

GolDDranks · on May 15, 2020

I have a similar story: parsing a CSV file that had an unexpected 0x1A byte in the middle. Turns out the file was concatenated together in multiple passes from multiple files using the COPY DOS command, which by default adds ASCII SUB to the end as an EOF marker: https://en.wikipedia.org/wiki/Substitute_character

SanchoPanda · on May 15, 2020

I have done this to myself in the past. Those were....dark days.

GolDDranks · on May 15, 2020

The sad thing is that this happened last year.

oefrha · on May 15, 2020

You don't say, someone sent a CSV like that to me yesterday...

tonyedgecombe · on May 15, 2020

I've seen some terrible CSV files where people tried to squeeze hierarchical data into a flat file format. Usually the customer can't provide any sort of schema or documentation because if they could they would have recognised CSV isn't appropriate. This means you are in an endless cycle of fixing the code as unexpected items keep appearing in the data.

jointpdf · on May 15, 2020

I tend to think of dataframes (R, Python pandas) as the standard way for representing and working with the data contained in a .csv. “List columns” (https://jennybc.github.io/purrr-tutorial/ls13_list-columns.h...) inside dataframes can actually be a convenient way to represent data that is mostly flat, but has some hierarchical data associates it (recent example: dataframe representing a corpus of documents, with a column containing a list of citations). So while .csv probably isn’t the best format to store hierarchical data, it’s fairly trivial to work with if encountered.

jaxx75 · on May 15, 2020

Try a UTF8 CSV with Windows-1252 free-form text in some fields. I believe dealing with character encoding is the most frustrating thing a developer can do.

recursive · on May 15, 2020

> I believe dealing with character encoding is the most frustrating thing a developer can do.

I'll take your encoding work if you take my timezone work.

cjfd · on May 15, 2020

Well, considerations like these are typical for more or less any kind of software you will ever write. Because of this all but the very simplest software needs automated tests.

But one may also note that your particular usage of CSV may be a bit simpler than supporting all of the possible complications. Especially if one is not reading CSVs that could come from anywhere.

It depends, having dependencies is always a risk and a nuisance as well. The library you are using may have bugs, acquire bugs in the future, change in incompatible ways and so on.

polm23 · on May 15, 2020

At a former job we processed files, many CSV, from hundreds of different sources and I was in charge of cleaning up the code that did this once. There were a few tiny binaries without source called csv2tsv and csv2tsv2. No documentation of any kind of course.

csv2tsv just handled quoted fields and I was able to replace it with a few lines of Python without issue.

The csv2tsv2 program was used for CSV files from exactly one company. We couldn't ask them for technical assistance - it's likely their systems had been written years ago and running without updates since then - so I tried to figure out what the binary was doing. The input file had some null characters in it, but they seemed to be used inconsistently, and I was never able to figure out what that binary was supposed to do.

I left that binary alone and kept using it for a few years before a new guy joined and took over the system from me. I mentioned this weird old binary to him and over the course of a week he poked at it now and again before figuring out what it was doing. He used to work at a bank and realized it was using the same quoting method that some old data format they'd used there did - something about doubling characters and a few other tricks.

There's nothing simple someone won't make complicated.

SanchoPanda · on May 16, 2020

You are thinking of these: https://www.w3.org/Tools/csvtotab-vv

Agreed, they are excellent.

geraldbauer · on May 15, 2020

CSV is awesome :-). Leon Bambrick curates an Awesome CSV Page [1] and another Awesome CSV Page [2] by my humble self.

[1]: https://github.com/secretGeek/AwesomeCSV [2]: https://github.com/csvspecs/awesome-csv

JoBrad · on May 15, 2020

> When the final cockroach breathes her last breath, her dying act will be to scratch her date of death in a CSV file for posterity.

CSV is the Keith Richards of file formats.

LeonB · on May 17, 2020

Thanks Gerald ;)

dehesa · on May 15, 2020

I can relate. I have been one of those who started writing a CSV library and went down the rabbit hole of edge cases and configurability. Looking back (and only thinking on productivity) my time would have been better spent doing something else. I did, however, learn quite a lot along the way:

- how text is represented in a computer (and all its caveats),

- how to identify and tune slow code,

- how to vectorize code (thanks to Daniel Lemire's https://www.youtube.com/watch?v=wlvKAT7SZIQ ),

- how to encapsulate code better,

- how to use some great CSV tools out there (thanks to Leon's https://github.com/secretGeek/awesomeCSV )

- how to better manage and communicate an open source project.

All in all, I am still not sure it was a mistake. Moreover when Swift didn't really had, at the time, a good Codable interface to CSV. Next time I encounter a similar problem I will probably just wrap a C library and create a good interface, though.

In any case, if you are coding in Swift and find yourself in need of a CSV encoder/decoder, you might want to check https://github.com/dehesa/CodableCSV

asadjb · on May 15, 2020

My initial comment was going to be that "The Python CSV library is only 448 lines (378 LOC) [1]".

Only later did I realize that the bulk of the work is done by the C library, which is ~1500 LOC [2].

I guess parsing CSVs reliably is a reasonably difficult undertaking.

[1] https://github.com/python/cpython/blob/3.8/Lib/csv.py [2] https://github.com/python/cpython/blob/4a21e57fe55076c77b0ee...

JoBrad · on May 15, 2020

I have a base “library” I’ve used for several years that wraps the csv module (specifically DictReader/Writer) to handle encoding issues as well as data parsing. Have run into minor issues, but think that module alone is one of Python’s best use-cases.

jackfoxy · on May 15, 2020

And this is just the tip of the iceberg for data importation.

Managers and customer facing folks want to say Can't you just use off the shelf software for this? If your data importation needs are at all non-trivial, the problems tend to come in 2 categories.

1) Error reporting from off the shelf software is inadequate. If you are lucky, it will consistently point you to the correct row number. It's usually lacking any additional info.

2) If your application is at all non-trivial, you have specific importation needs, not easily covered by 3rd party generic solutions. For instance you want to import well formed tables with primary keys. Lots of solutions claim to handle this case, but you usually run into problems under category (1).

A programmer can overcome these difficulties and muddle through 3rd party solutions, but you probably don't want to hand this over to users. It just won't be a good experience for them.

I completed an importation library tailored for our specific needs and handling the real problems we have experience over many years. The front-end project to make this available to users started and stopped and changed hands over months. We are finally nearing deployment, and so far the users we have shown it to like it.

roelschroeven · on May 15, 2020

An additional problem is that Excel, the tool many people use to open CSV files, uses the computers regional settings for decimal separator and thousands separator. CSV files are used as an interchange format, so the format of the numbers in it does not depend on the region of that specific computer.

You can tell Excel to always use '.' as a decimal separator, but then it also uses it for presenting numbers to the user. It boggles my mind that software like Excel doesn't understand the difference between reading/formatting numbers for human use and reading/formatting numbers primarily meant for talking to other software.

sk5t · on May 15, 2020

Excel, while amazing on its own, is horrible for exchanging data in so many ways. I've maintained a policy at my company (life sciences) not to accept Excel files in any automated process; while parsing csv/tsv is a pain, xlsx/xlsb/etc. would amplify that pain in so many ways both predictable and surprising.

https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-al...

arethuza · on May 15, 2020

Parsing proper Excel files is fine - you just have to use a proper Excel parser such as EPPlus or Aspose.Cells or similar. In lots of ways this is actually easier than handling CSV files from random sources.

sk5t · on May 15, 2020

I've done it with Apache POI, and while that 'worked' it doesn't wash away the fundamental problems with Excel datatype guessing. CSV is not great either (no fun explaining encoding or invisible control characters and unusual whitespace to the average IT person) but it is nevertheless less bad.

tlbsofware · on May 15, 2020

Excel will also take a number with a leading zero and drop it while it converts it from csv, for ex 01234 -> 1234, which is SUPER bad if you are dealing with any sort of IDs to invoices or work orders from outside the company

sk5t · on May 15, 2020

That is a good occasion to curse Excel (how does it know the number isn't octal, too?) and also curse the person or group that designed a textual coding system consisting of digits with leading zeros. Looking at you, National Drug Code.

JoBrad · on May 15, 2020

FWIW, the new importer for Excel lets you easily preview and adjust this behavior before importing the entire file.

bproctor · on May 15, 2020

Well sure, if you're writing a generic CSV parser, it's very complicated. But if you know what your input looks like, CSV can be one of the simplest format to parse.

barbegal · on May 15, 2020

I think you need to be careful with choosing to use CSV format.

CSV works fine as long as you are not handling character stings. Character strings get messy as soon as they might introduce commas or characters outside of ASCII.

For tabular data which has character strings I prefer to used the ODF format http://opendocumentformat.org/developers/ (.ods file extension) which has good import and export capabilities from Excel, or Google sheets. If the user needs the data in CSV for entry into another application then they can convert within Excel and handle any conversion problems themselves.

guerby · on May 15, 2020

Do you have a preferred library for python for example)?

Debian packages python3-odf and https://github.com/eea/odfpy seems to still have some activity (vs the other listed).

barbegal · on May 15, 2020

I don't. In the past I have just used an XML and a zip library like this article explains https://www.codeguru.com/csharp/csharp/cs_data/xml/article.p...

shrimp_emoji · on May 15, 2020

https://docs.python.org/3/library/csv.html ? :o

wdavidw · on May 15, 2020

I am the lead developer behind the Node.js CSV parser [0]. The project started in 2010, an hour before me leaving work for the weekend. It ended up being a long, very long journey that isn't over yet. The project grew incorporating many features I would never have expected, fixing corner cases the several hundred tests were not covering and teaching me a lot about coding and project handling. The bottom line, writing a correct CSV parser takes a little bit more than implementing an advanced string spliting function.

[0] https://csv.js.org/

drej · on May 15, 2020

There's this system, which has become extremely popular for processing data (Apache Spark). And its default configuration cannot parse CSVs properly and can lead to pretty severe data loss. I wrote about this a while ago and it's still an issue. (Yes, we did lose data his way, it's not a theoretical exercise.)

https://kokes.github.io/blog/2019/07/09/losing-data-apache-s...

_vbdg · on May 15, 2020

I wrote my own CSV code rarher thsn using a library because the input CSVs were produced by guys who didn't use a library. The columns had a fixed "real width" but also had a column that contained an indeterminate number of comma-separated, values. One can escape commas in CSV for just such a purpose, but these guys didn't. So, to deal with this issue quickly, I had to basically parse the line forwards until I hit the variable-length field, then backwards until I hit it as well. Then I could parse the field to determine how many values it had and extract them.

beders · on May 15, 2020

My CSV story from last night:

I had these 15MB+ excel sheets and was trying to open them with Apache POI. I gave that code a generous 2GB memory: GC overhead limit reached.

Then I opened them in Excel, saved them as CSV, reached for a CSV-library and was done within seconds.

Ok, well, the story is more about how parsing XML can ruin your day ;)

mike22 · on May 17, 2020

Are you using POI’s eventmodel APIs to handle such files? The SAX model isn’t too bad for reading files.

Hitton · on May 15, 2020

It was not emphasized in the article, but the quotes can be both single and double.

That said, it's futile to try to write universal CSV parser. If you need to write your own, just make it possible to choose delimiter and type of quotes and call it a day. LibreOffice Calc does same.

franze · on May 15, 2020

I coded a few robots.txt parsers i.e.: (these are just the ones i cared to publish) https://github.com/franzenzenhofer/simple-functional-robots-... https://github.com/franzenzenhofer/robotstxt

and even though robots.txt seems like a very, very simple text based protocoll there are unanswered mysteries

the biggest mystery is user agent groups and comments

i.e.:

  User-agent: googlebot
  User-agent: bing
  User-agent: yandex
  Disallow: /

so i am disallowing everything for google, bing, yandex; easy enough, but:

  User-agent: googlebot

  User-agent: yandex
  Disallow: /

means that googlebot has no instructions, but yandex is disallowed all.

if a whole line is commented out

  User-agent: googlebot
  #User-agent: bing
  User-agent: yandex
  Disallow: /

is a commented out line a blank line or a non existing line?

if it is interpreted as a blank linke, the disallow only counts for yandex, if the line is non existant, it counts for googlebot and yandex.

i like simple things, but sometimes the complexity is between the lines.

iams · on May 15, 2020

  User-agent: googlebot
  #User-agent: bing
  User-agent: yandex
  Disallow: /

why would it be interpreted as a blank line? If you remove everything after the #, that includes the new line characters at the end of the line. Leaving:

  User-agent: googlebot
  User-agent: yandex
  Disallow: /

franze · on May 15, 2020

here the problem comes in, as people write

  User-agent: googlebot
  User-agent: bing    #behaved-badly
  User-agent: yandex
  Disallow: /

if # removes everythin afer the #, then it would be

  User-agent: googlebot
  User-agent: bing    User-agent: yandex
  Disallow: /

which results into the whole line " User-agent: bing User-agent: yandex" beeing thrown out as malformed, so only googlebot would be disallowed.

eyelidlessness · on May 15, 2020

It's ambiguous, and this is why significant whitespace can be so frustrating. Unless it's specified, some people will interpret a full-line comment as a blank line with a comment ending it (`^#.$`), and others will interpret it as you have (`^#.?\n`). Neither is obviously correct (even if it's obvious to you).

Edit: I don't know how to escape in HN formatting. Obviously there are italics where literal asterisks should be.

traes · on May 15, 2020

  *** You can just use three asterisks. ***

* You can just use three asterisks. *

   Unfortunately you need something after them though. ***

Unfortunately you need something after them though.

eyelidlessness · on May 17, 2020

Thanks, that's good to know!

DoreenMichele · on May 15, 2020

If you indent it two spaces, I think that should solve that problem:

   *(`^#.$`), and others will interpret it as you have (`^#.?\n`)*

(Not sure where you intended those asterisks. I made my best inference.)

eyelidlessness · on May 17, 2020

Yeah I was hoping to keep them inline, but indenting surely would have helped. I intended them after the dots (`.*`, dot asterisk in case I get the escaping wrong, is 0 or more characters in regex).

ThreeFx · on May 15, 2020

What's the solution? Semantically it makes sense if it counts for both googlebot and yandex.

Have you observed one/both behaviours?

franze · on May 15, 2020

i am consulting a lot of aggregators, and we coose to interpret sometimes as this, sometimes in the other scenario. the way i interpret the spec, it is not defined behaviour.

oefrha · on May 15, 2020

> What if there is non ASCII data?

> Just use utf8 right? But wait…

> What if the program reading the CSV use an encoding depending on the locale?

> A program can’t magically know what encoding a file is using. Some will use an encoding depending on the locale of the machine.

Excel (for Mac at least) is a fucking pain in this regard. Just try this minimal UTF-8 example:

  tmpfile=$(mktemp /tmp/XXXXX.csv); echo '“,”' > $tmpfile; open -a 'Microsoft Excel' $tmpfile

Hooray, you successfully opened

  ‚Äú,‚Äù

I'm not even sure in which encoding e280 9c2c e280 9d corresponds to that (not the usual suspect cp1252, nor any code page in the cp1250 to cp1258 range; easy to confirm with iconv).

One remedy is to add a BOM (U+FEFF) to the beginning of the file, but of course no one other than Microsoft (at least in my experience) uses this weird UTF-8 with BOM encoding (which the Unicode standard recommends against), so it breaks other programs correctly decoding UTF-8.

This means I can never share a non-ASCII CSV file with non-technical people. Always have to convert to .xlsx although it's usually easier for me to generate CSV. Then .xlsx opens me up to formatting problems, like phone numbers being treated as natural numbers and automatically displayed in scientific notation... Which means ssconvert or other naive conversion tools aren't enough, I need to use a library like xlsxwriter.

I'm not sure why it's so hard to just fucking ask when you don't know which encoding to use. (Plus it's not super hard to detect UTF-8. uchardet works just fine. Plus my locale is en_US.UTF-8, maybe that's a hint.)

jfk13 · on May 15, 2020

> Hooray, you successfully opened

> ‚Äú,‚Äù

> I'm not even sure in which encoding e280 9c2c e280 9d corresponds to that

That'll be MacRoman (not very surprisingly).

https://en.wikipedia.org/wiki/Mac_OS_Roman

oefrha · on May 15, 2020

Ah, I thought it has to be something Microsoft.

We can place the last modernization effort to this piece of code, then.

tasogare · on May 15, 2020

Yes, it’s a pain. However there is a solution to open an UTF-8 CSV file (with Data > From Text and a few clicks), but it’s true that googling something like that is out of the mind of most people.

oefrha · on May 15, 2020

I know. But imagine sending a file with patronizing instructions on how to open it, which will still be ignored — the other end will double click on the file, see garbage, and get back to you.

BlueTemplar · on May 15, 2020

What about Libre Office ?

eyelidlessness · on May 15, 2020

The problem is person at receiving end won't follow instructions for opening in Office, software they're moderately familiar with, but not enough to navigate a special open file process even with hand-holding.

I sincerely do not think the solution is instruct receiving person to install an even less familiar Office alternative.

zerocrates · on May 15, 2020

Typically the biggest problem is saving from Excel to UTF-8 since it really really wants to just use its locale-dependent default charset. Opening is trouble too, but people are occasionally good about noticing that and figuring out how to get to the import options... the re-conversion to a different charset on save happens transparently so people don't notice.

In recent versions there finally seems to just be a new "UTF-8 CSV" option in the Save As dialog.

papito · on May 15, 2020

Implementing your own CSV parser is one of the first lessons a green software engineer will learn in how to NOT try to re-invent the wheel.

Still, please don't import packages with just one goddamned 3-line function in it.

qayxc · on May 15, 2020

It's what happens when DRY becomes dogma instead of a guideline...

This leads to debacles like the npm left-pad & kik affairs and other scary shit like https://github.com/parro-it/awesome-micro-npm-packages

Yeah - have fun maintaining dependencies when every other module depends on one-liners that can be pulled or broken at random...

leokennis · on May 15, 2020

One small tip if you receive a CSV with a non-default separator (like ;):

* Open it in a text exitor

* Add "sep=;" on the first line (without the quotes)

* Now at least Excel will open it as intended

lucb1e · on May 15, 2020

Calc just asks upon opening, no hidden tricks needed. Whenever I see people struggling with Excel and CSV, also when pasting CSV data into Excel, I can't stop wondering why people still bother with the Office suite. It has zero advantages aside from the cost of change (time spent on re-learning where things are), yet we teach it to kids in school and send them reminders that Microsoft benevolently gives them very steep discounts to reduce piracy (never mentioning that you could also just use this other free thing...), never mind the legislation saying governments and schools should use free software unless there is some strong and documented reason to divert from the rule.

BlueTemplar · on May 15, 2020

IIRC Excel will ask for the separator anyway when you open the file?

brewmarche · on May 15, 2020

It does not and actually assumes locale-dependent separators (usually semicolons if your locale uses a comma as a decimal separator and comma otherwise).

If you want to specify separators you have to go via Text Import.

hnick · on May 15, 2020

It does not, at least not anymore.

SanchoPanda · on May 15, 2020

To be fair, sometimes calling some giant java library or installing tons of dependencies is just not realistic. Use the library for the mission critical stuff or the automated reporting without a doubt, but it's important to be practical and thinking on your feet as well.

In the meantime for quick analysis and testing 90% + can be accomplished with one line of (g)awk.

  awk -v FPAT='"[^"]*"|[^,]*' -v OFS='\t' '{$1=$1; print $0}' "$filename"

More on FPAT [here](https://www.gnu.org/software/gawk/manual/gawk.html#Splitting...)

nicoburns · on May 15, 2020

You can also use xsv for this kind of analysis of CSV files. Which is a unix-style tool designed for CSVs and with a proper CSV parser.

It's written in Rust so it's one binary - no runtime dependencies, and will happily chunk through multi-gigabyte files.

https://github.com/BurntSushi/xsv

csunbird · on May 15, 2020

Reminds me of https://infiniteundo.com/post/25326999628/falsehoods-program...

tonyedgecombe · on May 15, 2020

It applies to everything, dates, floating point numbers, strings, etc. What we would consider the basics always turn out to be much deeper than they appear on the surface.

atulatul · on May 15, 2020

Dating advice: if you habitually, unironically, and without good cause use any dating format but that which is mandated by ISO 8601 we cannot be friends, let alone more.

https://twitter.com/jwiechers/status/1205515440543424513 + the thread

freefriedrice · on May 15, 2020

After working int he world of CAD tools for decades, I cannot tell you the number of times I've seen a younger engineer try to write a CSV parser only to fall into a circular hacking deathspiral.

You can't just use regex/split to handle CSV, unless you have significant field cleaning BEFORE converting to csv.

In reality you need lexical analysis and grammatical rules to parse any string of symbols. This is often always overlooked by naive implementations.

I take issue with OP's claim that RFC4180 is not well-defined, but almost all of the cases the OP listed are literally in the spec.