Hacker News new | past | comments | ask | show | jobs | submit login

CSV you say? Let's hope Bobby Droptable's cousin Bobby Double Quote Comma-Semicolon don't get in it...



1.5 million rows you say? Better hope nobody opens it in Excel and overwrites it…


Hopefully the first column isn't ID either[1]

[1]: https://alunr.com/excel-csv-import-returns-an-sylk-file-form...



Software is hard...


I just had one today that had the last char of a field as '\', which escaped the closing quote and then munged the column count from there.


Isn't that your parser's problem? I don't think true CSV has any special characters other than comma, quote, and CRLF.


There is no “true CSV”. https://en.wikipedia.org/wiki/Comma-separated_values:

“The CSV file format is not fully standardized. Separating fields with commas is the foundation, but commas in the data or embedded line breaks have to be handled specially. Some implementations disallow such content while others surround the field with quotation marks, which yet again creates the need for escaping if quotation marks are present in the data.

The term "CSV" also denotes several closely-related delimiter-separated formats that use other field delimiters such as semicolons.[2] These include tab-separated values and space-separated values. A delimiter guaranteed not to be part of the data greatly simplifies parsing.

Alternative delimiter-separated files are often given a ".csv" extension despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange. Many applications that accept CSV files have options to select the delimiter character and the quotation character. Semicolons are often used instead of commas in many European locales in order to use the comma as the decimal separator and, possibly, the period as a decimal grouping character.”

https://en.wikipedia.org/wiki/Comma-separated_values#Standar... mentions a few standards for csv, one of which is the MIME type text/csv, standardized in RFC 4180.


I think it's valid to argue that you shouldn't be able to put some of commas, quotes, and newlines inside fields at all. And comma versus semicolon.

But that doesn't extend to using backslash escapes in something that's legitimately trying to be CSV. That's someone getting confused and implementing a mix of data formats, or trying to be clever and making an extended CSV format.


It’s valid to argue that, but that means you can’t use CSV for many real-world data sets.

That, in turn, means you almost cannot use CSV in any robust solution. Even if, today, your input doesn’t have commas, quotes or newlines, can you guarantee it won’t tomorrow, next year, etc?


> Even if, today, your input doesn’t have commas, quotes or newlines, can you guarantee it won’t tomorrow, next year, etc?

But... but those are the ones I listed as real special characters, unlike backslash. I don't understand the question.


Yes? What do I not know that makes this question harder than it seems to me?


Every field is quoted. Every field is quote-escaped. That's all you need.


There are simpler and better approaches. Fortunately for us all, IP datagrams are not JSON-encoded, for example.


Unless it's a CSV file exported from a Nordic locale Excel, in which case your CSV exports will use semi-colon as column separators and commas as decimal points. And yes the filename will still end with ".csv"

So the following Excel export I just did will parse perfect fine with your CSV parser but give you completely the wrong thing:

  1;2,3;3,3
  2;7,3;9,3
  3;4,5;7,5


That has nothing to do with the Nordics, but with the decimal separator. In locales that use a comma as the decimal separator (i.e., most European locales), Excel uses a semicolon as CSV separator.


I thought it strange too. I saw what the issue was, and just "fixed" it be correcting the data in the CSV. For the lulz, I guess I could have played with the parser's options on deciding what is needed to be escaped. However, the data would have still been incorrect as the '\' is definitely not part of the desired content, so ultimately it was better to correct the input. i would kind of rather the import die than having the potential foot gun of '\' in a field for later sabotage.


How do I name my child CRLF?


earn Musk money, you can name your child WTF you want


i think you failed to parse the technical joke


no, i totally got the line ending joke. i'm not an ID10T


If I’m not mistaken RFC 4180 says that quotes should be escaped by prepending them with another quote, so “” and not \” (these are not double quotes but my phone won’t let me type normal quotes), but yeah I guess it is a rather perverse value to put in a csv.


Note that virtually no one implements that RFC. CSV is merely a style of file, not a true file format.


That's how Excel does it though


Oh don’t. I spent several hours last week unfucking one of those.


Thank you, I'm considering adopting this as job title, like "senior unfucker".


Intermediate Fucker reporting for duty.


Mine is principal dung roller.


I’ve long described myself as half janitor, half firefighter.


Data Plumber.

Some planning, some building out new stuff (usually clean work), some repairing active geysers of partially processed data that's getting all over the place fouling up the works.


I sometimes fill in my name as [object Object].


Do tell! If you’ve recovered ;)


I have. All I'm saying is this plus someone who left 10 years ago who can't and shouldn't have written a CSV parser using regular expressions. Input row:

  A,B,Alice "Mallory" Bob,123




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: