“The CSV file format is not fully standardized. Separating fields with commas is the foundation, but commas in the data or embedded line breaks have to be handled specially. Some implementations disallow such content while others surround the field with quotation marks, which yet again creates the need for escaping if quotation marks are present in the data.
The term "CSV" also denotes several closely-related delimiter-separated formats that use other field delimiters such as semicolons.[2] These include tab-separated values and space-separated values. A delimiter guaranteed not to be part of the data greatly simplifies parsing.
Alternative delimiter-separated files are often given a ".csv" extension despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange. Many applications that accept CSV files have options to select the delimiter character and the quotation character. Semicolons are often used instead of commas in many European locales in order to use the comma as the decimal separator and, possibly, the period as a decimal grouping character.”
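For illustration, Python's standard csv module exposes exactly the kind of delimiter and quote-character options the quoted text mentions. A minimal sketch with made-up sample data, reading the same record once comma-delimited and once semicolon-delimited:

```python
import csv
import io

# Hypothetical sample data: the same record, once comma-delimited and once
# semicolon-delimited (as many European-locale Excel exports are).
comma_data = 'name,height\n"Smith, John",1.82\n'
semicolon_data = 'name;height\n"Smith, John";1,82\n'

# The delimiter and quote character are configurable per "dialect".
print(list(csv.reader(io.StringIO(comma_data), delimiter=',', quotechar='"')))
# [['name', 'height'], ['Smith, John', '1.82']]

print(list(csv.reader(io.StringIO(semicolon_data), delimiter=';', quotechar='"')))
# [['name', 'height'], ['Smith, John', '1,82']]
```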
I think it's valid to argue that some of commas, quotes, and newlines shouldn't be allowed inside fields at all. The same goes for comma versus semicolon.
But that doesn't extend to using backslash escapes in something that's legitimately trying to be CSV. That's someone getting confused and implementing a mix of data formats, or trying to be clever and making an extended CSV format.
It’s valid to argue that, but that means you can’t use CSV for many real-world data sets.
That, in turn, means you almost cannot use CSV in any robust solution. Even if, today, your input doesn’t have commas, quotes or newlines, can you guarantee it won’t tomorrow, next year, etc?
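For what it's worth, a quoting-aware writer/reader pair handles commas, quotes and newlines in the data without any such restriction. A minimal sketch with Python's standard csv module (the field values are invented):

```python
import csv
import io

# A field containing a comma, a double quote, and an embedded newline.
awkward = ['id-1', 'He said "hi, there"\nand left', '42']

buf = io.StringIO()
csv.writer(buf).writerow(awkward)   # quotes the awkward field, doubles the quote
print(repr(buf.getvalue()))
# 'id-1,"He said ""hi, there""\nand left",42\r\n'

buf.seek(0)
assert next(csv.reader(buf)) == awkward   # the data round-trips intact
```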
Unless it's a CSV file exported from a Nordic-locale Excel, in which case your CSV exports will use semicolons as column separators and commas as decimal points. And yes, the filename will still end with ".csv".
So the following Excel export I just did will parse perfectly fine with your CSV parser but give you completely the wrong thing:
That has nothing to do with the Nordics, but with the decimal separator. In locales that use a comma as the decimal separator (i.e., most European locales), Excel uses a semicolon as CSV separator.
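A sketch of what consuming such an export can look like; the sample content is invented, not the parent's actual file, and the decimal handling is one plausible approach rather than anything Excel-specific:

```python
import csv
import io

# Hypothetical European-locale export: semicolon-separated,
# comma as the decimal separator, still named something.csv.
export = 'item;price\nwidget;3,50\ngadget;12,99\n'

rows = list(csv.reader(io.StringIO(export), delimiter=';'))
header, data = rows[0], rows[1:]

# Decimal commas have to be converted explicitly before float() accepts them.
prices = [float(price.replace(',', '.')) for _, price in data]
print(prices)   # [3.5, 12.99]

# Parsing the same file with a comma delimiter "succeeds" but is nonsense:
wrong = list(csv.reader(io.StringIO(export)))
print(wrong[1])   # ['widget;3', '50'] -- the decimal comma became a field break
```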
I thought it strange too. I saw what the issue was, and just "fixed" it by correcting the data in the CSV. For the lulz, I guess I could have played with the parser's options for deciding what needs to be escaped. However, the data would still have been incorrect, as the '\' is definitely not part of the desired content, so ultimately it was better to correct the input. I would kind of rather the import die than have the potential footgun of '\' in a field for later sabotage.
If I’m not mistaken, RFC 4180 says that quotes should be escaped by prepending them with another quote, so “” and not \” (these are not double quotes but my phone won’t let me type normal quotes), but yeah, I guess it is a rather perverse value to put in a CSV.
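For what it's worth, Python's csv module follows that RFC 4180 convention by default (it doubles the quote) and only produces backslash escapes if you explicitly opt out of doubling. A small sketch with an invented field value:

```python
import csv
import io

field = 'a "quoted" word'

# Default dialect: RFC 4180 style, the quote character is doubled.
buf = io.StringIO()
csv.writer(buf).writerow([field])
print(buf.getvalue())   # "a ""quoted"" word"  (plus the \r\n row terminator)

# Backslash escaping only happens if you turn doubling off and set an escapechar.
buf = io.StringIO()
csv.writer(buf, doublequote=False, escapechar='\\').writerow([field])
print(buf.getvalue())   # "a \"quoted\" word"
```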
Some planning, some building out new stuff (usually clean work), some repairing active geysers of partially processed data that's getting all over the place fouling up the works.
I have. All I'm saying is this, plus someone who left 10 years ago who couldn't, and shouldn't have, written a CSV parser using regular expressions. Input row: