
Have you had to work with csv files from the wild much? I'm not being snarky but what you're talking about is night and day to what I've experienced over the years.

There aren't vast numbers of different JSON formats. There's practically one and realistically maybe two.

Headers are in each line, utf8 has never been an issue for me and quoting and escaping are well defined and obeyed.

This is because for datasets, almost exclusively, the file is machine written and rarely messed with.

CSV files have all kinds of separators and quote characters; some parsers accept multi-line fields and some don't; people sort files, which mostly works until there's a multi-line field. Then there are all kinds of line endings, encodings, and mixed encodings where people have combined files.

I tried using ASCII record separators after dealing with so many issues with commas, semicolons, pipes, tabs, etc., and still data in the wild had these jammed into random fields.

Lots of these things don't break when you hit the issue, either; the parsers happily churn on with garbage data, leading to further broken datasets.

Also they're broken for clients if the first character is a capital I.




WRT JSON:

> Headers are in each line

This might be my old “space and network cost savings” reflex kicking in, which is a lot less necessary these days, but that feels inefficient. It also means you don't know the whole schema until you have read the whole dataset (which might be multiple files), unless some form of external schema definition is provided.

Having said that, I accept that JSON has advantages over CSV, even if all that is done is translating a data-table into an array of objects representing one row each.

> utf8 has never been an issue for me

The main problem with UTF-8 isn't with CSV generally; much like the “first column is called ID” issue, it is usually due to Excel. Unfortunately a lot of people interact with CSVs primarily through Excel, so the format gets tarred with that brush by association. Unless Excel sees the BOM sequence at the start of a CSV file, which the Unicode standard recommends against for UTF-8, it assumes the characters are using the Win1252 encoding (almost, but not quite, ISO-8859-1).
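
If Excel is the unavoidable consumer, the usual workaround is to write the BOM anyway. A minimal sketch in Python (file name and rows are just examples):

    import csv

    # Write UTF-8 with a BOM ("utf-8-sig") so Excel detects the encoding
    # instead of falling back to Windows-1252.
    rows = [["name", "city"], ["Zoë", "Århus"]]
    with open("report.csv", "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)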

> Csv files have all kinds of separators

I've taken to calling them Character Separated Value files, rather than Comma, for this reason.


Yes, it's not great. Space is annoying, though compression pretty much removes that as a concern (zstd is good for this, you can even have a custom dictionary). And yes, missing keys is annoying.
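
For the zstd side, a minimal sketch with the third-party zstandard package (file names made up; a dictionary trained with zstd.train_dictionary can be passed via dict_data when the payloads are small and repetitive):

    import zstandard as zstd  # third-party "zstandard" package

    # Compress a JSONL file with zstd; for lots of small, similar payloads a
    # trained dictionary could be passed in via dict_data=.
    cctx = zstd.ZstdCompressor(level=10)
    with open("records.jsonl", "rb") as src, open("records.jsonl.zst", "wb") as dst:
        cctx.copy_stream(src, dst)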

JSONL is handy, JSON that's in the form {data: [...hundred megs of lines]} is annoying for various parsers.
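
Roughly why the shape matters (untested sketch; ijson is one example of a streaming parser):

    import json

    # JSONL: one record per line, so you only ever hold one record at a time.
    def read_jsonl(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

    # {"data": [...]}: json.load has to materialise the whole document before
    # you can touch the first row, unless you reach for a streaming parser
    # such as ijson.
    def read_wrapped(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)["data"]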

I'm quite a fan of parquet, but never expect to receive that from a client (alas).


> JSON that's in the form {data: [...hundred megs of lines]} is annoying for various parsers.

One reason this became common was a simple protection against json hijacking: https://haacked.com/archive/2009/06/25/json-hijacking.aspx/


Parquet should get the praise. It's simply awesome.

It's what I'd pick for tabular data exchange.

A recent problem I solved with it and DuckDB let me query and share a 3M record dataset. The size? 50MB. And my queries all ran sub-second. You just aren't going to get that sort of compression and queryability with a CSV.
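
The workflow is roughly the following sketch (file and column names invented):

    import duckdb

    # DuckDB queries the Parquet file in place, reading only the columns and
    # row groups it needs.
    duckdb.sql("""
        SELECT customer_id, count(*) AS orders
        FROM 'records.parquet'
        GROUP BY customer_id
        ORDER BY orders DESC
        LIMIT 10
    """).show()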


I wonder if CSV is the trivial format, so you have many people picking it because they want the easiest, and still getting it wrong. JSON is harder, so very few people are going to roll their own serializer/deserializer, and those who do are more likely to focus on getting it right (or at least catching the really obvious bugs).

I've dealt with incorrect CSVs numerous times, never with incorrect JSON; of the times I know what was happening on the other side, each time the CSV came from some in-house (or similar) implementation dumping SQL output (or similar) into a text file as an MVP. The JSON was always produced with some library.

If so, that's all the more reason to love CSV as it stands guard for JSON. If CSV didn't exist, we would instead have broken JSON implementations. (JSON and XML would likely then share a similar relationship.)


Sometimes people interpret the term too generically and actually implement a high degree of non-trivial, very idiosyncratic complexity, while still calling it "CSV".

One project I worked on involved a vendor promising to send us data dumps in "CSV format". When we finally received their "CSV" we had to figure out how to deal with (a) global fields being defined in special rows above the header row, and (b) a two-level hierarchy of semicolon-delimited values nested within comma-delimited columns. We had to write a custom parser to complete the import.


Hi,

Yes, we chose the ARFF format, which was idiosyncratic yet well-defined, back in the old data mining days.


Sure, I get your arguments and we're probably mostly in agreement, but in practice I see very few problems arising with using CSV.

I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.

I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.

And yet, they work.

The number of people uploading Excel files to be processed or downloading Excel files for local visualisation and processing would floor you. It's done multiple times a day, on multiple systems, in multiple companies.

And yet, they work.

I get your argument though - a JSON array of arrays can represent everything that CSV can, and is preferable to CSV, and is what I would choose when given the choice, but the issues with using that are not going to be fewer than issues with CSV using RFC 4180.


> but in practice I see very few problems arising with using CSV

That is not my experience at all. I've been processing CSV files from financial institutions for many years. The likelihood of brokenness must be around 40%. It's unbelievable.

The main reason for this is not necessarily the CSV format as such. I believe the reason is that it is often the least experienced developers who are tasked with writing export code. And many inexperienced developers seem to think that they can generate CSV without using a library because the format is supposedly so simple.

JSON is better but it doesn't help with things like getting dates right. XML can help with that but it has complexities that people get wrong all the time (such as entities), so I think JSON is the best compromise.


> And many inexperienced developers seem to think that they can generate CSV without using a library because the format is supposedly so simple.

Can't they?

    def excel_csv_of(rows):
      # Quote every field and double any embedded double quotes; always
      # quoting is never wrong, just slightly verbose.
      for row in rows:
        for i, field in enumerate(row):
          if i:
            yield ','
          yield '"'
          for c in field:
            yield '""' if c == '"' else c
          yield '"'
        yield '\n'
I haven't tested this, even to see if the code parses. What did I screw up?


> Can't they?

If my experience reflects a representative sample, then the answer is that most can, but a very significant minority fails at the job (under the given working conditions).

Whether or not _you_ can is a separate question. I don't see anything wrong with your code. It does of course assume that whatever is contained in rows is correct. It also assumes that the result is correctly written to a file without making any encoding mistakes or forgetting to flush the stream.

Not using name value pairs makes CSV more prone to mistakes such as incorrect ordering or number of values in some rows, a header row that doesn't correspond with the data rows, etc. Some export files are merged from multiple sources or go through many iterations over many years, which makes such mistakes far more likely.

I have also seen files that end abruptly somewhere in the middle. This isn't specific to CSV but it is specific to not using libraries and not using libraries appears to be more prevalent when people generate CSV.

You'd be surprised how many CSV files are out there where the developer guessed, wrongly, that a column would never have to be escaped. Maybe they were right initially and it didn't have to be escaped, but then years later something causes a change in number formats (internationalisation) and bang, silent data corruption.

Prioritising correctness and robustness over efficiency as you have done is the best choice in most situations. Using a well tested library is another option to get the same result.


This forces each field to be quoted, and it assumes that each row has the same fields in the same order. A library can handle the quoting issues and fields more reliably. Not sure why you went with a generator for this either.

Most people expect something like `12,,213,3` instead of the `"12","","213","3"` which yours might give.

https://en.wikipedia.org/wiki/Comma-separated_values#Basic_r...


Forcing each field to be quoted is always correct, isn't it? How could something be "more reliable" than something that is always correct?

With respect to "the same fields in the same order", no, although you may or may not feed the CSV to an application that has such an expectation. But if you apply it to data like [("Points",),(),("x","y"),("3","4"),("6","8","10")] it will successfully preserve that wonky structure in a file Excel can ingest reliably. (As reliably as Excel can ingest anything, anyway, since Excel has its own Norway problem.)

It's true that it's possible to produce more optimized output, but I didn't claim that the output was optimal, just correct.

Using generators is necessary to be able to correctly output individual fields that are many times larger than physical memory.
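
For example, something like this (untested, reusing the wonky rows from above):

    rows = [("Points",), (), ("x", "y"), ("3", "4"), ("6", "8", "10")]
    # Stream the generator straight to disk so no complete row (or field)
    # ever has to be held in memory at once.
    with open("export.csv", "w", encoding="utf-8", newline="") as f:
        for chunk in excel_csv_of(rows):
            f.write(chunk)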


I'll preface this that I think we are mostly in agreement, so that's the friendly tone of reply, part of this is just having flashbacks.

It's massively used, but the lack of adherence to a proper spec causes huge issues. If you have two systems that happen to talk properly to each other, great, but if you are, as I was, an entry point for all kinds of user-generated files, it's a nightmare.

CSV is the standard, sure, but it's easy to write code producing it that looks right at first glance yet breaks on some edge case. Or someone has just chosen a different separator or quote character, so you need to try to detect those before parsing (I had a list I'd go through, then look for the most commonly appearing non-letter character).
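
Roughly this kind of sketch (the candidate list is only an example; csv.Sniffer is the stdlib's take on the same idea):

    import csv
    from collections import Counter

    CANDIDATES = ",;\t|"

    def guess_delimiter(sample: str) -> str:
        # Try the stdlib sniffer first...
        try:
            return csv.Sniffer().sniff(sample, delimiters=CANDIDATES).delimiter
        except csv.Error:
            # ...then fall back to the most frequent candidate character,
            # i.e. the "most commonly appearing non-letter" heuristic.
            counts = Counter(c for c in sample if c in CANDIDATES)
            return counts.most_common(1)[0][0] if counts else ","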

The big problem is that the resulting semantically broken CSV files often look pretty OK both to someone scanning them and to permissive parsers. So one system reads a file in, splits it on lines, assumes missing columns are blank, and suddenly you have the wrong number of rows; then it exports the result. Worse if it's been sorted before the export.

Of course then there are also the issues around the lack of types, so numbers and strings are not automatically distinguishable, leading to breakage where you do want leading zeros. Again, often not identified until later. Or a system's auto type detection breaks because it sees a lot of number-like things and assumes it's a number column. Without types there's no verification either.
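
The leading-zero failure in one tiny, made-up example, with pandas standing in for any type-guessing reader:

    import io
    import pandas as pd

    raw = "account,amount\n00123,9.50\n04567,1.25\n"

    # Type inference turns the codes into integers and silently drops the zeros.
    print(pd.read_csv(io.StringIO(raw))["account"].tolist())
    # -> [123, 4567]

    # Only correct because the type was supplied out of band.
    print(pd.read_csv(io.StringIO(raw), dtype={"account": str})["account"].tolist())
    # -> ['00123', '04567']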

So even properly formatted CSV files need a second place for metadata about what types there are in the file.

JSON has some of these problems too, it lacks dates, but far fewer.

> but the issues with using that are not going to be fewer than issues with CSV using RFC 4180.

My only disagreement here is that I've had to deal with many ingest endpoints that don't properly support that.

Fundamentally I think nobody uses CSV files because they're a good format. They're big, slow to parse, lack proper typing, lack columnar reading, lack fast jumping to a particular place, etc.

They are ubiquitous, just not good, and they're very easy to screw up in hard to identify or fix ways.

Finally, lots of this comes up because RFC4180 is only from *2005*.

Oh, and if I'm reading the spec correctly, RFC4180 doesn't support UTF8. There was a proposed update maybe in 2022 but I can't see it being accepted as an RFC.


> I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.

And there are constant issues arising from that. You basically need a small team to deal with them in every institution that is processing them.

> I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.

Salesmen of enterprise systems do not care about the issues programmers and clients have. They care about what they can sell to other businessmen. That teams on both sides then waste time and money on troubleshooting is no concern to the salesman. And I am saying that as someone who worked on an enterprise system that consumed a lot of CSV. It does not work, and the process of handling the files sometimes literally involved phone calls to admins of other systems. More often than would be sane.

> The number of people uploading excel files to be processed or downloading excel files for local visualistation and processing would floor you.

That is perfectly fine as long as it is a manager downloading data so that he can manually analyze them. It is pretty horrible when those files are then uploaded to other systems.


In practice, I have never ever received CSV to process that complied with RFC 4180, and in most cases it was completely incoherent and needed incredibly specialised handling for all the various problems, like lack of escaping.

SAP has been by far the worst. I never managed to get data out of it that wasn't complete garbage requiring hand-crafted parsers.


SAP only has to be SAP- and MS Excel-compatible. The rest is not needed, so in their eyes it is probably to spec.


European quality™.


> And yet, they work.

Through a lot of often-painful manual intervention. I've seen it first-hand.

If an organization really needs something to work, it's going to work somehow—or the organization wouldn't be around any more—but that is a low bar.

In a past role, I switched some internal systems from using CSV/TSV to using Parquet and the difference was amazing both in performance and stability. But hey, the CSV version worked too! It just wasted a ton of people's time and attention. The Parquet version was far better operationally, even given the fact that you had to use parquet-tools instead of just opening files in a text editor.


> There aren't vast numbers of different JSON formats.

Independent variations I have seen:

* Trailing commas allowed or not
* Comments allowed or not
* Multiple kinds of date serialization conventions
* Divergent conventions about distinguishing floating point types from integers
* Duplicated key names tolerated or not
* Different string escaping policies, such as, but not limited to, "\n" vs "\u000a"

There are bazillions of JSON variations.


> Trailing commas allowed or not

The JSON spec does not allow trailing commas, although there are JSON supersets that do.

> Comments allowed or not

The JSON spec does not allow comments, although there are JSON supersets that do.

> Multiple kinds of date serialization conventions

The JSON spec doesn't say anything about dates. That is dependent on your application schema.

> Divergent conventions about distinguishing floating point types from integers

This is largely due to the divergent ways different programming languages handle numbers. I won't say JSON handles this the best, but any file format used across multiple languages will run into problems with differences in how numbers are represented. At least there is a well-defined difference between a number and a string, unlike CSV.

> Duplicated key names tolerated or not

According to the spec, they are tolerated, although the semantics of such keys are implementation-defined.

> Different string escaping policies, such as, but not limited to, "\n" vs "\u000a"

Both of those are interpreted as the same thing, at least according to the spec. That is an implementation detail of the serializer, not a different language.
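
A quick check with Python's json module shows the two spellings are the same string once parsed:

    import json

    # Both spellings decode to the same one-character string; the choice
    # between them is a property of the serializer, not of the format.
    assert json.loads(r'"\n"') == json.loads(r'"\u000a"') == "\n"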


And CSV parsers and serializers compliant with RFC 4180 are similarly reliable.


But many, perhaps most, parsers and serializers for CSV are not compliant with RFC 4180.

RFC 4180 is not an official standard. The text of the RFC itself states:

> This memo provides information for the Internet community. It does not specify an Internet standard of any kind.

CSVs existed long before that RFC was written, and it is more a description of CSVs that are somewhat portable, not a definitive specification.


That RFC doesn't even support utf8.

It is, and acknowledges that it is, codifying best practices rather than defining an authoritative standard.


There are always many variations, but in comparison to CSV I've run into almost none of them. JSON issues were rare, but with CSV it was common to have a brand new issue per client.

Typically the big difference is that some parsers are less tolerant of in-spec values. ClickHouse had a more restrictive parser, and recently I've dealt with Matrix.

Maybe I've been lucky for json and unlucky for csv.


What's the problem with capital I?


https://superuser.com/questions/210027/why-does-excel-think-... says it's not capital I but “ID”.

Basically, Excel uses the equivalent of ‘file’ (https://man7.org/linux/man-pages/man1/file.1.html), sees the magic “ID”, and decides it's a SYLK file, even though .csv files starting with “ID” have outnumbered .SYLK files by millions for decades.


Thanks. So I guess the easy compatible solution is to always quote the first item on the first line when writing CSV. Good to know. (Checking if the item starts with ID is more work. Possibly quote all items on the first line for simplicity.) (Reading SYLK is obviously irrelevant, so accepting unquoted ID when reading is the smarter way to go and will actually improve compatibility with writers that are not Excel. Also it takes no work.)
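
A sketch of that workaround with the stdlib csv module (names made up): force-quote the header row and leave the data rows alone.

    import csv

    def write_csv(path, header, rows):
        with open(path, "w", encoding="utf-8", newline="") as f:
            # Force-quote the header so a leading "ID" can't look like SYLK.
            csv.writer(f, quoting=csv.QUOTE_ALL).writerow(header)
            csv.writer(f).writerows(rows)

    write_csv("export.csv", ["ID", "Name"], [[1, "Ada"], [2, "Grace"]])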


The first bytes (“ID”) match the magic of an odd file format, SYLK. For years Excel has decided (or did, if they've finally fixed it) that this was enough to assume the file (called .csv) cannot possibly be CSV but must actually be SYLK. It then parses it as such, and is shocked to find your SYLK file is totally broken!


It sounds to me like, as is often the case, the problem here is Excel, not CSV.


Yes, but in practice CSV is defined by what Excel does.

There is no standard to which Excel conforms, since Excel predates the standards, and there would be an outcry if Excel started rejecting files that had worked for years.


There is a common misconception here. You can import CSV files into an Excel sheet. You cannot open a CSV file with Excel; that is a nonsense operation.


Excel does not ask the user whether they want to import the file; it just tells the user their file is broken.


Clients don't particularly make the distinction, and in a way nor should they - they can't open your file.


Probably referring to the “Turkish i” problem.


Not an unreasonable guess, but it turned out to be something different.



