Only Microsoft can fix this; no enhanced CSV proposal will matter until it’s baked into Excel. TSV is marginally better, but not enough.
CSV is the universal file format, and will be forever; nothing else is ever as readable and as portable. You can’t easily get JSON out of an Excel sheet, or easily read it in any text editor.
CSV needs just a few tweaks to make it easier to avoid comma, double-quote, and newline issues that suck a bazillion dollars of productivity out of the economy each year with broken data imports.
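To be fair, the escaping rules themselves are a solved problem: RFC 4180 quoting covers all three cases, and most standard libraries already apply it. A minimal sketch in Python (stdlib csv module, nothing exotic):

    import csv
    import io

    # Python's stdlib writer applies RFC 4180 quoting: fields containing the
    # delimiter, a double quote, or a newline get wrapped, and quotes doubled.
    buf = io.StringIO()
    csv.writer(buf).writerow(["plain", "has,comma", 'has"quote', "has\nnewline"])
    print(buf.getvalue())
    # plain,"has,comma","has""quote","has
    # newline"

The breakage comes from ad-hoc writers and readers that skip these rules, which is exactly why fixing it at the Excel level would matter.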
This is probably the lowest hanging fruit in all of technology: a small change to Excel to dramatically improve data processing / import-export for all humanity.
I agree. Excel being so dominant in its market segments makes data interop harder for no reason other than that it uses proprietary default formats and doesn't quite support CSV in a conformant way.
Not quite related, but I once saw a proposal that file formats should be regulated to require documented interoperability. The argument in a nutshell: this would foster a ton of competition and evolve standards as an industry, to the benefit of all.
If XLSX were documented for interop, it would also go a long way toward data independence.
Instead of looking at it in terms of the distinct number of different software products, instead look at the number of physical "integration operations" that happen every day.
A good portion of those (whether it is "the majority", who can say) are actually between Excel and itself, e.g. two people in a company sending files to each other.
This is a multi-agent coordination problem. If Excel (the dominant market participant) makes a change because it is helpful to their own users anyway, that alters the incentive calculus for the other software products too. Suddenly there is a large enough market / population of users / number of integration events where making an investment in the enhancement makes sense.
I agree it makes a lot of sense, and to help with the "marketing" they could even give it a slightly distinct but easy name to indicate it's a CSV flavour. Having something repeatable verbally makes it easier in conversations: "I've sent you an Earthy CSV." "Oh, you can use X, it supports Earthy."
The worst thing by far in Excel's implementation of CSV is that the separator is actually a semicolon when the computer's locale uses a comma as the decimal separator in numbers.
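When you have to consume those files, one defensive workaround is to sniff the delimiter instead of assuming it. A sketch using Python's csv.Sniffer (assumes the first kilobyte of the file is representative; Sniffer raises csv.Error when it can't decide):

    import csv

    def open_locale_csv(path):
        # Guess whether this "CSV" uses comma or semicolon before parsing.
        # utf-8-sig tolerates the BOM that Excel likes to prepend.
        with open(path, newline="", encoding="utf-8-sig") as f:
            sample = f.read(1024)
            f.seek(0)
            dialect = csv.Sniffer().sniff(sample, delimiters=",;")
            return list(csv.reader(f, dialect))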
This is definitely the Right Thing, but how well did other software, such as text editors, handle or render the field and record separator characters? That you can’t directly enter these characters on a keyboard would be a minor annoyance.
The file I used was being generated by a partner company's ERP system, so I didn't really have to make any files myself. Having said that, I use Notepad++ a lot on Windows, and the ALT-029, ALT-028, etc. syntax generated the right codes for direct keyboard entry. They're displayed by Notepad++ as inverse text "RS", "GS", etc.
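For anyone who wants to try it: those are the ASCII control codes FS/GS/RS/US (decimal 28-31). A minimal sketch of parsing such a file in Python, assuming RS between records and US between fields (a producer may have picked the separators differently):

    # Assumed convention: RS (0x1E) between records, US (0x1F) between fields.
    RS, US = "\x1e", "\x1f"

    def parse_ascii_delimited(text):
        # Strip a trailing/leading record separator, then split twice.
        return [record.split(US) for record in text.strip(RS).split(RS)]

    rows = parse_ascii_delimited("Name\x1fQty\x1eWidget\x1f3\x1e")
    # [['Name', 'Qty'], ['Widget', '3']]

The upside is that these characters essentially never occur in real field data, so no quoting or escaping is needed at all.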
I truly don't understand how and why people don't do this. Most people use pretty descriptive names now that every editor has autocomplete and big monitors are common. And yet, people never put the damn units in the identifier name. Would save so many bugs, and it would give engineers one less thing to juggle in their working memory.
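A tiny illustration (hypothetical names, obviously):

    # Illustrative only: the unit lives in the identifier, so a mismatch is
    # visible at the call site instead of hiding in a comment or a spec.
    def schedule_retry(delay_ms: int) -> None:
        ...

    timeout_s = 30
    schedule_retry(timeout_s)         # reads as suspicious: seconds into a ms slot
    schedule_retry(timeout_s * 1000)  # conversion is explicit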
I’d settle for Excel either not automatically reformatting dates when opening a CSV, or having the decency to put them back the way they were when saving it. Silently turning 2016-01-02 into 01/02/2016 should be considered a hate crime.
Column headers are a known issue in research data management.[1] Research projects tend to generate large amounts of data from instruments. The raw data is often archived.
Years or decades later, someone may find something important in that expensively collected data.
Well, sending data between machines isn't the problem. Presumably the reason for choosing CSV in the first place is to make it editable by humans in a basic text editor. People get into trouble when they forget that, or try to use it for other use cases instead.
Yeah, I can’t think of a reason why I’d really want to hand edit a CSV file.
I think the rationale behind using CSV is even more basic. The reason why CSV is so popular (and TSV, HTML and JSON for that matter), is that it is a human readable format. It is much easier to troubleshoot and debug data and data flows if you can open a file and see it in an editor. Especially when trying to move data between systems/people/groups. You don’t really need to edit the file, just read and mentally parse it if there is a problem.
This is why the most popular formats are text based. It’s only when the verbosity of text is too great that we move to binary, such as for images and media (xpm excepted). For most other data, compression tends to be “good enough”.
Is there an open source library for parsing CSVs that has matured to cover 99.9% of edge cases, errors, etc.? We've had decades to build one. Otherwise, many projects must be recreating this wheel.
There's plenty that can perfectly parse RFC 4180 since it's a very simple format. That's not, however, what people want in terms of parsing "CSV" files. They want to parse a file that contains data in some type of delimited format into some type of arbitrary format they envision. The delimiter may not even be a comma.
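To be concrete, Python's standard csv module is one of those conformant parsers: quoted commas and embedded newlines are a non-issue for it. The ambiguity the parent is asking about lives one level up, at delimiters and target schemas:

    import csv
    import io

    # RFC 4180-style input: a quoted comma and a field spanning two lines.
    data = 'id,note\n1,"contains, a comma"\n2,"spans\ntwo lines"\n'
    rows = list(csv.reader(io.StringIO(data)))
    # [['id', 'note'], ['1', 'contains, a comma'], ['2', 'spans\ntwo lines']]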
Right, that's the point of a good library: to handle the ambiguities via insightful coding solutions accumulated over decades, enough of them to make the library worth using. Other ambiguities can be handled with Q&A that results in automated solutions: "the text looks like either X or Y; which is it?", then disambiguating the text based on the answer.
Another benefit would be that, if the library became useful and widely used enough, it would become an unofficial standard: people would make CSV's compatible with the library and its output would become a norm.
What I really want is a tabular format that enables formatting to be interpolated between sections of data, with a clean separation of structured data and formatted human-readable content. It would need a new editor of course, but I reckon it could be popular if that editor was good.
Particularly in a business context, I find that spreadsheets often need a bunch of supporting data that exists outside of the main tabular content: headings, introduction paragraphs, legends, etc.
I would like a way to visually combine that with the tabular data while keeping the two distinct in the underlying format, such that:
- Constraints (e.g. type or format constraints) could be applied to the tabular data
- The non-tabular data is free-form word processor style
It would also be useful to be able to combine multiple tables on the same visual page.
> combine multiple tables on the same visual page
In fact, this is why I asked: I'm actively working on a program which allows just this. In theory, it should be sufficiently flexible to allow for the kind of markup you mention… one more thing to look into!
There's a simpler answer: if you need something more sophisticated than CSV, then don't use CSV.
This attitude in young programmers kills me. It's the old "let's make one tool that does everything, and I (re)invented it!" attitude they get after a few years on the job.
That's because of the Cambrian explosion of what being a "programmer" means today; there's no time to learn all the history anymore. At the start of the PC era it was possible to understand every microprocessor architecture on the market and how to program them, and programming languages for PC applications were sparse (assembly was pretty much the only viable choice, although some profitable applications were written in Pascal). I may be off base here, but I think 40 years ago the majority of programmers knew 80-90% of the history, and you just can't do that today unless your career is "historian".
My $0.02: For something as deeply embedded and controversial as CSV, this won't work. I say this because I've been watching the battle since the 1990s. CSV has a lot of problems, but it is only an issue if you try to make it do too much. For a current example, look at JSON. The original BNF grammar fit on the back of a business card. It has major issues too. Yet it hasn't successfully been replaced or augmented without creating even worse problems, like compatibility. Putting lipstick on a pig, like this suggestion for CSV, just creates incompatibility and even more headaches.
I strongly disagree, based on decades of experience scripting in Linux and needing to manipulate tabular data. Maybe you just don't know a lot of programmers. CSV is _extremely_ useful, if you know when and how to apply it and don't push it out of its wheelhouse.
What a completely odd thing to say on a site like this, but alrighty.
Anyway, I'll be clear: I run a business that deals with CSVs and their oddities. I am acutely aware of how well they can work. This has nothing to do with the fact that most programmers I've met, given the choice, would prefer literally any other data format.
I'm suggesting that maybe you're making an assumption about what most programmers you know think, on a topic you probably never specifically polled every one of them on, and that whatever numbers went into "most", there are at least 2 to add to the other column.
You are expressing your view, I am expressing mine, that is all.
Right. And that other team should not use CSV. Adding multiple column headers isn't any more of a solution because it also requires that other team to fix their shit.
In other words it's a classic multi-agent coordination problem (aka network effects), and saying "why doesn't one side unilaterally start acting differently" is a well answered question by game theory.
    Name,string;Qty,units;Weight,kg;Volume,m^3;Just for the sake of it with spaces,string;No don't use comma in the header names - it doesn't make sense,string;
Why not do it the other way? Name;string,Qty;units so that if you just double click to open in excel, or use a naive reader that doesn't understand the format (which most won't), the column names are basically correct?
I think it has more to do with Excel properly recognizing the data type rather than humans doing so. It has been too long since I dealt with it to recall the troubles but I know there are things that should be easy that aren't since you can't tell Excel how to interpret the data.
In the fictional world where Excel understands the second header row, it could just as easily understand a unit delimiter and units in a single header row.
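A sketch of what consuming the "Name;string,Qty;units" style header could look like, for a reader that does understand the convention (the ";" sub-delimiter is this thread's invention, not any standard):

    import csv
    import io

    data = "Name;string,Qty;units,Weight;kg\nWidget,3,1.25\n"
    reader = csv.reader(io.StringIO(data))
    # Each header cell is "name;unit"; a naive reader still sees usable names.
    header = [cell.split(";", 1) for cell in next(reader)]
    names = [h[0] for h in header]                      # ['Name', 'Qty', 'Weight']
    units = {h[0]: h[1] for h in header if len(h) > 1}  # {'Name': 'string', ...}
    rows = list(reader)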