Only Microsoft can fix this; no enhanced CSV proposal will matter until it’s baked into Excel. TSV is marginally better, but not enough.
CSV is the universal file format, and will be forever; nothing else is ever as readable and as portable. You can’t easily get JSON out of an Excel sheet, or easily read it in any text editor.
CSV needs just a few tweaks to make it easier to avoid comma, double-quote, and newline issues that suck a bazillion dollars of productivity out of the economy each year with broken data imports.
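To be fair, the escaping rules themselves are a solved problem: RFC 4180 quoting covers all three cases, and most standard libraries already apply it. A minimal sketch in Python (stdlib csv module, nothing exotic):

    import csv
    import io

    # Python's stdlib writer applies RFC 4180 quoting: fields containing the
    # delimiter, a double quote, or a newline get wrapped, and quotes doubled.
    buf = io.StringIO()
    csv.writer(buf).writerow(["plain", "has,comma", 'has"quote', "has\nnewline"])
    print(buf.getvalue())
    # plain,"has,comma","has""quote","has
    # newline"

The breakage comes from ad-hoc writers and readers that skip these rules, which is exactly why fixing it at the Excel level would matter.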
This is probably the lowest hanging fruit in all of technology: a small change to Excel to dramatically improve data processing / import-export for all humanity.
I agree. Excel being so dominant in its market segments makes data interop harder for no reason other than that it uses proprietary default formats and doesn't quite support CSV in a conformant way.
Not quite related, but I once saw a proposal that file formats should be regulated to require documented interoperability. The argument in a nutshell: this would foster a ton of competition and evolve standards as an industry, to the benefit of all.
If XLSX were documented for interop, it would also go a long way toward data independence.
Instead of looking at it in terms of the distinct number of different software products, instead look at the number of physical "integration operations" that happen every day.
A good portion of those (whether it is "the majority", who can say) are actually between Excel and itself, e.g. two people in a company sending files to each other.
This is a multi-agent coordination problem. If Excel (the dominant market participant) makes a change because it is helpful to their own users anyway, that alters the incentive calculus for the other software products too. Suddenly there is a large enough market / population of users / number of integration events where making an investment in the enhancement makes sense.
I agree it makes a lot of sense, and to help with the "marketing" they could even give it a slightly distinct but easy name to indicate it's a CSV flavour. Having something repeatable verbally makes it easier in conversations: "I've sent you an Earthy CSV." "Oh, you can use X, it supports Earthy."
The worst thing by far in Excel's implementation of CSV is that the separator is actually a semicolon when the computer's locale uses a comma as the decimal separator in numbers.
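When you have to consume those files, one defensive workaround is to sniff the delimiter instead of assuming it. A sketch using Python's csv.Sniffer (assumes the first kilobyte of the file is representative; Sniffer raises csv.Error when it can't decide):

    import csv

    def open_locale_csv(path):
        # Guess whether this "CSV" uses comma or semicolon before parsing.
        # utf-8-sig tolerates the BOM that Excel likes to prepend.
        with open(path, newline="", encoding="utf-8-sig") as f:
            sample = f.read(1024)
            f.seek(0)
            dialect = csv.Sniffer().sniff(sample, delimiters=",;")
            return list(csv.reader(f, dialect))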
This is definitely the Right Thing, but how well did other software, such as text editors, handle or render the field and record separator characters? That you can’t directly enter these characters on a keyboard would be a minor annoyance.
The file I used was being generated by a partner company's ERP system, so I didn't really have to make any files myself. Having said that, I use Notepad++ a lot on Windows, and the ALT-029, ALT-028, etc. syntax generated the right codes for direct keyboard entry. They're displayed by Notepad++ as inverse text "RS", "GS", etc.
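For anyone who wants to try it: those are the ASCII control codes FS/GS/RS/US (decimal 28-31). A minimal sketch of parsing such a file in Python, assuming RS between records and US between fields (a producer may have picked the separators differently):

    # Assumed convention: RS (0x1E) between records, US (0x1F) between fields.
    RS, US = "\x1e", "\x1f"

    def parse_ascii_delimited(text):
        # Strip a trailing/leading record separator, then split twice.
        return [record.split(US) for record in text.strip(RS).split(RS)]

    rows = parse_ascii_delimited("Name\x1fQty\x1eWidget\x1f3\x1e")
    # [['Name', 'Qty'], ['Widget', '3']]

The upside is that these characters essentially never occur in real field data, so no quoting or escaping is needed at all.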
I truly don't understand how and why people don't do this. Most people use pretty descriptive names now that every editor has autocomplete and big monitors are common. And yet, people never put the damn units in the identifier name. Would save so many bugs, and it would give engineers one less thing to juggle in their working memory.
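A tiny illustration (hypothetical names, obviously):

    # Illustrative only: the unit lives in the identifier, so a mismatch is
    # visible at the call site instead of hiding in a comment or a spec.
    def schedule_retry(delay_ms: int) -> None:
        ...

    timeout_s = 30
    schedule_retry(timeout_s)         # reads as suspicious: seconds into a ms slot
    schedule_retry(timeout_s * 1000)  # conversion is explicit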
I’d settle for Excel either not automatically reformatting dates when opening a CSV, or having the decency to put them back the way they were when saving it. Silently turning 2016-01-02 into 01/02/2016 should be considered a hate crime.
Column headers are a known issue in research data management.[1] Research projects tend to generate large amounts of data from instruments. The raw data is often archived.
Years or decades later, someone may find something important in that expensively collected data.
Well, sending data between machines isn't the problem. Presumably the reason for choosing CSV in the first place is to make it editable by humans in a basic text editor. People get into trouble when they forget that, or try to use it for other use cases instead.
Yeah, I can’t think of a reason why I’d really want to hand edit a CSV file.
I think the rationale behind using CSV is even more basic. The reason why CSV is so popular (and TSV, HTML and JSON for that matter), is that it is a human readable format. It is much easier to troubleshoot and debug data and data flows if you can open a file and see it in an editor. Especially when trying to move data between systems/people/groups. You don’t really need to edit the file, just read and mentally parse it if there is a problem.
This is why the most popular formats are text based. It’s only when the verbosity of text is too great that we move to binary, such as for images and media (xpm excepted). For most other data, compression tends to be “good enough”.
Is there an open source library for parsing CSVs that has matured to cover 99.9% of edge cases, errors, etc.? We've had decades to build one. Otherwise, many projects must be recreating this wheel.
There's plenty that can perfectly parse RFC 4180 since it's a very simple format. That's not, however, what people want in terms of parsing "CSV" files. They want to parse a file that contains data in some type of delimited format into some type of arbitrary format they envision. The delimiter may not even be a comma.
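To be concrete, Python's standard csv module is one of those conformant parsers: quoted commas and embedded newlines are a non-issue for it. The ambiguity the parent is asking about lives one level up, at delimiters and target schemas:

    import csv
    import io

    # RFC 4180-style input: a quoted comma and a field spanning two lines.
    data = 'id,note\n1,"contains, a comma"\n2,"spans\ntwo lines"\n'
    rows = list(csv.reader(io.StringIO(data)))
    # [['id', 'note'], ['1', 'contains, a comma'], ['2', 'spans\ntwo lines']]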
Right, that's the point of a good library: to handle the ambiguities via insightful coding solutions accumulated over decades, enough of them to make the library worth using. Other ambiguities can be handled with Q&A that results in automated solutions: "the text looks like either X or Y; which is it?", then disambiguating the text based on the answer.
Another benefit would be that, if the library became useful and widely used enough, it would become an unofficial standard: people would make CSV's compatible with the library and its output would become a norm.
What I really want is a tabular format that enables formatting to be interpolated between sections of data, with a clean separation of structured data and formatted human-readable content. It would need a new editor of course, but I reckon it could be popular if that editor was good.
Particularly in a business context, I find that spreadsheets often need a bunch of supporting data that exists outside of the main tabular content: headings, introduction paragraphs, legends, etc.
I would like a way to visually combine that with the tabular data while keeping the two distinct in the underlying format, such that:
- Constraints (e.g. type or format constraints) could be applied to the tabular data
- The non-tabular data is free-form word processor style
It would also be useful to be able to combine multiple tables on the same visual page.
> combine multiple tables on the same visual page
In fact, this is why I asked: I'm actively working on a program which allows just this. In theory, it should be sufficiently flexible to allow for the kind of markup you mention… one more thing to look into!
There's a simpler answer: if you need something more sophisticated than CSV, then don't use CSV.
This attitude in young programmers kills me. It's the old "let's make one tool that does everything, and I (re)invented it!" attitude they get after a few years on the job.
That's because of the Cambrian explosion of what being a "programmer" means today; there's no time to learn all the history anymore. At the start of the PC era it was possible to understand every microprocessor architecture on the market and how to program them, and programming languages for PC applications were sparse (assembly was pretty much the only viable choice, although some profitable applications were written in Pascal). I may be off base here, but I think 40 years ago the majority of programmers knew 80-90% of the history, and you just can't do that today unless your career is "historian".
My $0.02: For something as deeply embedded and controversial as CSV, this won't work. I say this because I've been watching the battle since the 1990s. CSV has a lot of problems, but it is only an issue if you try to make it do too much. For a current example, look at JSON. The original BNF grammar fit on the back of a business card. It has major issues too. Yet it hasn't successfully been replaced or augmented without creating even worse problems, like compatibility. Putting lipstick on a pig, like this suggestion for CSV, just creates incompatibility and even more headaches.
I strongly disagree, based on decades of experience scripting in Linux and needing to manipulate tabular data. Maybe you just don't know a lot of programmers. CSV is _extremely_ useful, if you know when and how to apply it and don't push it out of its wheelhouse.
What a completely odd thing to say on a site like this, but alrighty.
Anyway, I'll be clear: I run a business that deals with CSVs and their oddities. I am acutely aware of how well they can work. This has nothing to do with the fact that most programmers I've met, given the choice, would prefer literally any other data format.
I'm suggesting that maybe you're making an assumption about what most programmers you know think, on a topic you probably never specifically polled every one of them on, and that whatever numbers went into "most", there are at least 2 to add to the other column.
You are expressing your view, I am expressing mine, that is all.
Right. And that other team should not use CSV. Adding multiple column headers isn't any more of a solution because it also requires that other team to fix their shit.
In other words it's a classic multi-agent coordination problem (aka network effects), and saying "why doesn't one side unilaterally start acting differently" is a well answered question by game theory.
    Name,string;Qty,units;Weight,kg;Volume,m^3;Just for the sake of it with spaces,string;No don't use comma in the header names - it doesn't make sense,string;
Why not do it the other way? Name;string,Qty;units so that if you just double click to open in excel, or use a naive reader that doesn't understand the format (which most won't), the column names are basically correct?
I think it has more to do with Excel properly recognizing the data type rather than humans doing so. It has been too long since I dealt with it to recall the troubles but I know there are things that should be easy that aren't since you can't tell Excel how to interpret the data.
In the fictional world where Excel understands the second header row, it could just as easily understand a unit delimiter and units in a single header row.
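A sketch of what consuming the "Name;string,Qty;units" style header could look like, for a reader that does understand the convention (the ";" sub-delimiter is this thread's invention, not any standard):

    import csv
    import io

    data = "Name;string,Qty;units,Weight;kg\nWidget,3,1.25\n"
    reader = csv.reader(io.StringIO(data))
    # Each header cell is "name;unit"; a naive reader still sees usable names.
    header = [cell.split(";", 1) for cell in next(reader)]
    names = [h[0] for h in header]                      # ['Name', 'Qty', 'Weight']
    units = {h[0]: h[1] for h in header if len(h) > 1}  # {'Name': 'string', ...}
    rows = list(reader)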