And GNU tools are what most people rush to install on macOS and Windows, which says a lot about people's actual needs vs. the supposed benefits of Unix purism.
I think getting awk to recognize that a field separator within a quoted string should be ignored is a great addition. This is not inconsistent with the "unix way": many, many unix tools recognize that a quoted string should be treated as a single entity. The more unix-like approach would have been to force users to strip the quotes themselves if they want awk to split on field separators within quotes. In hindsight, I'm surprised the quote-respecting option wasn't added a long time ago.
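For a concrete taste of the quote-respecting behavior, here's a minimal sketch using GoAWK's CSV input mode and the @"named-field" syntax shown further down the thread; the comma inside the quoted field is not treated as a separator:

    $ printf 'id,name\n1,"Doe, Jane"\n' | goawk -i csv -H '{ print @"name" }'
    Doe, Jane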
I have grown fond of using miller[0] to handle command line data processing. It handles the standard tabular formats (csv, tsv, json) and has all of the standard data cleanup options. It works on streams, so most operations are not limited by memory.
> First: there are tools like xsv which handles CSV marvelously and jq which handles JSON marvelously, and so on -- but over the years of my career in the software industry I've found myself, and others, doing a lot of ad-hoc things which really were fundamentally the same except for format. So the number one thing about Miller is doing common things while supporting multiple formats: (a) ingest a list of records where a record is a list of key-value pairs (however represented in the input files); (b) transform that stream of records; (c) emit the transformed stream -- either in the same format as input, or in a different format.
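For a flavor of that multi-format workflow, a couple of Miller invocations (written from memory of mlr's CLI, so double-check against mlr --help):

    # CSV in, pretty-printed table out, keeping two columns
    mlr --icsv --opprint cut -f City,Population data.csv

    # CSV in, JSON out, filtering records with Miller's DSL
    mlr --icsv --ojson filter '$Population > 100000' data.csv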
Out of curiosity, I tried these two on that same data & computer as https://news.ycombinator.com/item?id=31356573 , mlr --c2t cat takes 2.96 seconds while xsv cat rows to /dev/null takes 0.434 seconds. So, 14.8X and 2.17X slower than that c2tsv Nim program to do exactly (and only) that conversion. But, yes, yes I am sure perf varies depending on quoting/escaping/column/etc. densities.
This is not an outlier. `mlr` is quite slow, literally off-the-charts slow for our purposes when we benchmarked it against xsv and zsv (see https://github.com/liquidaty/zsv; disclaimer: I'm one of its authors).
For completeness, just one CPU/machine, but a recent checkout of zsv 2tsv (built with -O3 -march=native) on that same file/same computer seems to take 0.380 sec - almost 2X longer than c2tsv's 0.20 sec (built with -mm:arc -d:danger, gcc-11), but zsv 2tsv does seem a little faster than xsv cat rows.
OTOH, zsv count only takes 0.114 sec for me (but of course, as I'm sure you know, that also only counts rows not columns which some might complain about). { EDIT: and I've never tried to time a "parse only" mode for c2tsv. }
BTW, does c2tsv handle multibyte UTF8, \r\n vs \n, regular escapes (e.g. embedded dbl-quote/nl/lf/comma), as well as CSV garbage that doesn't exist in theory but is abundant in the real world (e.g. a dbl-quote inside a cell that did not start with a dbl-quote, malformed UTF8, etc.)? Handling those the same way Excel does added considerable overhead to zsv (and is the reason it could only perform a subset of the processing in SIMD and had to use regular branch code for the rest).
It handles most cases, though maybe not arbitrary garbage that humans might be able to guess at; I don't think rfc4180 covers all of those anyway. c2tsv is UTF8/binary agnostic: it just keys off ASCII commas, newlines, etc. Beats me how one ensures handling anything the "same" way Excel does without actually running Excel's code somehow. { Maybe today's Excel, but next year's or 10 years ago's? } The little state machine could be extended, but it's hard to guess the speed impact until you actually write said extensions.
From a performance perspective, strictly delimiter-separated values { again, ironically redundant ;-) } can be parsed with memchr. On Linux, memchr should be SIMD vectorized at least on x86_64 glibc via ELF 'i' symbols. So, while you give up SIMD on the "messy part" with a byte-at-a-time DFA, you regain it on the other side. (I have no idea if Apple gives you SIMD-vectorized memchr.)
Once sent to a file, segmentation (for parallel handling of segments) is also a simple application of memchr rather than needing an index of where rows start: you just split by bytes and find the next newline char (roughly). This can get you 16..128X speed-ups (today, anyway, on just one host) depending upon what you do.
Conversion to something properly byte-delimited basically restores whatever charm you might have thought ?SV had. I can only imagine a few corner cases where running directly off a complex format like quoted CSV makes sense ("tiny" data, "cannot/will not spend 2X space + must save input", "cannot/will not spend time to recompress", "running on a network filesystem shared with those who refuse simplicity"). These cases are not common (for me). When they do happen, perf is usually limited by other things like network IO, start-up overheads, etc. Usually that little extra bit to write buffers out to a pipeline will either not matter or be outright immediately repaid in parallelism, parsing simplicity, or both.
Converting from any ASCII to even faster binary formats has a similar story, but usually with even more perf improvement (depending..) and more "choices" like how to represent strings [1]. Fully pre-parsed, the performance of conversion matters much less. (Whatever the ratio of processings per initial parse is.) Between both parallelism and ASCII->binary, however fast you make your serial zsv parser/ETL stuff, actual data analysis may still run 10,000 times slower than it could be on just 1 CPU (depending upon what throttles your workloads..you may only get 10000x for CPU local L1 resident nested loop stuff). { But we veer now toward trying to cram a databases course into an HN comment. :) And I'm probably repeating myself/others. Direct email from here may work better. }
Thanks for mentioning it, will try it out. Did you use the default build settings for zsv (i.e. just plain old "make install")? Also, do you have a copy or location of the dataset you used to test on? And what hardware/OS, if I may ask?
A big problem is tooling - what they currently support, and the fact that excel will f*ck up anything.
I suppose the best way would be a suite of tools to edit them that are compatible with existing editors? TBH a great value-add for adoption would be versioning s.t. you can see (and revert) when other tools mess up your files.
See the "Conventions for lossless conversion to TSV" on the current version of https://en.wikipedia.org/wiki/Tab-separated_values . There are really only 3 chars to escape - the escape char, and 2 delimiters - to make everything easy to deal with (even binary data in fields - what "lossless" means here).
(Python code for ASCII-separated values. TL;DR: it's so stupid simple that I feel any objection to using this format has just got to be wrong. It's like "Hey, we don't have to keep stabbing ourselves in the face." ... "But how will we use our knives?" See? It's like that.)
More generally, how do I edit the entire data file, including field/record/etc. separators, in an editor that can display only characters that are valid content of data fields?
(Not that CSV (barring de-facto-nonstandard extensions like rfc4180's `"b CRLF bb"` and `"b""bb"`) is any good for this either, of course.)
For example, AFAICR Notepad++ displays those characters. Windows doesn't have any ultra-easy way to input them, but within Notepad++ you can copy-paste them.
These delimiters, you mean? You don't. They're delimiters, they never go in a field. Bam, problem solved. (Or rather, problem didn't exist in the first place.)
Yes, and I suspect that this is why it was never added to AWK. I, and I am sure most people, have an AWK filter to transform csv or whatever format into a format that AWK can use with an appropriate FS and RS.
Their "but why" section should really go into more detail about why filters are not 100% if your data has the possibility of containing your preferred line/record separator.
But real world, I have never had a situation where a csv-to-awk filter (written in awk of course) did not work.
Not just "regular awk", but "regular most things", like head, tail, grep, etc. The "things" just need to be 8-bit/NUL handling clean and the data could even have binary fields. I mean, if there is a library like Python/Go/etc. that you especially trust to write a streaming converter then that is a fine approach, but otherwise see both https://news.ycombinator.com/item?id=31352517 and https://news.ycombinator.com/item?id=31352704 (at least!)
Closely related- I remain mystified as to why FIX Protocol chose to use control characters to separate fields, but used SOH rather than RS/US or the like.
USV is like CSV and simpler because of no escaping and no quoting. I can donate $50 to you or your charity of choice as a token of thanks and encouragement.
That makes no sense. Sure, they've chosen significantly less common separator characters than something like ',', but they are still characters that may appear in the data. How do you represent a value containing ␟ (Unit Separator) in USV?
In-band signalling is ever going to remain in-band signalling. And in-band signalling will need escaping.
USV has no escaping on purpose because it's simpler and faster.
USV is for the 99.999% of cases that don't embed USV Control Picture characters in the content.
If you need escaping, then I can suggest you consider USVX which is USV + extensions for escaping and other conveniences, or you can use your content's own escaping such as ampersand escaping for HTML text, or you can use any other format such as JSON, XML, SQL, etc.
> USV is for the 99.999% of cases that don't embed USV Control Picture characters in the content.
I agree that wanting to put one of these characters in a csv is super rare. The vast majority of use cases would not notice that certain input characters are prohibited. But certain input characters _are_ prohibited, which means that encoders must check for this case, and tools using those encoders need to handle its failure.
In a way, the fact that failure will be super rare will make it more dangerous, because people will omit these checks because "it won't happen for them" -- and then at some point it will wreak havoc.
And with USVX all we've gained is fewer backslashes but more bytes per separator, so for usual data that doesn't contain a comma in every field, the USVX encoding will even increase file size, without requiring less code anywhere.
I admit there is no ideal solution, though; while out-of-band signalling (i.e. length-prefixing) avoids all of these issues (which is why binary formats are universally length-prefixed), escaped formats work much better for humans. And if a human didn't need to read it, one usually wouldn't use csv anyway.
You just reminded me of the argument against seat belts based on the assumption that they'll encourage riskier driving thereby leading to more deaths.
Not doing the obviously correct, superior, easy strategy because there's some infrequent weird corner case prevents any kind of progress. Perfection is the enemy of good enough, engineering is balancing constraints, ad nauseam...
> In-band signalling is [n]ever going to remain in-band signalling
This is the root of all spreadsheet evil, right here.
TSV. Replace/strip/ban tab + newline from fields when writing. Done. If you need those characters, encode them using backslashes if you absolutely must (i.e. you're an Excel weenie doing Excel weenie things that really aren't "spreadsheet" things but since you only know of one hammer you then use that hammer to do everything)
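The "ban them when writing" side is a one-liner in the shell; a sketch with hypothetical $name and $comment values:

    # flatten the two forbidden characters to spaces before joining fields with tabs
    clean() { printf '%s' "$1" | tr '\t\n' '  '; }
    printf '%s\t%s\n' "$(clean "$name")" "$(clean "$comment")" >> out.tsv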
That has the same in-band problems as existing specs, which is that things generating the data need to have that callout.
There are many cases where the "solution" to the CSV inband signalling problem was to just reject values with commas, because they should never come in and if they ever do they should be investigated because they weren't valid data for whatever the CSV was storing. The whole problem is that programmers don't think to do that. The siren call of the string append function is just too strong, especially when programmers don't even realize they should be resisting.
Thanks. I haven't tried this, but it should actually work already in standard AWK input mode with FS and RS set to those Unicode separators. I'll test it tomorrow when back at my laptop.
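Something like this ought to do it (untested, per the above; gawk handles a multibyte FS/RS in a UTF-8 locale, with ␟ = U+241F as the unit/field separator and ␞ = U+241E as the record separator):

    gawk 'BEGIN { FS = "␟"; RS = "␞" } { print "record", NR, "first field:", $1 }' data.usv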
I've been toying with the idea of starting a change.org campaign, asking Microsoft to add support to Excel for importing and exporting such files. I don't see any way to make any progress if Microsoft won't push the industry forward.
When you have "format wars", the best idea is usually to have a converter program change to the easiest to work with format - unless this incurs a space explosion as per some image/video formats.
With CSV-like data, bulk conversion from quoted-escaped RFC4180 CSV to a simpler-to-parse format is the best plan for several reasons.

First, it may "catch on" and help Microsoft/R/whoever embrace the format, and in doing so squash many bugs written by "data analyst/scientist coders".

Second, in a shell, "a|b" runs programs a & b in parallel on multi-core and allows things like csv2x|head -n10000|b or popen("csv2x foo.csv"); see the sketch just below.

Third, bulk conversion to a random-access file where literal delimiters cannot occur as non-delimiters allows trivial file segmentation to be nCores times faster (under often-satisfied assumptions). There are some D tools for this bulk convert in https://github.com/eBay/tsv-utils and a much smaller stand-alone Nim tool https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim . Optional quoting was always going to be a PITA due to its non-locality: what if there is no quote anywhere?

Fourth, by using a program as the unit of modularity, you make things programming-language agnostic. Someone could go to town and write a pure SIMD/AVX512 converter, in assembly even, and solve the problem "once and for all" on a given CPU. The problem is actually just simple enough that this smells possible.
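A sketch of that convert-once-then-use-plain-tools flow (I'm assuming c2tsv streams stdin to stdout; adjust to its actual invocation):

    c2tsv < big.csv > big.tsv        # one-time, streaming conversion to strictly-delimited TSV
    head -n 10000 big.tsv | awk -F'\t' '{ s += $3 } END { print s }'    # dumb-fast tools downstream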
I am unaware of any "document" that "standardizes" this escaped/lossless TSV format. { Maybe call it "DSV" for delimiter separated values where "delimiters actually separate"? Ironically redundant. ;-) } Someone want to write an RFC or point to one? It can be just as "general/lossless" (see https://news.ycombinator.com/item?id=31352170).
Of course, if you are going to do a lot of data processing against some data, it is even better to parse all the way down to binary so that you never have to parse again (well, unless you call CPUs loading registers "parsing"), which is what database systems have been doing since the 1960s.
Someone linked a wikipedia format guide for TSV, but the world seems to have settled on using the escape codes \\, \t, \n with their obvious meanings, then allowing arbitrary binary.
That should be parallelism friendly, even with UTF-8, where an ascii tab or newline byte always means tab or newline.
That someone was me. I don't think of "could change at any time Wikipedia" as "as authoritative" as the "document" should be. :-) { EDIT: and I very much agree it is friendlier in almost any thinkable way except maybe Excel might not support it. Or maybe it does? You do need to unescape binary fields at a "higher level" of usage, of course, when delimiting is no longer an issue. Also, merged my posts. }
A fast streaming converter into my suggested "DSV" can also be faster end-to-end. These kinds of things can vary a lot based upon how many columns rows have. I could not find "huge.csv". So, to be specific/possibly reproducible, using the 151492068 bytes of data from here:
http://burntsushi.net/stuff/worldcitiespop.csv
put into /dev/shm and making a symlink to huge.csv and then using the csvbench.sh in the goawk distro, I got (best of 3 elapsed times):
Go vs. Goawk time ratios were similar to Ben's article but inverted. Probably a number of columns effect. frawk failed to compile for me because my Rust was not new enough, according to the error messages. On the same data:
A little 100 line Nim program combined with a standard utility seems to be about 4x faster (0.84/0.23) than Go results even though said program writes out all the data again. How can this be? Well, my pipe IO is usually around 4.4 GB/s (as assessed by a dd piped to a read-only sink) while 151e6/.2=only 755 MB/s. So it need only use ~17% of available pipe BW.
That's written in Python and uses the "agate" library which uses Python's built-in "csv" module. I did a couple of simple benchmarks against Python's csv module in the article: https://benhoyt.com/writings/goawk-csv/#performance (Go/GoAWK is a bit more than 3x as fast)
I also did a quick test using csvcut to pull out a single field, and compared it to GoAWK. Looks like GoAWK is about 4x as fast here:
$ time csvcut -c agency_id huge.csv >/dev/null
real 0m25.977s
user 0m25.240s
sys 0m0.424s
$ time goawk -i csv -H -o csv '{ print @"agency_id" }' huge.csv >/dev/null
real 0m6.584s
user 0m7.434s
sys 0m0.480s
The above works perfectly well; it handles quoted fields, or even just unquoted fields.... This snippet is taken from a presentation I give on AWK and BASH scripting.
That's the thing about AWK: it already does everything. No need to extend it much at all.
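For reference (and not necessarily the parent's snippet), the FPAT idiom from the gawk manual handles commas inside quoted fields, though not embedded newlines or every corner case:

    gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print "third field:", $3 }' file.csv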
In my country (Spain) we traditionally use commas as a decimal separator, but I think CSV should not support this.
The way I see it, CSV's purpose is information storage and transfer, not presentation.
Presentation is where you decide the font face, the font size, cell background color (if you are viewing the file through a spreadsheet editor) etc. Number formatting belongs here.
In information transfer and storage it's much more important that the parsing/generation functions are as simple as possible. So let's go with one numeric format only. I think we should use the dot as a decimal separator since it's the most common separator in all programming languages. Maybe extend it to include exponential notation as well, because that is what other languages like json support. But that's it.
I hold the same opinion about dates tbh.
(The same goes for dates, btw - yyyy-mm-dd or death)
The other distasteful thing is that getting your hands on the actual ISO 8601 spec is pretty expensive, like 350 CHF, which is weird, because so many national standards bodies gave it their official blessing, and so many things in the digital universe depend on it. Maybe that 'T' is optional after all!
My opinion is that it was a historical mistake. I think if the format had been clearly specified in Microsoft Excel, humanity would have saved billions in parsing labour.
On the question of whether new parsers should be able to read it... if I have to build or maintain them, then *those* parsers shouldn't be able to read it. I don't want to have to deal with the whole messiness of the thing, and I know how to change Excel's locale to generate CSV in a reasonable format. If it's something that someone else maintains and I can mostly ignore the crazy number formatting as well as other quirks, I could live with it. But I would still prefer they didn't implement any of it, because the extra code could make the parts that I need worse (for example, by provoking a segfault, or making a line ambiguous).
Even if it's parsers that other people use and I never use, I still would prefer if they didn't do it, because that would increase the overall possibility of me having to deal with another (sigh) semicolon-separated CSV file.
Somebody told me that the official international standard prefers comma; a period is just a permitted alternative. Of course, comma as decimal separator isn't very compatible with programming languages.
However, the decimal separator that I was taught at primary school and have used in handwriting all my life is neither comma, nor period, but '·'.
What, why? I've seen values with a comma as the decimal separator, but I thought that was a weird English thing. What do these places use to separate larger values (e.g. at each 10^3 magnitude increase) to make manual counting easier?
Yes, using a dot for grouping is a common solution. But it seems to be fading out because of the conflict. Other solutions are a space (12 345,67) or apostrophe also known as "highcomma" (12'345,67).
Exactly. The "official way" of writing big numbers around here (I know at least the balkans, germany and austria) is `1.000.000,00 €` (1 million). So what you wrote, but always at least 2 decimal places (or zero if it's not a financial document).
But if I may add my 2 cents - I prefer 1,000.00 because I think it is similar to commas and full stops in writing, where a comma is mainly a reading/speaking guide to help break up a sentence (or in this case a number) and a full stop is a much harder termination of a sentence (in the case of a number, the official point where the int/non-fractional part ends).
What is worse is the ignorance on the English-speaking side. We have been putting up with your stuff for a few decades now, and almost the entire population needs to know about it, while we see near-zero attempts to adapt from your side.
I'd argue the decimal separator is more important and should therefore be more visible; the comma is larger than the dot and extends below the baseline so is easier to see.
Germany has been metric forever but has the comma as a decimal separator, and the dot as a thousands group separator. It’s deeply engrained in our typography, it has typographical idioms such as ,— (comma em-dash), up to the point that even brands have adopted it in their brand name.
I’m afraid we can’t really change that on a whim. Why would we, anyway?
America is not the world; at least in the western world[0] there are more countries and languages that follow the German convention. Also, there is more to the world than software (although that seems to be changing).
Call me crazy, but I'm under the impression that Germans do use international software, and that Germany may even be home to the fourth largest software company in the world.
And in a largely successful attempt at German self-parody, that software is hugely complex and laborious; in short, over-engineered.
It is the only software I know of where you regularly hear of "implementation" -- which in this context means installing the software, not writing it -- projects being planned to take years, actually taking twice as long, and sometimes being abandoned altogether when the prospective client finds out it just can't be done.
Are you certain that more people, numerically speaking, use the comma as a decimal separator? India and China do not, although admittedly India has another decimal grouping system.
If we are aiming to create a common international system of representing the fractional part of a decimal number, it makes sense to start with the system that is already in use by the most people.
No, it does not. The countries' individual choices have more weight than their number of citizens. It's probably also more work to change laws and regulations than to let people adapt to the changes.
This is basically the classical problem of democratic systems, when they need to balance the interests of different sized groups. Numbers alone don't make a fair solution. And unless you have the power to force them, you will not convince everyone to follow you just by arguing with numbers anyway.
You are the one who proposed that the majority should not follow the minority, but now you are also saying that the minority should not follow the majority either. If your point is that nobody should follow anybody because freedom is more important than standardization, that's fine, but then it would have been clearer to just say that in the first place.
In my opinion, when we are talking about a data interchange format like CSV, having a simple, common format would be far more practical and efficient than allowing each country to decide for itself its own standard. Having dealt with exactly this problem in a global SaaS product where a minority of clients submitted CSV files with commas for decimal separators, I can say it would have made the parsing code a lot simpler and more robust if our system (and countless others like it) did not need to build in exceptions for this minority use case.
Don't worry, some of us are still fighting for the gloriously base-2 US customary system; base-10 is only for people who don't float. And the 30cm foot.
Fair comeback. I think of CSV as modern, but Wikipedia tells me it's almost as old as AWK (depending on how you count). It seems to me it's used more heavily now as an exchange format, compared to say 15-20 years ago, but I could be wrong.
JSON is an exchange format... sqlite is an exchange format... even protocol buffers are an exchange format...
CSV is only an exchange format if there are no user-generated strings in the data. If there are, then you'll almost certainly screw up the encoding when someone's name has a newline or comma or speech mark in it, or some obscure unicode, etc. Even more so if awk is part of your toolkit.
That may have been more true years ago, but now quoting is pretty well defined with RFC 4180, and most tools seem to produce and consume RFC 4180-compatible CSV (which properly handles commas, quotes, and even newlines in fields). That said, there still are too many non-standard or quirky CSV files out there.
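A quick sanity check of that, using the goawk syntax shown elsewhere in the thread (embedded comma plus doubled quotes, per RFC 4180):

    $ printf 'name,notes\n"Smith, Jane","said ""hi"""\n' | goawk -i csv -H '{ print @"notes" }'
    said "hi"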
> and most tools seem to produce and consume RFC 4180-compatible CSV
Laughs in SSIS…
There are some significant tools (or common add-ins for them) that don't entirely respect RFC4180. Though I see few files that breach it these days, there are tools that break with conforming files (looking at you, Excel, trying to be clever about anything that isn't conclusively provable not to be a date).
Our clients use it all the time, to the point where we'd lose sales if we didn't support it, but CSV is far from a safe way to transport data IMO. Each time a new requirement to deal with CSV comes in I treat it as a custom format that may or may not be something like RFC4180.
It is a nice addition, but I would like to see this taken further: structural regular expression awk. It has been waiting to be implemented for 35 years now.
>A big thank-you to the library of the University of Antwerp, who sponsored this feature. They’re one of two major teams or projects I know of that use GoAWK – the other one is the Benthos stream processor.
That's great to hear.
Are you planning to add support for xml, json, etc next? Something like Python's `json` module that gives you a dictionary object.
I'm not considering adding general structured formats like XML or JSON, as they don't fit AWK's "fields and records" model or its simplistic data structure (the associative array) very well. However, I have considered adding JSON Lines support, where each line is a record, and fields are indexable using the new @"named-field" syntax (possibly nested like @"foo.bar").
For non-awk tools, csvformat (from csvkit) will unquote and re-delimit a CSV file (-D\034 -U -B) into something that UNIX pipes can handle (cut -d\034, etc.). It's worth setting up as an alias, and you can store \034 in $D or whatever for convenience.
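Something along these lines (a sketch; -U takes a quoting level in the csvkit versions I've seen, with 3 meaning "quote none", so check csvformat --help for yours):

    D=$'\034'                                 # ASCII field separator, octal 034
    csvformat -D "$D" -U 3 -B data.csv | cut -d "$D" -f 1,3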
During a recent HN discussion on pipes and text versus structured objects to transfer data between programs, I started wondering if CSV wouldn't be a nice middle ground.
I phrase that carefully. "Better"? "Worse"? Very subjective. But in the current environment, "likely to beat out CSV"? Oh, most definitely yes.
A solid upside is a single encoding story for JSON. CSV is a mess and can't be un-messed now. Size bloat from endless repetition of the object keys is a significant disadvantage, though.
I'm not familiar with F#, but I do hate CSV tools that try type inference on data; in my opinion the csvkit tools should have the -I option on by default.
The thing is, I've used .NET a lot, and C# and F# I can code in my sleep. The same object system, integrated in PowerShell, makes it really hard to use.
I can't tell whether the UNIX people have lost their way, or just the demands of modern shell scripts cannot be met by typical shell philosophy - that is, piping together the output of small, orthogonal utilities.
The emergence and constantly increasing complexity of these small, bespoke DSLs like this or jq does not inspire confidence in me.
> demands of modern shell scripts cannot be met by typical shell philosophy
That. Pipes and unstructured binary data aren't compositional enough, making the divide too large between the kinds of things you can express in the language you use to write a pipeline stage and the kinds of things you can express by building a pipeline.
>In general, using FPAT to do your own CSV parsing is like having a bed with a blanket that’s not quite big enough. There’s always a corner that isn’t covered. We recommend, instead, that you use Manuel Collado’s CSVMODE library for gawk.
A good and useful addition. There's a mention of CSVMODE, a gawk library. I wonder if it could be extended to support the functionality that goawk's `-i csv` has.
Obviously, this does not tell the whole story, as this test was limited to "count" and an interpreted language is expected to always be slower than a precompiled command, but it might be relevant to a user deciding which tool to use. It also might be instructive as to the room for improvement in the go code (or possibly the go code could use the c lib); I note that even if the goawk command is '{}' the runtime is still about the same.
full results:
---
goawk:
1000001 7000007 real 0m0.435s user 0m0.435s sys 0m0.031s
1000001 7000007 real 0m0.413s user 0m0.419s sys 0m0.024s
1000001 7000007 real 0m0.425s user 0m0.430s sys 0m0.024s
xsv:
1000000 real 0m0.157s user 0m0.141s sys 0m0.013s
1000000 real 0m0.156s user 0m0.141s sys 0m0.012s
1000000 real 0m0.158s user 0m0.142s sys 0m0.013s
zsv:
1000000 real 0m0.066s user 0m0.053s sys 0m0.010s
1000000 real 0m0.077s user 0m0.060s sys 0m0.012s
1000000 real 0m0.069s user 0m0.056s sys 0m0.010s
python:
1000001 7000007 real 0m1.589s user 0m1.553s sys 0m0.026s
1000001 7000007 real 0m1.583s user 0m1.550s sys 0m0.025s
1000001 7000007 real 0m2.122s user 0m1.675s sys 0m0.037s
---
The script for this was:
---
echo 'goawk:'
(time goawk -i csv '{ w+=NF } END { print NR, w }' < worldcitiespop_mil.csv) 2>&1 | xargs
(time goawk -i csv '{ w+=NF } END { print NR, w }' < worldcitiespop_mil.csv) 2>&1 | xargs
(time goawk -i csv '{ w+=NF } END { print NR, w }' < worldcitiespop_mil.csv) 2>&1 | xargs
Thanks for that! zsv looks amazing. Yeah, it's definitely going to whip GoAWK, what with SIMD parsing and careful memory handling. I've done a couple of basic things for GoAWK's CSV performance, but haven't profiled or looked at allocation bottlenecks (absolute performance definitely wasn't my first focus).
Yeah, sorry about huge.csv -- I found it online somewhere originally by searching for something like "large csv example", but can't for the life of me find it now. It's a monster 1.5GB file with 286 columns including quoted fields (whereas worldcitiespop only has a few columns and it doesn't look like it has quoted fields). I can upload to a file transfer service and send a link to you if you want ... though I should really update my benchmarks to use an easily-downloadable file instead.
If you're just using AWK features, mawk is still the fastest. GoAWK is faster than awk in most cases, on a par with gawk, but still not as fast as mawk (see https://benhoyt.com/writings/goawk-compiler-vm/#virtual-mach...) ... except for scripts that make heavy use of regexen. Unfortunately Go's regexp package is still quite slow. You could also try frawk, which is a JIT-optimized AWK written in Rust -- it's really fast, and shares some of the CSV features (but not the @"named-field" syntax).
But for most everyday usage, even for large inputs, GoAWK's performance is quite sufficient. The CSV support, and its use as a Go library -- they're more important to me than raw speed at this point.
You could just replace those commas within the quotes with some other string, or remove them, using some awk or sed. Here is a somewhat relevant example from stackoverflow:
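(Not the stackoverflow example referred to above, but the usual idiom is something like the following: split the line on double quotes so every even-numbered field is inside quotes, and strip commas only there. Doubled "" escapes and embedded newlines will break it.)

    awk -F'"' -v OFS='"' '{ for (i = 2; i <= NF; i += 2) gsub(/,/, "", $i) } 1' file.csv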