And GNU tools are what most people rush to install on macOS and Windows, which says a lot about people's actual needs vs. the supposed benefits of Unix purism.
I think getting awk to recognize that a field separator within a quoted string should be ignored is a great addition. This is not inconsistent with the "unix way": many, many unix tools recognize that a quoted string should be treated as a single entity. The more unix-like approach would have been to force users to strip the quotes themselves if they want awk to split on field separators within quotes. In hindsight, I'm surprised the quote-respecting option wasn't added a long time ago.
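For a concrete taste of the quote-respecting behavior, here's a minimal sketch using GoAWK's CSV input mode and the @"named-field" syntax shown further down the thread; the comma inside the quoted field is not treated as a separator:

    $ printf 'id,name\n1,"Doe, Jane"\n' | goawk -i csv -H '{ print @"name" }'
    Doe, Jane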
I have grown fond of using miller[0] to handle command line data processing. It handles the standard tabular formats (csv, tsv, json) and has all of the standard data cleanup options. It works on streams, so most operations are not limited by memory.
> First: there are tools like xsv which handles CSV marvelously and jq which handles JSON marvelously, and so on -- but over the years of my career in the software industry I've found myself, and others, doing a lot of ad-hoc things which really were fundamentally the same except for format. So the number one thing about Miller is doing common things while supporting multiple formats: (a) ingest a list of records where a record is a list of key-value pairs (however represented in the input files); (b) transform that stream of records; (c) emit the transformed stream -- either in the same format as input, or in a different format.
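For a flavor of that multi-format workflow, a couple of Miller invocations (written from memory of mlr's CLI, so double-check against mlr --help):

    # CSV in, pretty-printed table out, keeping two columns
    mlr --icsv --opprint cut -f City,Population data.csv

    # CSV in, JSON out, filtering records with Miller's DSL
    mlr --icsv --ojson filter '$Population > 100000' data.csv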
Out of curiosity, I tried these two on that same data & computer as https://news.ycombinator.com/item?id=31356573 , mlr --c2t cat takes 2.96 seconds while xsv cat rows to /dev/null takes 0.434 seconds. So, 14.8X and 2.17X slower than that c2tsv Nim program to do exactly (and only) that conversion. But, yes, yes I am sure perf varies depending on quoting/escaping/column/etc. densities.
This is not an outlier. `mlr` is quite slow, literally off-the-charts slow for our purposes when we benchmarked it against xsv and zsv (see https://github.com/liquidaty/zsv; disclaimer: I'm one of its authors).
For completeness, just one CPU/machine, but a recent checkout of zsv 2tsv (built with -O3 -march=native) on that same file/same computer seems to take 0.380 sec - almost 2X longer than c2tsv's 0.20 sec (built with -mm:arc -d:danger, gcc-11), but zsv 2tsv does seem a little faster than xsv cat rows.
OTOH, zsv count only takes 0.114 sec for me (but of course, as I'm sure you know, that also only counts rows not columns which some might complain about). { EDIT: and I've never tried to time a "parse only" mode for c2tsv. }
BTW, does c2tsv handle multibyte UTF8, \r\n vs \n, regular escapes (e.g. embedded dbl-quote/nl/lf/comma), as well as CSV garbage that doesn't exist in theory but is abundant in the real world (e.g. a dbl-quote inside a cell that did not start with a dbl-quote, malformed UTF8, etc.)? Handling those the same way Excel does added considerable overhead to zsv (and is the reason it could only perform a subset of the processing in SIMD and had to use regular branch code for the rest).
It handles most cases, though maybe not arbitrary garbage that humans might be able to guess at; I don't think rfc4180 covers all of those anyway. c2tsv is UTF8/binary agnostic: it just keys off ASCII commas, newlines, etc. Beats me how one ensures handling anything the "same" way Excel does without actually running Excel's code somehow. { Maybe today's Excel, but next year's or 10 years ago's? } The little state machine could be extended, but it's hard to guess the speed impact until you actually write said extensions.
From a performance perspective, strictly delimiter-separated values { again, ironically redundant ;-) } can be parsed with memchr. On Linux, memchr should be SIMD vectorized at least on x86_64 glibc via ELF 'i' symbols. So, while you give up SIMD on the "messy part" with a byte-at-a-time DFA, you regain it on the other side. (I have no idea if Apple gives you SIMD-vectorized memchr.)
Once sent to a file, segmentation (for parallel handling of segments) is also a simple application of memchr rather than needing an index of where rows start: you just split by bytes and find the next newline char (roughly). This can get you 16..128X speed-ups (today, anyway, on just one host) depending upon what you do.
Conversion to something properly byte-delimited basically restores whatever charm you might have thought ?SV had. I can only imagine a few corner cases where running directly off a complex format like quoted CSV makes sense ("tiny" data, "cannot/will not spend 2X space + must save input", "cannot/will not spend time to recompress", "running on a network filesystem shared with those who refuse simplicity"). These cases are not common (for me). When they do happen, perf is usually limited by other things like network IO, start-up overheads, etc. Usually that little extra bit to write buffers out to a pipeline will either not matter or be outright immediately repaid in parallelism, parsing simplicity, or both.
Converting from any ASCII to even faster binary formats has a similar story, but usually with even more perf improvement (depending..) and more "choices" like how to represent strings [1]. Fully pre-parsed, the performance of conversion matters much less. (Whatever the ratio of processings per initial parse is.) Between both parallelism and ASCII->binary, however fast you make your serial zsv parser/ETL stuff, actual data analysis may still run 10,000 times slower than it could be on just 1 CPU (depending upon what throttles your workloads..you may only get 10000x for CPU local L1 resident nested loop stuff). { But we veer now toward trying to cram a databases course into an HN comment. :) And I'm probably repeating myself/others. Direct email from here may work better. }
Thanks for mentioning it, will try it out. Did you use the default build settings for zsv (i.e. just plain old "make install")? Also, do you have a copy or location of the dataset you used to test on? And what hardware/OS, if I may ask?
A big problem is tooling - what they currently support, and the fact that excel will f*ck up anything.
I suppose the best way would be a suite of tools to edit them that are compatible with existing editors? TBH a great value-add for adoption would be versioning s.t. you can see (and revert) when other tools mess up your files.
See the "Conventions for lossless conversion to TSV" on the current version of https://en.wikipedia.org/wiki/Tab-separated_values . There are really only 3 chars to escape - the escape char, and 2 delimiters - to make everything easy to deal with (even binary data in fields - what "lossless" means here).
(Python code for ASCII-separated values. TL;DR: it's so stupid simple that I feel any objection to using this format has just got to be wrong. It's like "Hey, we don't have to keep stabbing ourselves in the face." ... "But how will we use our knives?" See? It's like that.)
More generally, how do I edit the entire data file, including field/record/etc. separators, in an editor that can display only characters that are valid content of data fields?
(Not that CSV (barring de-facto-nonstandard extensions like rfc4180's `"b CRLF bb"` and `"b""bb"`) is any good for this either, of course.)
For example, AFAICR Notepad++ displays those characters. Windows doesn't have any ultra-easy way to input them, but within Notepad++ you can copy-paste them.
These delimiters, you mean? You don't. They're delimiters, they never go in a field. Bam, problem solved. (Or rather, problem didn't exist in the first place.)
Yes, and I suspect that this is why it was never added to AWK. I, and I am sure most people, have an AWK filter to transform csv or whatever format into a format that AWK can use with an appropriate FS and RS.
Their "but why" section should really go into more detail about why filters are not 100% if your data has the possibility of containing your preferred line/record separator.
But real world, I have never had a situation where a csv-to-awk filter (written in awk of course) did not work.
Not just "regular awk", but "regular most things", like head, tail, grep, etc. The "things" just need to be 8-bit/NUL handling clean and the data could even have binary fields. I mean, if there is a library like Python/Go/etc. that you especially trust to write a streaming converter then that is a fine approach, but otherwise see both https://news.ycombinator.com/item?id=31352517 and https://news.ycombinator.com/item?id=31352704 (at least!)
Closely related- I remain mystified as to why FIX Protocol chose to use control characters to separate fields, but used SOH rather than RS/US or the like.
USV is like CSV and simpler because of no escaping and no quoting. I can donate $50 to you or your charity of choice as a token of thanks and encouragement.
That makes no sense. Sure, they've chosen significantly less common separator characters than something like ',', but they are still characters that may appear in the data. How do you represent a value containing ␟ (Unit Separator) in USV?
In-band signalling is ever going to remain in-band signalling. And in-band signalling will need escaping.
USV has no escaping on purpose because it's simpler and faster.
USV is for the 99.999% of cases that don't embed USV Control Picture characters in the content.
If you need escaping, then I can suggest you consider USVX which is USV + extensions for escaping and other conveniences, or you can use your content's own escaping such as ampersand escaping for HTML text, or you can use any other format such as JSON, XML, SQL, etc.
> USV is for the 99.999% of cases that don't embed USV Control Picture characters in the content.
I agree that wanting to put one of these characters in a csv is super rare. The vast majority of use cases would not notice that certain input characters are prohibited. But certain input characters _are_ prohibited, which means that encoders must check for this case, and tools using those encoders need to handle its failure.
In a way, the fact that failure will be super rare will make it more dangerous, because people will omit these checks because "it won't happen for them" -- and then at some point it will wreak havoc.
And with USVX all we've gained is fewer backslashes but more bytes per separator, so for usual data that doesn't contain a comma in every field, the USVX encoding will even increase file size, without requiring less code anywhere.
I admit there is no ideal solution, though; while out-of-band signalling (i.e. length-prefixing) avoids all of these issues (which is why binary formats are universally length-prefixed), escaped formats work much better for humans. And if a human didn't need to read it, one usually wouldn't use csv anyway.
You just reminded me of the argument against seat belts based on the assumption that they'll encourage riskier driving thereby leading to more deaths.
Not doing the obviously correct, superior, easy strategy because there's some infrequent weird corner case prevents any kind of progress. Perfection is the enemy of good enough, engineering is balancing constraints, ad nauseam...
> In-band signalling is [n]ever going to remain in-band signalling
This is the root of all spreadsheet evil, right here.
TSV. Replace/strip/ban tab + newline from fields when writing. Done. If you need those characters, encode them using backslashes if you absolutely must (i.e. you're an Excel weenie doing Excel weenie things that really aren't "spreadsheet" things but since you only know of one hammer you then use that hammer to do everything)
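The "ban them when writing" side is a one-liner in the shell; a sketch with hypothetical $name and $comment values:

    # flatten the two forbidden characters to spaces before joining fields with tabs
    clean() { printf '%s' "$1" | tr '\t\n' '  '; }
    printf '%s\t%s\n' "$(clean "$name")" "$(clean "$comment")" >> out.tsv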
That has the same in-band problems as existing specs, which is that things generating the data need to have that callout.
There are many cases where the "solution" to the CSV inband signalling problem was to just reject values with commas, because they should never come in and if they ever do they should be investigated because they weren't valid data for whatever the CSV was storing. The whole problem is that programmers don't think to do that. The siren call of the string append function is just too strong, especially when programmers don't even realize they should be resisting.
Thanks. I haven't tried this, but it should actually work already in standard AWK input mode with FS and RS set to those Unicode separators. I'll test it tomorrow when back at my laptop.
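Something like this ought to do it (untested, per the above; gawk handles a multibyte FS/RS in a UTF-8 locale, with ␟ = U+241F as the unit/field separator and ␞ = U+241E as the record separator):

    gawk 'BEGIN { FS = "␟"; RS = "␞" } { print "record", NR, "first field:", $1 }' data.usv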
I've been toying with the idea of starting a change.org campaign, asking Microsoft to add support to Excel for importing and exporting such files. I don't see any way to make any progress if Microsoft won't push the industry forward.
When you have "format wars", the best idea is usually to have a converter program change to the easiest to work with format - unless this incurs a space explosion as per some image/video formats.
With CSV-like data, bulk conversion from quoted-escaped RFC4180 CSV to a simpler-to-parse format is the best plan for several reasons.

First, it may "catch on" and help Microsoft/R/whoever embrace the format, and in doing so squash many bugs written by "data analyst/scientist coders".

Second, in a shell, "a|b" runs programs a & b in parallel on multi-core and allows things like csv2x|head -n10000|b or popen("csv2x foo.csv"); see the sketch just below.

Third, bulk conversion to a random-access file where literal delimiters cannot occur as non-delimiters allows trivial file segmentation to be nCores times faster (under often-satisfied assumptions). There are some D tools for this bulk convert in https://github.com/eBay/tsv-utils and a much smaller stand-alone Nim tool https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim . Optional quoting was always going to be a PITA due to its non-locality: what if there is no quote anywhere?

Fourth, by using a program as the unit of modularity, you make things programming-language agnostic. Someone could go to town and write a pure SIMD/AVX512 converter, in assembly even, and solve the problem "once and for all" on a given CPU. The problem is actually just simple enough that this smells possible.
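A sketch of that convert-once-then-use-plain-tools flow (I'm assuming c2tsv streams stdin to stdout; adjust to its actual invocation):

    c2tsv < big.csv > big.tsv        # one-time, streaming conversion to strictly-delimited TSV
    head -n 10000 big.tsv | awk -F'\t' '{ s += $3 } END { print s }'    # dumb-fast tools downstream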
I am unaware of any "document" that "standardizes" this escaped/lossless TSV format. { Maybe call it "DSV" for delimiter separated values where "delimiters actually separate"? Ironically redundant. ;-) } Someone want to write an RFC or point to one? It can be just as "general/lossless" (see https://news.ycombinator.com/item?id=31352170).
Of course, if you are going to do a lot of data processing against some data, it is even better to parse all the way down to binary so that you never have to parse again (well, unless you call CPUs loading registers "parsing"), which is what database systems have been doing since the 1960s.
Someone linked a wikipedia format guide for TSV, but the world seems to have settled on using the escape codes \\, \t, \n with their obvious meanings, then allowing arbitrary binary.
That should be parallelism friendly, even with UTF-8, where an ascii tab or newline byte always means tab or newline.
That someone was me. I don't think of "could change at any time Wikipedia" as "as authoritative" as the "document" should be. :-) { EDIT: and I very much agree it is friendlier in almost any thinkable way except maybe Excel might not support it. Or maybe it does? You do need to unescape binary fields at a "higher level" of usage, of course, when delimiting is no longer an issue. Also, merged my posts. }
A fast streaming converter into my suggested "DSV" can also be faster end-to-end. These kinds of things can vary a lot based upon how many columns rows have. I could not find "huge.csv". So, to be specific/possibly reproducible, using the 151492068 bytes of data from here:
http://burntsushi.net/stuff/worldcitiespop.csv
put into /dev/shm and making a symlink to huge.csv and then using the csvbench.sh in the goawk distro, I got (best of 3 elapsed times):
Go vs. Goawk time ratios were similar to Ben's article but inverted. Probably a number of columns effect. frawk failed to compile for me because my Rust was not new enough, according to the error messages. On the same data:
A little 100 line Nim program combined with a standard utility seems to be about 4x faster (0.84/0.23) than Go results even though said program writes out all the data again. How can this be? Well, my pipe IO is usually around 4.4 GB/s (as assessed by a dd piped to a read-only sink) while 151e6/.2=only 755 MB/s. So it need only use ~17% of available pipe BW.
That's written in Python and uses the "agate" library which uses Python's built-in "csv" module. I did a couple of simple benchmarks against Python's csv module in the article: https://benhoyt.com/writings/goawk-csv/#performance (Go/GoAWK is a bit more than 3x as fast)
I also did a quick test using csvcut to pull out a single field, and compared it to GoAWK. Looks like GoAWK is about 4x as fast here:
$ time csvcut -c agency_id huge.csv >/dev/null
real 0m25.977s
user 0m25.240s
sys 0m0.424s
$ time goawk -i csv -H -o csv '{ print @"agency_id" }' huge.csv >/dev/null
real 0m6.584s
user 0m7.434s
sys 0m0.480s
The above works perfectly well; it handles quoted fields, or even just unquoted fields.... This snippet is taken from a presentation I give on AWK and BASH scripting.
That's the thing about AWK: it already does everything. No need to extend it much at all.
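For reference (and not necessarily the parent's snippet), the FPAT idiom from the gawk manual handles commas inside quoted fields, though not embedded newlines or every corner case:

    gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print "third field:", $3 }' file.csv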
In my country (Spain) we traditionally use commas as a decimal separator, but I think CSV should not support this.
The way I see it, CSV's purpose is information storage and transfer, not presentation.
Presentation is where you decide the font face, the font size, cell background color (if you are viewing the file through a spreadsheet editor) etc. Number formatting belongs here.
In information transfer and storage it's much more important that the parsing/generation functions are as simple as possible. So let's go with one numeric format only. I think we should use the dot as a decimal separator since it's the most common separator in all programming languages. Maybe extend it to include exponential notation as well, because that is what other languages like json support. But that's it.
I hold the same opinion about dates tbh.
(The same goes for dates, btw - yyyy-mm-dd or death)
The other distasteful thing is that getting your hands on the actual ISO 8601 spec is pretty expensive, like 350 CHF, which is weird, because so many national standards bodies gave it their official blessing, and so many things in the digital universe depend on it. Maybe that 'T' is optional after all!
My opinion is that it was a historical mistake. I think if the format had been clearly specified in Microsoft Excel, humanity would have saved billions in parsing labour.
On the question of whether new parsers should be able to read it... if I have to build or maintain them, then *those* parsers shouldn't be able to read it. I don't want to have to deal with the whole messiness of the thing, and I know how to change Excel's locale to generate CSV in a reasonable format. If it's something that someone else maintains and I can mostly ignore the crazy number formatting as well as other quirks, I could live with it. But I would still prefer they didn't implement any of it, because the extra code could make the parts that I need worse (for example, by provoking a segfault, or making a line ambiguous).
Even if it's parsers that other people use and I never use, I still would prefer if they didn't do it, because that would increase the overall possibility of me having to deal with another (sigh) semicolon-separated CSV file.
Somebody told me that the official international standard prefers comma; a period is just a permitted alternative. Of course, comma as decimal separator isn't very compatible with programming languages.
However, the decimal separator that I was taught at primary school and have used in handwriting all my life is neither comma, nor period, but '·'.
What, why? I've seen values with a comma as the decimal separator, but I thought that was a weird English thing. What do these places use to separate larger values (e.g. at each 10^3 magnitude increase) to make manual counting easier?
Yes, using a dot for grouping is a common solution. But it seems to be fading out because of the conflict. Other solutions are a space (12 345,67) or apostrophe also known as "highcomma" (12'345,67).
Exactly. The "official way" of writing big numbers around here (I know at least the balkans, germany and austria) is `1.000.000,00 €` (1 million). So what you wrote, but always at least 2 decimal places (or zero if it's not a financial document).
But if I may add my 2 cents - I prefer 1,000.00 because I think it is similar to commas and full stops in writing, where a comma is mainly a reading/speaking guide to help break up a sentence (or in this case a number) and a full stop is a much harder termination of a sentence (in the case of a number, the official point where the int/non-fractional part ends).
What is worse is the ignorance on the English-speaking side. We have been putting up with your stuff for a few decades now, and almost the entire population needs to know about it, while we see near-zero attempts to adapt from your side.
I'd argue the decimal separator is more important and should therefore be more visible; the comma is larger than the dot and extends below the baseline so is easier to see.
Germany has been metric forever but has the comma as a decimal separator, and the dot as a thousands group separator. It’s deeply engrained in our typography, it has typographical idioms such as ,— (comma em-dash), up to the point that even brands have adopted it in their brand name.
I’m afraid we can’t really change that on a whim. Why would we, anyway?
America is not the world; at least in the western world[0] there are more countries and languages that follow the German convention. Also, there is more to the world than software (although that seems to be changing).
Call me crazy, but I'm under the impression that Germans do use international software, and that Germany may even be home to the fourth largest software company in the world.
And in a largely successful attempt at German self-parody, that software is hugely complex and laborious; in short, over-engineered.
It is the only software I know of where you regularly hear of "implementation" -- which in this context means installing the software, not writing it -- projects being planned to take years, actually taking twice as long, and sometimes being abandoned altogether when the prospective client finds out it just can't be done.
Are you certain that more people, numerically speaking, use the comma as a decimal separator? India and China do not, although admittedly India has another decimal grouping system.
If we are aiming to create a common international system of representing the fractional part of a decimal number, it makes sense to start with the system that is already in use by the most people.
No, it does not. The countries' individual choices have more weight than their number of citizens. It's probably also more work to change laws and regulations than to let people adapt to the changes.
This is basically the classical problem of democratic systems, when they need to balance the interests of different sized groups. Numbers alone don't make a fair solution. And unless you have the power to force them, you will not convince everyone to follow you just by arguing with numbers anyway.
You are the one who proposed that the majority should not follow the minority, but now you are also saying that the minority should not follow the majority either. If your point is that nobody should follow anybody because freedom is more important than standardization, that's fine, but then it would have been clearer to just say that in the first place.
In my opinion, when we are talking about a data interchange format like CSV, having a simple, common format would be far more practical and efficient than allowing each country to decide for itself its own standard. Having dealt with exactly this problem in a global SaaS product where a minority of clients submitted CSV files with commas for decimal separators, I can say it would have made the parsing code a lot simpler and more robust if our system (and countless others like it) did not need to build in exceptions for this minority use case.
Don't worry, some of us are still fighting for the gloriously base-2 US customary system; base-10 is only for people who don't float. And the 30cm foot.
Fair comeback. I think of CSV as modern, but Wikipedia tells me it's almost as old as AWK (depending on how you count). It seems to me it's used more heavily now as an exchange format, compared to say 15-20 years ago, but I could be wrong.
JSON is an exchange format... sqlite is an exchange format... even protocol buffers are an exchange format...
CSV is only an exchange format if there are no user-generated strings in the data. If there are, then you'll almost certainly screw up the encoding when someone's name has a newline or comma or speech mark in it, or some obscure unicode, etc. Even more so if awk is part of your toolkit.
That may have been more true years ago, but now quoting is pretty well defined with RFC 4180, and most tools seem to produce and consume RFC 4180-compatible CSV (which properly handles commas, quotes, and even newlines in fields). That said, there still are too many non-standard or quirky CSV files out there.
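A quick sanity check of that, using the goawk syntax shown elsewhere in the thread (embedded comma plus doubled quotes, per RFC 4180):

    $ printf 'name,notes\n"Smith, Jane","said ""hi"""\n' | goawk -i csv -H '{ print @"notes" }'
    said "hi"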
> and most tools seem to produce and consume RFC 4180-compatible CSV
Laughs in SSIS…
There are some significant tools (or common add-ins for them) that don't entirely respect RFC4180. Though I see few files that breach it these days, there are tools that break with conforming files (looking at you, Excel, trying to be clever about anything that isn't conclusively provable not to be a date).
Our clients use it all the time, to the point where we'd lose sales if we didn't support it, but CSV is far from a safe way to transport data IMO. Each time a new requirement to deal with CSV comes in I treat it as a custom format that may or may not be something like RFC4180.
It is a nice addition, but I would like to see this taken further: structural regular expression awk. It has been waiting to be implemented for 35 years now.
>A big thank-you to the library of the University of Antwerp, who sponsored this feature. They’re one of two major teams or projects I know of that use GoAWK – the other one is the Benthos stream processor.
That's great to hear.
Are you planning to add support for xml, json, etc next? Something like Python's `json` module that gives you a dictionary object.
I'm not considering adding general structured formats like XML or JSON, as they don't fit AWK's "fields and records" model or its simplistic data structure (the associative array) very well. However, I have considered adding JSON Lines support, where each line is a record, and fields are indexable using the new @"named-field" syntax (possibly nested like @"foo.bar").
For non-awk tools, csvformat (from csvkit) will unquote and re-delimit a CSV file (-D\034 -U -B) into something that UNIX pipes can handle (cut -d\034, etc.). It's worth setting up as an alias, and you can store \034 in $D or whatever for convenience.
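Something along these lines (a sketch; -U takes a quoting level in the csvkit versions I've seen, with 3 meaning "quote none", so check csvformat --help for yours):

    D=$'\034'                                 # ASCII field separator, octal 034
    csvformat -D "$D" -U 3 -B data.csv | cut -d "$D" -f 1,3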
During a recent HN discussion on pipes and text versus structured objects to transfer data between programs, I started wondering if CSV wouldn't be a nice middle ground.
I phrase that carefully. "Better"? "Worse"? Very subjective. But in the current environment, "likely to beat out CSV"? Oh, most definitely yes.
A solid upside is a single encoding story for JSON. CSV is a mess and can't be un-messed now. Size bloat from endless repetition of the object keys is a significant disadvantage, though.
I'm not familiar with F#, but I do hate CSV tools that try type inference on data; in my opinion the csvkit tools should have the -I option on by default.
The thing is, I've used .NET a lot, and C# and F# I can code in my sleep. The same object system, integrated in PowerShell, makes it really hard to use.
I can't tell whether the UNIX people have lost their way, or just the demands of modern shell scripts cannot be met by typical shell philosophy - that is, piping together the output of small, orthogonal utilities.
The emergence and constantly increasing complexity of these small, bespoke DSLs like this or jq does not inspire confidence in me.
> demands of modern shell scripts cannot be met by typical shell philosophy
That. Pipes and unstructured binary data aren't compositional enough, making the divide too large between the kinds of things you can express in the language you use to write a pipeline stage and the kinds of things you can express by building a pipeline.
>In general, using FPAT to do your own CSV parsing is like having a bed with a blanket that’s not quite big enough. There’s always a corner that isn’t covered. We recommend, instead, that you use Manuel Collado’s CSVMODE library for gawk.
A good and useful addition. There's a mention of CSVMODE, a gawk library. I wonder if it could be extended to support the functionality that goawk's `-i csv` has.
Obviously, this does not tell the whole story, as this test was limited to "count" and an interpreted language is expected to always be slower than a precompiled command, but it might be relevant to a user deciding which tool to use. It also might be instructive as to the room for improvement in the go code (or possibly the go code could use the c lib); I note that even if the goawk command is '{}' the runtime is still about the same.
full results:
---
goawk:
1000001 7000007 real 0m0.435s user 0m0.435s sys 0m0.031s
1000001 7000007 real 0m0.413s user 0m0.419s sys 0m0.024s
1000001 7000007 real 0m0.425s user 0m0.430s sys 0m0.024s
xsv:
1000000 real 0m0.157s user 0m0.141s sys 0m0.013s
1000000 real 0m0.156s user 0m0.141s sys 0m0.012s
1000000 real 0m0.158s user 0m0.142s sys 0m0.013s
zsv:
1000000 real 0m0.066s user 0m0.053s sys 0m0.010s
1000000 real 0m0.077s user 0m0.060s sys 0m0.012s
1000000 real 0m0.069s user 0m0.056s sys 0m0.010s
python:
1000001 7000007 real 0m1.589s user 0m1.553s sys 0m0.026s
1000001 7000007 real 0m1.583s user 0m1.550s sys 0m0.025s
1000001 7000007 real 0m2.122s user 0m1.675s sys 0m0.037s
---
The script for this was:
---
echo 'goawk:'
(time goawk -i csv '{ w+=NF } END { print NR, w }' < worldcitiespop_mil.csv) 2>&1 | xargs
(time goawk -i csv '{ w+=NF } END { print NR, w }' < worldcitiespop_mil.csv) 2>&1 | xargs
(time goawk -i csv '{ w+=NF } END { print NR, w }' < worldcitiespop_mil.csv) 2>&1 | xargs
Thanks for that! zsv looks amazing. Yeah, it's definitely going to whip GoAWK, what with SIMD parsing and careful memory handling. I've done a couple of basic things for GoAWK's CSV performance, but haven't profiled or looked at allocation bottlenecks (absolute performance definitely wasn't my first focus).
Yeah, sorry about huge.csv -- I found it online somewhere originally by searching for something like "large csv example", but can't for the life of me find it now. It's a monster 1.5GB file with 286 columns including quoted fields (whereas worldcitiespop only has a few columns and it doesn't look like it has quoted fields). I can upload to a file transfer service and send a link to you if you want ... though I should really update my benchmarks to use an easily-downloadable file instead.
If you're just using AWK features, mawk is still the fastest. GoAWK is faster than awk in most cases, on a par with gawk, but still not as fast as mawk (see https://benhoyt.com/writings/goawk-compiler-vm/#virtual-mach...) ... except for scripts that make heavy use of regexen. Unfortunately Go's regexp package is still quite slow. You could also try frawk, which is a JIT-optimized AWK written in Rust -- it's really fast, and shares some of the CSV features (but not the @"named-field" syntax).
But for most everyday usage, even for large inputs, GoAWK's performance is quite sufficient. The CSV support, and its use as a Go library -- they're more important to me than raw speed at this point.
You could just replace those commas within the quotes with some other string, or remove them, using some awk or sed. Here is a somewhat relevant example from stackoverflow:
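(Not the stackoverflow example referred to above, but the usual idiom is something like the following: split the line on double quotes so every even-numbered field is inside quotes, and strip commas only there. Doubled "" escapes and embedded newlines will break it.)

    awk -F'"' -v OFS='"' '{ for (i = 2; i <= NF; i += 2) gsub(/,/, "", $i) } 1' file.csv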