How to Clean Text Data at the Command Line (ezzeddinabdullah.com)
84 points by ethink on Oct 18, 2020 | 51 comments



I... am... just baffled at why Docker is used. Are any of these not normal Unix utilities? What dependencies are you pulling in beyond GNU/Linux... which is where containers come from? I'm only even putting "GNU" here because it's literally every fucking Linux distro on the planet that ships the utilities referenced. Is it commonplace nowadays to bloat a Docker image with text data or something? Is Docker only there for the csvlook tool?

Not trying to be a supermassive asshole, but I do feel like I'm losing my mind reading this... In order to use GNU/Linux one must install Linux, set up a container runtime, and then! Use the same utilities available to them in the runtime... from a container?

I feel like if this is the state of a tutorial for someone new to these utilities we're all pretty fucked and should probably take a step back to re-evaluate the mess we've made. I'll go crawl back into my hole now to do just that.


I was also baffled. However, if you follow the link to the "why we use docker" post you'll see the author is using a Mac and had difficulty before while following instructions written for Linux.

Data scientists using Macs for everything and sometimes needing to run Linux in a container to get work done is more common than you might think.


    brew install coreutils
Lot less complexity than running things in docker.
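
If memory serves, you can also get the unprefixed GNU names by putting brew's gnubin directory on your PATH (the path below is the usual Homebrew layout, worth double-checking on your machine):

    export PATH="$(brew --prefix coreutils)/libexec/gnubin:$PATH"
    sort --version   # should now report GNU coreutils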


Some programs have different flags and behaviors between macOS and Linux. This has bitten me quite a few times. Even on Linux there are often competing versions (the BSD and GNU versions might have different flags, for example).


Was just about to say. Which text-editing powerhouses from Linux (sed, awk, grep, etc.) are not available on the Mac? The answer is: none.


There are some rather archaic behaviors in the versions of these tools that come by default on macOS.

For instance, on a Mac the stock 'sed' doesn't understand '\t' as a tab escape the way GNU sed does:

https://stackoverflow.com/questions/40311339/using-sed-on-ma...
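
A rough illustration from memory (BSD sed treats backslash-t as a literal 't', so you have to hand it a real tab, e.g. via the shell's $'\t'):

  printf 'a\tb\n' | sed 's/\t/,/'     # GNU sed: a,b    BSD sed: line unchanged
  printf 'a\tb\n' | sed $'s/\t/,/'    # works with both: a,b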


Oh, very likely true. But nothing stops you from just replacing them with the GNU versions. Still way easier than running a Docker container...


You do, however, now need to tell all the users of your tutorial to go and install brew, then install coreutils, and then replace every use of "sort" with "gsort", etc.

Running a Docker container, regardless of what it does under the hood, is extremely easy on a Mac. You install a program, then do "docker run ...".


    "Able to install packages"
Hmmm no that don't look as good on my resume as "autodidact docker wizard"

There's something to be said for the cool-new-tech driven approach, when it comes to resume building.


That won't replace the utils though, will it? You get ones with a g prefix as far as I remember.


Professional data scientists should probably do this stuff on company servers, not on their laptops.


I was also wondering why it starts with Docker. But since this post is targeted at data science folk (or people becoming data scientists), maybe the idea is to also support Windows users?


If you are interested in platform independence, why even bother with the Unix utilities? You can accomplish the same with Python, have results that are more readable to the non-Unix crowd, and not rely upon sometimes esoteric knowledge of the Unix utilities (and the inconsistencies between vendors).

That isn't to say that the Unix utilities are without merit. I have used them extensively for similar tasks. That being said, it is in environments where everyone used some version of Unix (typically Linux, Solaris, or Mac OS X).


> If you are interested in platform independence, why even bother with the Unix utilities? You can accomplish the same with Python,

Because then you're also looking at getting people using the same version of python and your dependencies - and it's likely slower.

> That being said, it is in environments where everyone used some version of Unix (typically Linux, Solaris, or Mac OS X).

If only there was a technology aimed at sharing an exact same version for easy use ;)


Would it not work in WSL though? If you wanted windows support.


Sure. It was just an idea as to why Docker was chosen :)


Yep, they all come with WSL, since you basically install Ubuntu (or another WSL-compatible distro).


Docker 'can' be used as a way to provide a shell which has a consistent set of programs installed. (As a bonus, using Docker this way lets you avoid 'polluting' the host OS with various packages.)

Sure, in this case, where it's just coreutils + csvlook, the benefit from the added abstraction isn't so great. Using Docker like this makes more sense when there are more dependencies. (Maybe it's just easier to say "use Docker" than to explain how to install a dependency on different OSes?)

In general, I think it's fine for people to be able to solve problems without having a full understanding of the mechanisms involved in getting the tool to work. If they need to know it later, they'll learn it then. This is a top-down approach -- but, naturally, there'll be some awkward solutions along the way.


No. I’m sorry, but I disagree in this case. These tools are literally installed on every POSIX machine on the planet. The whole point here is that there are no dependencies; this IS the "system." It’s like trying to explain how an engine works by only looking at the turbo: "Oh yeah, gotta have a turbo or your car won’t drive." ... You mean I gotta have an engine or I can’t even use a turbo?

The whole point of my post is that running a container orchestration system to accomplish any of this, for someone new, is insane. It is absolutely 100% only adding complexity and making things more confusing. How many people will not realize these utilities are on every computer they use? How long until someone poisons these repackaged, uselessly opaque versions? This makes things easier for no one and only serves to complicate things by adding unnecessary complexity.

If someone is on a Mac, then they are now running a Linux VM, then running containers, then logging into said fucked up environment and then finally... running... awk?

macOS -> Linux -> container runtime -> SSH -> finally use awk, sed, head, or anything else. Why stop there? We can probably then start up QEMU from a container, boot macOS, and then simply repeat the process for maximum usefulness.


> If someone is on a Mac, then they are now running a Linux VM, then running containers, then logging into said fucked up environment and then finally... running... awk?

Err, no, they are not doing this. That's not how it works, for a start (there's no SSH here), but the user experience is literally:

Install a standard application, downloaded like any other.

Open terminal.

Run "docker run ..."

That's it.

> It is absolutely 100% only adding complexity and making things more confusing. How many people will not realize these utilities are on every computer they use?

They aren't on every computer, and they are different on different computers; the author not only talks about this in the linked explanation but describes how it bit them.


Docker for Mac runs in a VM, so that point still stands. Just because the complexity is hidden away behind a pretty interface doesn’t mean it isn’t there.


> Just because the complexity is hidden away behind a pretty interface doesn’t mean it isn’t there.

In the context of making things hard for a tutorial or complex for a beginner, it not being visible is the key point. Just like the complexities of how your CPU actually works aren't relevant for this tutorial.


This is a very strong response to what seems like an obvious point clearly laid out at the start.

This is a tutorial and set of examples, and it comes with a fully working environment with example data exactly set up to work. No difference between gsort and sort, no version issues, no worrying about whether someone is using brew or MacPorts or anything else. No Python environment problems or pip vs pip3 or whatever other random crap you need to deal with when trying to do something simple across the varied setups people have.

You are entirely free to not use docker.

> In order to use GNU/Linux one must

Nope. You can. You don't have to.


> This is a very strong response to what seems like an obvious point clearly laid out at the start.

The point is invalid and clearly untrue.

> This is a tutorial and set of examples, and it comes with a fully working environment with example data exactly set up to work. No difference between gsort and sort, no version issues, no worrying about whether someone is using brew or MacPorts or anything else. No Python environment problems or pip vs pip3 or whatever other random crap you need to deal with when trying to do something simple across the varied setups people have.

I really don't see how having to get Docker going is alleviating any of the above headaches but I'm far removed from these tools being "new to me." Setting up virtualization to run a process on my local machine which I can run without virtualization or containerization the very same way seems crazy to me. It seems like an awful amount of complexity to push onto beginners for what is likely to be zero benefit and gives them only what they had at the start...

> You are entirely free to not use docker.

>> In order to use GNU/Linux one must

> Nope. You can. You don't have to.

Yeah, that's my point exactly... you don't have to at all. In fact, Docker offers most literally nothing but opaque complexity. I was saying you MUST have Linux if you want containers, because containers are a Linux feature and don't exist on macOS, but all them sweet utilities are already there save the weird csv tool. I don't think the complexity of Docker is warranted for simply handling a CSV tool and I find it to be a disservice to anyone who is learning.

/shrug


What point is invalid? That it's a prepackaged setup for all users, to avoid having these kinds of problems? It's pointed out right at the beginning, and a longer description is linked to about how the writer hit exactly this kind of problem because the Mac has different basic utilities and its sed worked differently.

> I really don't see how having to get Docker going is alleviating any of the above headaches

Because it is a single thing to get working that then leaves all users of your tutorial, on all major platforms, dealing with the exact same environment. Have you ever tried getting end users up to speed only to discover that one is using Python 2.7, another has Anaconda but pip is pointing to the wrong place, pip3 is required for someone else, and one user is on a Mac so needs to install brew and then replace every standard call to sort with gsort? Then dateutils appears as some conflicting dependency...

> Setting up virtualization to run a process on my local machine which I can run without virtualization or containerization the very same way seems crazy to me

Then don't; you don't have to. Go and set up an identical environment. For those that have Docker this becomes a single command to run, and those that don't can get set up quickly. If they're heading into data science they're likely to benefit from having at least a basic understanding of Docker.

> It seems like an awful amount of complexity to push onto beginners for what is likely to be zero benefit and gives them only what they had at the start...

It doesn't give zero benefit, it removes a whole bunch of differences in setup that are entirely irrelevant to the tutorial.

> Docker offers most literally nothing but opaque complexity.

It offers a standard environment that puts all users in the same place.

> but all them sweet utilities are already there save the weird csv tool

No they are not, not on a mac and not on windows.

> I don't think the complexity of Docker is warranted for simply handling a CSV tool

What complexity? 'docker run --rm -it X bash' is literally all I need to do.
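
And if you want your local data visible inside the container, it's one extra flag; 'X' here is still a placeholder for whatever image the tutorial uses:

  docker run --rm -it -v "$PWD":/data -w /data X bash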


csvlook/csvkit, apparently; something I had never heard of before, which is actually not installed by default in Ubuntu. He also included the data itself, which is pretty helpful, but which could more usefully have gone into a git repo somewhere.

Also "kids these days" all reach for docker. Otherwise any sane person would have just used cat instead of csvlook.

Upvoted the kid's article, because command line text cleaning is of TOWERING IMPORTANCE to any DS person worth their salt. But he is definitely an inexperienced kid.


I think the csvlook utility comes from the Data Science at the Command Line book the author cites. I read some of that book; there was good info in there, but ultimately it was a bit annoying because its author based a lot of things on utilities he had written himself (like csvlook), which in many cases just did trivial things you could easily have done with shell commands or coreutils.


It seems lots of people's knowledge of awk is limited to printing fields, and they'll happily chain awk with a bunch of grep and sed when a single awk invocation would do the job without fuss. For instance, TFA uses

  awk '{print $1","$2}' | sed '1i count,word'
when you can just add a BEGIN block:

  awk 'BEGIN { print "count,word" } { print $1","$2 }'
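
For example, on toy 'sort | uniq -c'-style input (numbers made up, just to show the output shape):

  printf '  12 apple\n   7 banana\n' | awk 'BEGIN { print "count,word" } { print $1","$2 }'
  # count,word
  # 12,apple
  # 7,banana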


Speaking of AWK, is there a more succinct version of the AWK programming language book that I can use for learning?

https://ia802309.us.archive.org/25/items/pdfy-MgN0H1joIoDVoI...


I wrote a book on GNU awk one-liners: https://learnbyexample.github.io/learn_gnuawk/

There's plenty of examples and exercises, plus an entire chapter dedicated to regular expressions.


On mobile; pardon the link format.

https://m.youtube.com/watch?v=43BNFcOdBlY

In an afternoon, you can learn enough to be more than dangerous.

The presenter drafted up a set of (easy, yet practical) companion exercises and a pt2 video with his own answers. I cannot recommend this talk enough.

Although I hadn't the opportunity to put the knowledge to use and have since forgotten it (I don't work in tech), I find comfort in knowing I can reacquire the power of awk even sooner than the already-short first time around.


GNU AWK User's Guide is reasonably succinct, while being fairly comprehensive.

https://www.gnu.org/software/gawk/manual/gawk.html

The manpage alone is quite useful, though it lacks some details.

The awk FAQ provides general background, including differences between implementations:

http://www.faqs.org/faqs/computer-lang/awk/faq/


You can use the notes of Bruce Barnett for quick lookups:

https://www.grymoire.com/Unix/Awk.html


Or do the case-conversion, minimum word-length restriction, and counting in awk itself, here using associative arrays:

  awk < file '
    {
      $0=tolower($0);  # lowercase conversion
      gsub( "[^a-z ]", "", $0); # remove non-alpha + space chars
      for(i=1;i<=NF;i++0) {
        # increment word count if word length > 2
        if(length($i)>2) count[$i]++ 
        }
      }; 
    END{ 
      # report
      for(word in count) printf( "%-20s %i\n", word, count[word])
    }
  ' | sort -k2nr -k1
This omits the header (can be trivially added). The external sort can be internalised in gawk using asort(), or by printing to sort via a command:

  cmd="print -2kr -k1"
  for<loop> printf( <args> ) | cmd
  close(cmd)
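
Another gawk-only route, instead of asort(): set the array traversal order so the END loop itself walks the counts in descending order (if I remember right, the magic value is "@val_num_desc"):

  END {
    # gawk extension: iterate the for-in loop by value, numeric, descending
    PROCINFO["sorted_in"] = "@val_num_desc"
    for(word in count) printf( "%-20s %i\n", word, count[word])
  }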


If you're going to those lengths then you might as well write it in Go or Python and get the performance boost on larger datasets.

However, the point often missed when people see awk piped into other CLI tools is that the code is intentionally optimised for one-time writing rather than for maximising awk's usefulness.

It's the same with the GP's comment too. In that example the awk-only code was longer than the awk|sed equivalent.

Don't get me wrong, I am a big fan of awk myself. But sometimes people get so hung up on better usage of awk that they lose sight of the point behind the command line example.


For the code here, which I used on an actual dataset (~500 NYT Headlines 1965--1974), the efficiency gains of a Go / Python rewrite are ... slight.

  real    0m0.176s
  user    0m0.060s
  sys     0m0.060s
On an early-2015 Android tablet running Termux.

I've thrown multimillion-row datasets at awk (usually gawk, occasionally mawk, nawk on OSX, and, hell, busybox on occasion) without any practical performance issues. I'm virtually always writing for one-off or project-based analyses, not live web-scale realtime processing. A second or even ten won't be missed.

I'm also aware that building a pipeline out of grep / cut / sed / tr / sort / uniq / awk is often conceptually nearer at hand. It almost always mirrors how I start exploring some dataset.

But a quick translation to straight awk gives cleaner code, more power, easier conversion to a script, and access to a small library of awk-based utilities I've accumulated.

All with far-more-than-adequate performance.

We've spent far more time discussing this than coding, let alone running, it.


> For the code here, which I used on an actual dataset (~500 NYT Headlines 1965--1974), the efficiency gains of a Go / Python rewrite are ... slight.

500 records isn't a large dataset. Not even close. Large would be on the order of millions to billions of records. And yes, I have had to munge datasets that large on many occasions.

> But a quick translation to straight awk gives cleaner code, more power, easier conversion to a script, and access to a small library of awk-based utilities I've accumulated.

- cleaner: only if you find awk readable. Plenty of people don't. Plenty of people find Go or Python more readable.

- more powerful: again, it depends. You don't get the parallelism that a pipeline gives you for free (each stage runs as its own process). And Python and Go are undoubtedly more powerful than awk. I'm not knocking awk here, just being pragmatic.

- easier conversion to a script: at which point you might as well skip awk entirely and jump straight to Go or Python (or any other programming language)

- and access to a small library of awk-based utilities I've accumulated: that only benefits you. If you're having to sell the benefits of awk to someone then odds are they don't have that small library already to hand ;)

> We've spent far more time discussing this than coding, let alone running, it.

Some of the larger datasets I've had to process have definitely taken longer to munge than my reply here has taken to type :)

Disclaimer: I've honestly not got a problem with awk; I used to use it heavily 20 years ago. But these days its value is diminishing, and a lot of the awk evangelists seem to miss the point of why awk isn't well represented in blog posts any more. It's both more verbose than pipelining to coreutils and less powerful than a programming language -- it's that weird middle ground that doesn't provide much value to most people aside from those who are already invested in the awk language. Some might see that as a loss, but personally I see it as demonstrating the strength of all the other tools we have at our disposal these days.


I don't have your scale problems. You might care to not impose them where they don't exist.


Don't be so ridiculous. You're the one imposing arbitrary problems by saying "everything should be written in awk" and then writing an example that was an order of magnitude longer, in both character count and execution time, than the examples you were suggesting were wrong.

All I'm doing is citing a few reasons why someone might prefer a terser pipeline, but I hadn't realised this wasn't supposed to be an objective conversation, and since I have no interest in engaging in pointless language fanboyism I'm just going to leave you to it.


"everything should be written in awk" is a rather dubious translation of:

a quick translation to straight awk gives cleaner code, more power, easier conversion to a script, and access to a small library of awk-based utilities I've accumulated.

https://news.ycombinator.com/item?id=24822697

You're not discussing what I've actually written.

Good day.


I addressed those points directly, which you ignored when you accused me of arguing use cases specific only to me. Suffice it to say, your complaint was hypocritical given that your statement previous to that was very specific to your use case (as I also pointed out in my aforementioned, but ignored, comment).


When you have such a large awk script, you may as well also demonstrate that Awk is a “real scripting language” by putting it into its own script with a `#!/usr/bin/awk -f` shebang line. Much easier to edit than trying to get your shell to do it. (Unless you’re using a fancy editing shell like Emacs’ shell mode; but at that point, why would you need Awk? You probably know far better how to use your own shell-runtime’s text-manipulation primitives.)
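
i.e. something like this (the file name is just an example, reusing the one-liner from upthread):

  #!/usr/bin/awk -f
  # wordcount.awk: chmod +x, then run as  ./wordcount.awk file
  BEGIN { print "count,word" }
  { print $1","$2 }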


Sure. Though such scripts often begin as shell one-liners.

This one did. Never left it, as it happens.

C-x C-e FTFW.

(I expanded, indented, and commented the example for HN.)


I beg to differ. I did a lot of CSV wrangling on Unix. CSV is a beast. My tool of choice ultimately was miller, an absolutely underrated tool:

https://github.com/johnkerl/miller
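
From memory (so double-check the flags), a typical invocation reads CSV, sorts by a numeric column descending, and pretty-prints; the file name here is made up:

  mlr --icsv --opprint sort -nr count headlines.csv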


Totally agree. I think mlr is a wonderful CLI tool. It is very robust, and can handle/convert multiple tabular formats including csv, tsv, json, fixed-format, etc. It has pretty decent text output formatting with the --opprint flag.

I used to be very comfortable using awk/sed/perl/sort/uniq/tr/tail/head from the CLI for the sort of data cleaning this article is talking about. However, over the past year I've found I use VisiData https://github.com/saulpw/visidata for interactive work.

If I need to clean up the data first, I'll use mlr or jq as input to Visidata. If my data is too dirty for mlr, then I'll use Unix toolbox tools mentioned as input to mlr, jq or VisiData.

VisiData provides some ability to script, but when possible I prefer to have the shell do the scripting with all the tools mentioned as input to Visidata.


Oo, I've not come across this one before. Looks useful! In a similar vein of CSV tools that seem to be underrated, I recently came across Daff which does excellent tabular diffs.

https://github.com/paulfitz/daff


Hey, I'd like to share How to Clean CSV Data at the Command Line: https://www.ezzeddinabdullah.com/posts/how-to-clean-csv-data...


FYI, the best blog about data cleaning at the command line is https://www.datafix.com.au/BASHing/.


Thanks for sharing, will check it out :)


I’ll probably catch a lot of flak, but there's a lot of criticism in these comments for an author who actually finished something. It's easy to sit in the cheap seats and cast stones, but this guy made something that works, and you can always make your own if you know better, without tearing down the work of others.


Thanks for replying. I really appreciate many of the comments, and I respect all opinions except insults. I'm still learning and always will be, and I consider my writing a way to learn more about the field.



