Decoded: GNU Coreutils (2018) (maizure.org)
261 points by mr_o47 on Sept 8, 2023 | 123 comments



"Many of these utilities are approaching 30 years old and include revisions by many people over the years."

"They are not designed for long life or to scale beyond their role."

Would love to see some examples from the author of programs he believes are "designed for long life" that have been around 30 years.

Or even ones he thinks will be around for 30 years.


Well, what exactly is their role? Is there a limit to what we can do with these things?

To test a little programming language I made, I created a testing framework with bash and coreutils. I felt guilty about not using a "proper" language at first but it works so well. In parallel too.

I found that the only thing I couldn't test was the argv[0] of the program. No matter how much I twisted the programs, I couldn't get them to do exactly what I wanted. So I sent a feature request and a patch to coreutils to give env this feature:

https://lists.gnu.org/archive/html/coreutils/2023-08/msg0006...

Looks like it's gonna make it in. A new feature for this old program.



Not part of coreutils though. I want to keep my dependencies to a minimum. The GNU coreutils are ubiquitous.


Hey if this makes it congratulations! I've never contributed to something so ubiquitous before but often thought how satisfying it'd be.

Like in years down the line, maybe as you retire and hang up your keyboard, you could sit back and smile as you realise your code is still deployed on millions, possibly billions of devices? That the code could far outlast 99.9% of the code anyone has ever written?


Well, I don't know if they're gonna use my patch. I reported a bug to gpg once and sent a patch in but in the end the maintainer rearchitected the code and gave me attribution. Perhaps they require copyright reassignment?

I can't deny I'm gonna be really happy if they do use it.


We generally give credit if at all possible. Even if a patch is extensively adjusted, we'll commit under the original author's name


You could’ve used bash’s exec which allows specifying argv0


I did try it. Didn't work in combination with env.

  # Sets PWD and SHLVL and the latter apparently can't be unset
  env -i VARIABLE=value bash -c 'exec -a program ./program'

  # Sets env's argv0, not my program's
  bash -c 'exec -a program env -i VARIABLE=value ./program'

  # SHLVL still set
  env -i bash -c 'unset SHLVL; unset PWD; exec -a program ./program'

  # SHLVL and _ still set
  env -i zsh -c 'unset _ HOME PWD LOGNAME SHLVL OLDPWD; ARGV0=program
The original discussions, linked from my post:

https://lists.gnu.org/archive/html/coreutils/2023-03/msg0000...

https://lists.gnu.org/archive/html/coreutils/2023-03/msg0001...
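For anyone following along, here is a minimal sketch of what bash's exec -a does on its own, assuming bash is on the PATH. The name "program" is just the placeholder from the commands above, and the inner bash -c stands in for the real binary: it reports its own argv[0] through $0.

```shell
# exec -a NAME replaces argv[0] of the command that exec runs.
# When no arguments follow the -c string, bash takes $0 from argv[0],
# so the inner shell echoes back whatever name it was given.
argv0=$(bash -c 'exec -a program bash -c "echo \$0"')
printf '%s\n' "$argv0"
```

This only shows the argv[0] half of the problem; as noted above, combining it with a fully empty environment is where it falls short.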


Why do you need to remove all of these from the environment?


To test my programming language. It's a freestanding lisp interpreter that doesn't link to libc. I wrote the code that handles the environment variables and in order to test it I needed full control over the program's inputs including its environment. The env utility provides this control by emptying the environment and setting only the variables I specify, solving 90% of the problem. Only thing I still can't control is argv[0]. With this new feature upstreamed, my test suite will be complete.

Here's the code if you wish to take a look:

https://github.com/lone-lang/lone#testing

https://github.com/lone-lang/lone/blob/master/scripts/test.b...

Without env -i, the following tests would not be possible:

https://github.com/lone-lang/lone/blob/master/test/linux/env...

https://github.com/lone-lang/lone/blob/master/test/linux/env...

https://github.com/lone-lang/lone/blob/master/test/linux/env...
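As a rough illustration of the control env -i gives (a sketch, assuming an env binary on the default search path; VARIABLE=value is the same placeholder used earlier in the thread), the child process sees only the variables named on the command line:

```shell
# env -i starts the child with an empty environment; only the
# assignments given on the command line are passed through.
# The second env simply prints the environment it received.
clean_env=$(env -i VARIABLE=value env)
printf '%s\n' "$clean_env"   # prints: VARIABLE=value
```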


Glibc provides ld.so which can:

    --argv0 STRING        set argv[0] to STRING before running
Maybe that could be useful?


I did not know about that possibility! It requires glibc and its dynamic linker though, doesn't seem to be as widely available as the GNU coreutils. I developed lone inside Termux: it does not have ld.so but does have env. There's a GNU binutils ld but it did not recognize the --argv0 option when I tried it.


ld.so is the dynamic linker. It is always used when invoking a dynamically linked program, but you can invoke it manually, which allows you to specify options for the linker. But the name and path you need to invoke depend on the architecture and possibly distro, e.g. for amd64 these days it's

/lib64/ld-linux-x86-64.so.2 --argv0 argv0isalie yourprogramm

ld is the (not dynamic) linker. It doesn't invoke your program at all.


I think that was meant to be understood in context as "meant to be actively developed" (as opposed to maintained) for a long time. The practices the author lists are ones that in larger programs would typically be criticized for damaging maintainability as the program grows.


So to be rewritten from scratch every 2 years. CADT strikes again. /s


I had to look this up... CADT == Jamie Zawinski's "Cascade of Attention-Deficit Teenagers"


I think perhaps the author meant "long development life." As in, they are basically write-once utilities. Part of the benefit of doing "one thing well."


weirdly I interpreted "long life" as a continually running program vs something that had a very clear input, short lived execution, and a clearly defined output.


Emacs?


TeX


See also:

* How the GNU coreutils are tested: https://www.pixelbeat.org/docs/coreutils-testing.html

* Exploration of each of the coreutils commands: https://ratfactor.com/slackware/pkgblog/coreutils

* Command line text processing with GNU Coreutils: https://learnbyexample.github.io/cli_text_processing_coreuti... (my ebook that covers 20+ text processing tools)


Related:

Decoded: GNU Coreutils (2018) - https://news.ycombinator.com/item?id=29871037 - Jan 2022 (7 comments)

Decoded: GNU coreutils (2019) - https://news.ycombinator.com/item?id=26411908 - March 2021 (38 comments)

Decoded: GNU Coreutils - https://news.ycombinator.com/item?id=20328650 - July 2019 (55 comments)


Fun fact: if you install coreutils from Homebrew on macOS, since macOS already ships with od(1), od from coreutils is installed as god(1).


Well of course it is, the "G" lets you distinguish the Gnu "od" from the OG "od"


I noticed at least one error, if the author is here. The short description on the shred page[0] is actually the description for csplit[1]. It should be something along the lines of "overwrite a file to hide its contents, and optionally delete it".

[0]: https://maizure.org/projects/decoded-gnu-coreutils/shred.htm...

[1]: https://maizure.org/projects/decoded-gnu-coreutils/csplit.ht...


Cool, I didn't know this existed. I think simple ones like `yes` can be very interesting just to see how the base code of a utility (that writes to stdout) is written.

https://maizure.org/projects/decoded-gnu-coreutils/yes.html


Replying to myself because the thread got trolled:

My point was in part that it's valuable for even a simple utility to be well written and optimized and that it's nice to have these minimal examples to learn about how to, e.g. write output very quickly. The program is so short that presumably the number of lines is unimportant, and if the author knows how to do it they might as well make it as fast as possible so it's never in the way, and so we can learn from it. That's why I think it's a good example.


also quite interesting to compare with other implementations from the present and the past:

https://github.com/openbsd/src/blob/master/usr.bin/yes/yes.c

https://github.com/freebsd/freebsd-src/blob/release/4.0.0/us...


here is a Go implementation for fun:

    package main
    
    import "flag"
    
    func main() {
       yes := flag.String("m", "y", "message")
       flag.Parse()
       for {
          println(*yes)
       }
    }


Yeah, GNU yes could be about that simple but it's a good deal more complex to obtain the best performance possible:

https://www.reddit.com/r/unix/comments/6gxduc/how_is_gnu_yes...


um, who cares? most times people are needing to use "yes" as part of a "configure" script or similar. even if GNU yes was 10 times slower, it would NEVER be the bottleneck in any situation. so whats the point?


I optimized yes because:

* The fast version is still simple enough

* The general functionality being provided is to output arbitrary data repeatedly, and it can be useful to do this as fast as possible


I presume somebody had a need, and scratched the itch.

With modern systems we can have 8 channel RAM and 128 PCIe lanes to feed a system stuffed with NVMe drives. The amount of throughput that can be obtained is nuts, and at that point all sorts of weird things can become an unexpected bottleneck.

This applies even in consumer systems. Suddenly your game loads far slower on a NVMe than it could because it never occurred to the programmer that instead of the disk, the JPEG decoder can become a bottleneck when you can read compressed data at 7 GB/s.


a decade or so ago, I ran into a problem where "yes" was already too fast, feeding it to apt-get install or something (in a context where we knew what the prompts were doing and if anything unusual happened we'd fail other checks immediately) and it had some clever input buffering going on... that wasn't clever enough and locked up on getting a megabyte of yes'es on the first read() call.


Check out 'stdbuf' perhaps?


I've heard before that a lot of GNU programs are stretched like this so that they don't resemble anything from proprietary AT&T unix. E.g. all the extra options in basically everything and the ridiculous optimization of things like yes.


I remember reading that the GNU project was rather keen on avoiding and removing artificial restrictions. In other words, if a feature was useful and reasonable then it should be seriously considered for inclusion. As for the aggressive optimizations that are sometimes seen, I assume it’s just people wanting to improve a tool that’ll be used everywhere, more for fun and/or notoriety than anything else.


Because you can't imagine a case where it's needed, it must not exist? I also remember being a junior dev who thought they had all the answers.


You'll have to ask the author that. Now, I'd like to ask the author who cares how you implement 'yes' in Go ;-).


yes C lines of code: 112

yes Go lines of code: 9


It feels like you're making a bad-faith argument here. You can implement 'yes' in a straightforward way in a couple lines of C, too.

  main(int argc, char** argv) {
    while (1) {
      if (argc > 1)
        for (int i = 1; i < argc; i++) printf("%s%c", argv[i], (i == argc - 1) ? '\n' : ' ');
        else puts("y");
    }
  }
The point other folks are making is that it's written differently for a reason. Maybe not a reason that's important, but at the very least, let's try to compare apples to apples.


I don't see an include, which means you're either ignoring warnings or not printing them in the first place. Also testing against a number instead of a boolean. Also you have a horrible hack instead of proper flag parsing. And you're also abusing brace elision just to reduce LOC. And again abusing ternary syntax for the same reason.


If you start with code golf, as you did, then this is where you end up. The only way to win is not to play?


mine is just normal idiomatic Go code. thats not true of the C code.


It might be ugly, but that is idiomatic C code. C didn't even have boolean types until C99, and even then it's an "extension" of an integer type.

You could argue about the loop itself, after all K&R specified "for(;;)", but the other commonly used (ergo idiomatic) infinite loops use precisely the same number of lines. "while(1)" is a perfectly idiomatic manner to create an infinite loop.

Likewise a void return type for main was entirely legal until C99. The BSD yes(1) I have lying around only prints the first argument, so flag parsing? What flag parsing?

Yes in nine lines of C inclusive of preprocessor macro invocations and white space.

  #include <stdio.h>
  
  int main(int argc, char **argv) {
    const char *phrase = argc > 1 ? argv[1] : "y";
    while (1) {
      printf("%s\n", phrase);
    }
    return 0;
  }


I don't see proper flag parsing, I see an argv hack.


There are no flags in yes(1) ergo there's no need for "flag parsing". yes(1) takes one optional string as input, and that's exactly what argv provides. I'm not sure what you think "flag parsing" is bringing to the table here, but checking the array of command line parameters and accessing an element is pretty far from a hack.

If it's more comfortable you can also declare argv as an array of character arrays e.g. char *[], but that won't change the line count.


It's a lose for C either way. Either it cant parse flags, or we remove that requirement, and my code goes from 9 lines to 6.


The requirement to parse flags is your own. You can remove it from your go program. You only need to parse one string, if it exists.


[flagged]


C can parse flags and there is no requirement here for it to parse flags.


It's a lose for C either way. Either it cant parse flags, or we remove that requirement, and my code goes from 9 lines to 6.


C can obviously parse flags. See for example the plethora of C software that does so, such as many things from GNU Coreutils.

There is no requirement to parse flags for the basic functionality of yes. You implemented that yourself. You can remove your own requirement whenever you want. You don't need to parse a flag, at most you need to parse a string from the command line arguments.

I wonder at this point how you define "flag".


It's a lose for C either way. Either it cant parse flags, or we remove that requirement, and my code goes from 9 lines to 6.


Who cares though? We get it, you prefer golang, congrats?


I am not seeing a technical argument here against the previous points, only one against the commenter:

https://wikipedia.org/wiki/Ad_hominem


But you're also arguing in bad faith. Your go code is shorter, okay, but it doesn't do the same thing as the GNU yes code, so what point are you trying to make? I can also link to philosophy 101 wikipedia articles:

https://en.wikipedia.org/wiki/Straw_man


I think I have made it pretty clear already, but here it is again:

the Go code has MORE functionality (flag parsing) with LESS code. yes its not as fast, and yes the executable is larger, but for many, thats a good tradeoff for the extra standard library features, and the reduced LOC/code complexity. sadly as of yet, I haven't seen any cogent technical arguments against my points thus far.


> the Go code has MORE functionality (flag parsing) with LESS code.

Your code does not have more functionality than GNU's yes as written. It's less code you have to write because of the flag parsing code that has already been written, and it's incompatible with GNU's yes because yours requires -m to change the message.


> Your code does not have more functionality than GNU's yes as written.

it has flag parsing


Which does not do functionally more than the C version that was shared by inferiorhuman.

For an extremely simple utility like the 'yes' command that is compiled and distributed as a binary to trillions of installations what metric do you consider more important, size and speed? Or lines of code in the source? Think about this in engineering terms, everything is a tradeoff and it's your job to come up with the best solution.

I'm genuinely curious to hear your argument.


> I'm genuinely curious to hear your argument.

previous comments have demonstrated this not to be the case, so I will stand by my previous points. I have already made over 10 comments on this one topic, so if any aren't already convinced, they never will be, either because they disagree with the tradeoff, or they just have stockholm syndrome for C.


You've demonstrated nothing and made no discernable argument to anyone. Best of luck in the job search my friend.

Also, take a look at openbsd's version of yes

https://github.com/openbsd/src/blob/master/usr.bin/yes/yes.c


more lines of code, and still doesn't have flag parsing


There are no flags to parse. Why are you adding flag parsing? This would fail a junior interview Steven.

I mean, yes, go has proper flag parsing as part of the standard library and C doesn’t. Yes that’s going to make a line count difference but it’s also why code golf arguments are pointless.


> go has proper flag parsing as part of the standard library and C doesn’t

That's the whole point. Every single command line program needs command line parsing. Go helps me get the job done, C forces me to write my own parser, or find some third party one.


Yeah, but it’s horses for courses. The C version can be deployed in far more places and can be far faster than the Go equivalent. Which is “better” is a contextual judgement call. There’s plenty of weird architectures out there that run C and almost nothing else.


Yes takes a single string with no embellishment, and that's what C provides. There's nothing additional to parse. There are no flags, no additional options, nothing else to configure… and that's by design as there's simply no need.


It's a lose for C either way. Either it cant parse flags, or we remove that requirement, and my code goes from 9 lines to 6.


You can write a perfectly legible 4, 5, or 6 line version in C.


OK I am waiting...


On my mac the go version of yes creates a binary that's 30x larger than the yes binary I have on a linux machine. 2MB vs ~65KB.


This.

In fact, most embedded Linux thingies will be running the busybox version of the core utilities instead of GNU coreutils for binary size reasons.

In other words, even GNU coreutils is too big.


Right but that's mostly because it includes its runtime and the C version does not.

It's pretty easy to avoid this problem anyway - just combine multiple tools into one binary like Busybox does.

In any case the simplicity of the Go code is not really related to the binary size.


on mine it is 1,354 KB. I prefer the 10x LOC savings over a megabyte of hard drive, but you do you.


It's not about hard drive space, it's about start up time.

The 10x LOC (despite being a huge exaggeration) is also the fixed cost, not the marginal cost. You're only forgoing the main loop and includes, not any fundamental complexity.

This is also a funny argument coming from a Go programmer considering that Go trades off conciseness and expressivity for simplicity. Show me some of your favorite Go and I'm sure we can replace it with some concise C++.


> C++

I just physically shuddered


I bet you did.


But if everything was written in Go it might be 1 megabyte * the number of users of coreutils. Might be worth using a few more lines of code to save a little bit of space on a lot of machines.


That isn't an apples-to-apples comparison. A C version could be similarly short, but the GNU yes implementer got carried away with efficiency considerations. If the Go program did the same, it would be longer.


16 lines of go, with comments & whitespace


134 lines of C, with comments & whitespace


I'm not saying this is the reason, but unnecessary (or sometimes dangerous if used in real life) optimizations have been created to speed up QA runs (of course, that only works if what you optimize away is not part of what you're trying to test), e.g., libeatmydata.

But for 'yes', I'd agree with you, though I guess the answer to "who cares?" is that whoever wrote it cared. They could have a legit performance reason or may just have done it to show they could.


Here's a Brainfuck implementation:

    ++++++++++[->++++++++++++>+<<]>+[.>.<]
(Unfortunately, Brainfuck doesn't support command line flags.)


[.>.<] かわいい


(for those who can't read Japanese, and at risk of over-explaining, the parent called part of the program cute.)


FYI, before anyone else decides to read the idiocy that follows in this thread: this guy is a known troll with a long history of this exact behavior. Same person:

https://news.ycombinator.com/item?id=27862463

I think this person is just mentally ill unfortunately.


when people resort to doxing, it shows how truly pathetic they are. I pity those people.


Kind of a fun list of basic utilities. I've been using UNIX for a loooong time and had never heard of e.g. `shred`, `shuf`, or `factor`. Makes me want to try

  sudo find / -type f -exec shred {} \;
to see how far it gets before killing itself (on a VM or easily re-flashed machine of course).


> to see how far it gets before killing itself (on a VM or easily re-flashed machine of course).

I did this, but with dd -- it completed. Was very anticlimactic. I was hoping it would crash or at least disconnect me, but the kernel, sshd and bash were still in memory and happily returned me to a prompt where I couldn't really do anything.


There's where the real fun starts. Now try to recover from this extremely restricted shell!


Doing it at the block device level, you're gonna have stuff stick around in cache; file-wise would blow things up faster.


On Linux, it'd likely go until completion. You can't write to executables and libraries that are currently running (ETXTBSY), so shred can't trash either itself or find.


The exception is shared libraries on Linux. Shared libraries are mmapped into the address space of the executable that uses them. Back in the day, mmap used to support a `MAP_DENYWRITE` flag, which it no longer does - so you _can_ write to a shared library that is currently in use. I found this out the hard way when a shared library that I'd written (which maintained some state in its constructor) started crashing mysteriously anytime it was used by more than one process. It took two weeks of debugging to figure out why this was happening. [1] has more details about `ETXTBSY`, `MAP_DENYWRITE` and shared libraries.

[1] https://lwn.net/Articles/866493/


That doesn't seem right. I can delete nano's executable while it's running, and rm can remove itself.

What am I missing?


rm doesn't actually modify the file. It does an unlink which removes the link from the filename to the file.

Essentially, a file can have 0 or more names linked to it. As long as a file has at least one name or at least one process with it open, it will persevere.

By contrast, shred actually writes to the underlying file.
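A small sketch of that difference, assuming a POSIX shell with mktemp available (the temporary file and descriptor 3 are arbitrary choices for illustration): the data stays readable through an open descriptor after rm has removed the name.

```shell
# Create a scratch file inside a private temporary directory.
tmpdir=$(mktemp -d)
printf 'still here\n' > "$tmpdir/file"

exec 3< "$tmpdir/file"   # hold the file open on descriptor 3
rm "$tmpdir/file"        # unlink: the name goes away, the data does not
after_rm=$(cat <&3)      # the open descriptor still reads the contents
exec 3<&-                # closing the last reference frees the file
rmdir "$tmpdir"

printf '%s\n' "$after_rm"   # prints: still here
```

Running shred on the file instead of rm would have overwritten the bytes that descriptor 3 reads.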


Ahh, for some reason I completely missed that shred was involved.


True, but I wonder if it would first get stuck on trying to shred e.g. /dev/stderr or if after shredding the files in /etc some daemon woke up and tried to read a file there and choked.


I'd be curious what it did to /sys and /proc


I do like that /bin/true can actually fail and return false, which technically makes a "Not /bin/false" invocation more resilient: https://github.com/coreutils/coreutils/blob/master/src/true.... (and yes, I know it's the most unlikely thing, I just found it funny)


The other interesting thing about the true command is how much more complicated it got than it needed to be.

first an exercise

  touch mytrue
  chmod u+x mytrue
  ./mytrue
  echo "error code for mytrue is $?"
  
This is literally how true started life. Yes, it is very zen.

The first offense was legal. All code had to have a copyright disclaimer. Even an empty file? Yes. So now it was a file with a copyright disclaimer and nothing else. And the koan-like question that comes to mind is "Can you copyright nothing?" Well, AT&T sure tried.

Then somebody said our programs should be well defined and not depend on a fluke of Unix, which at this point was probably a good idea. So true finally had code. It was "exit 0".

Then somebody said we should write our system utilities in C instead of shell so they run faster. OpenBSD still has a good example of how this would look.

http://cvsweb.openbsd.org/cgi-bin/cvsweb/~checkout~/src/usr....

At some point GNU bureaucracy got involved and said all programs must support the '--help' flag, so that got added. Then they said all programs must support locale, so that got added. Nowadays GNU true is an astonishing 80 lines long.

https://github.com/coreutils/coreutils/blob/master/src/true....

Which is fine I guess, but that is a lot of code for a program that by definition "Does nothing, successfully"

http://trillian.mit.edu/~jc/humor/ATT_Copyright_true.html


Only if run with options, right? It looks like just running `true` is unaffected unless the comment is misleading me.


Yeah, I don't see it either. If being run without --help or --version, true can only ever return EXIT_SUCCESS.

However, I find it interesting that true and false use the very same implementation.


As a novice programmer trying to sharpen my grasp of how to fruitfully apply DS&A, are there any of these I should look at in particular?


The cleverness inside coreutils is mostly around choosing effective ways to interface with the kernel, e.g. using copy_file_range() instead of read()/write() to avoid having to copy the data into userspace.

It's more a software engineering endeavour instead of computer science.


Maybe I’m misunderstanding but what do data structures and algorithms have to do with CLI tools?


Don't CLI tools use data structures and algorithms?


Not massively. I mean, an array is a data structure and looping through it is an algorithm, but that’s a bit basic even for a 101 DSA course.


I might be missing a point of this site, but don't we have man (or info) pages for each of these?


Rather than usage information, this site details how the programs work internally.


Try to avoid gnu options and niche commands.

Stick to busybox commands as much as you can.


Honest question: why? I'm not really interested in portable scripts, usually I just want to get a job done as fast as possible and move on (for serious automation I prefer Python, shell is for more interactive use for me). Is there any upside to sticking to busybox commands in this case?


Wow, when I read that, I realize how much I have to teach, but there, it is worth restarting from scratch, it seems the basics are not even here.


Sounds like there’s still a little left for you to learn about not being pretentious.


That guy is starting from very far away; HN is not the place, as it is not enough to fill in that amount of blanks.

And the first thing you learn in real life is humility, but sometimes some people make you feel you know everything, like here... but I know very well that I am average/normal, which worsens his/her case even more. Yeah, a bit like John Snow.

This is a horrible feeling.

Not to mention, this post is border-line passive-aggressive... so...


Why do you feel that way? I see nothing that indicates that this person is clueless.


come on... if you don't see anything it means you would need the same amount of work...

what's this ???

Let's presume this is a chatgpt troll.



