Curl doesn’t spew binary anymore (haxx.se)
208 points by okket on June 17, 2017 | hide | past | favorite | 95 comments



I just looked at the commit and I was surprised at how nice the code is! The lines relevant to my question are:

  if(isatty && (outs->bytes < 2000) && terminal_binary_ok) {
    if(memchr(buffer, 0, bytes)) {
memchr() returns the first occurrence of the byte 0 (your second argument), or NULL.

So a few things:

* what if your output is more than 2000 bytes?

* what if your output is binary but doesn’t contain a byte 0?

* what if your output is a normal UTF-8 string but contains a byte 0? ( see https://stackoverflow.com/questions/6907297/can-utf-8-contai... )

This is interesting to me because I'm developing a tool that parses files in search of bugs, and I need to ignore binary files. What I'm doing right now: when checking a line, if it is not valid UTF-8 I skip the file. It's not great, as I'm doing this validation for every line of every file...


1. It is not sensible to check more than a small chunk of data, as it would result in significant performance penalties, both in high resource consumption (memory or temporary files) and in blocked pipeline (cURL streams data as it is received). Imagine `curl http://somesite/10GBoftext | grep "rarely-occurring-prefix" | do-something-with-found-data`—with a full scan, curl would use 10GB of memory, checking every byte before it sent things to grep and do-something. Without a full scan, do-something processes data live, and curl uses negligible resources.

2. Then the check yields a false negative, which is not a problem.

3. Then your UTF-8 string is unprintable, and the check will yield a true positive. The UTF-8/ASCII NUL character is not printable, despite being valid.

If one only assumes ASCII/UTF-8/Shift-JIS/similar, then a blob containing a null byte is guaranteed to be unprintable, while a blob not containing a null byte may be printable. That's good enough for a warning, telling you that you're doing something that you might not have intended.

Given that UTF-8 has become standard, it means that you will never realistically get a false warning, but may still get bonkers output. You can always overrule if you have a fetish for UTF-32.


> 1. It is not sensible to check more than a small chunk of data

You could just check the first X bytes. Also, I'm guessing curl doesn't print out to the terminal if the data is more than 2000 bytes anyway?

> 2. Then the check yields a false negative, which is not a problem.

the binary will be printed on the screen, that's a problem

> 3. Then your UTF-8 string is unprintable, and the check will yield a true positive.

How is it that the zero byte is part of UTF-8 then?


>> 1. It is not sensible to check more than a small chunk of data

> You could just check the first X bytes. Also, I'm guessing curl doesn't print out to the terminal if the data is more than 2000 bytes anyway?

Why wouldn't it? 2000 bytes is just 25 lines by 80 characters.


right, it sounded bad for some reason!


> You could just check the first X bytes. Also, I'm guessing curl doesn't print out to the terminal if the data is more than 2000 bytes anyway?

Of course it does. You can curl the concatenated contents of the Library of Congress to your terminal if you want to.

> the binary will be printed on the screen, that's a problem

No. Printing the binary to screen is the current behaviour in all cases; the goal of this change is to reduce how often that happens, as a quality-of-life improvement.

> How is it that the zero byte is part of UTF-8 then?

News flash: unprintable characters are part of Unicode. NUL is one of them.


Thanks for your non-answers :)


> How is it that the zero byte is part of UTF-8 then?

It is a valid code point, just not a printable character. Unicode encodes every character that is or was in common use, not just the printable characters; this includes the control characters at the beginning of the ASCII table.


Regarding your example in (1): your pipeline's output is not a TTY, so the null check is never performed.


> I just looked at the commit and I was surprised at how nice the code is!

Most programs detect binary files like this. Here's git's version of the same function: https://github.com/git/git/blob/master/xdiff-interface.c#L19...


> what if your output is more than 2000 bytes?

The announcement says that "curl will inspect the beginning of each download", and I think that comparison just turns off the check after at least 2000 bytes have already been output (see a few lines below the change you quoted, where outs->bytes is incremented by the amount of bytes that were output).
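A rough sketch of that streaming arrangement (hypothetical names, not curl's actual code): the output state carries a running byte count across write callbacks, and the NUL scan simply stops applying once 2000 bytes have gone by.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state carried across write calls. */
struct out_state {
    size_t bytes;       /* total bytes already written */
    bool terminal_ok;   /* user overrode the check (e.g. with -o -) */
};

/* Returns true if this chunk may be written to the terminal. */
static bool chunk_ok_for_tty(struct out_state *outs,
                             const char *buffer, size_t len, int is_tty)
{
    if (is_tty && outs->bytes < 2000 && !outs->terminal_ok) {
        if (memchr(buffer, 0, len))
            return false;   /* NUL found: looks binary, warn instead */
    }
    outs->bytes += len;     /* later chunks skip the scan entirely */
    return true;
}
```

Only the first couple of chunks ever get scanned, so the check costs essentially nothing on large streaming downloads.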

> what if your output is binary but doesn’t contain a byte 0?

I guess curl will incorrectly recognize the binary as text.

> what if your output is a normal UTF-8 string but contains a byte 0?

I guess curl will incorrectly recognize the text as binary, and you can use `-o -` to override that and output to the terminal anyway.


> I guess curl will incorrectly recognize the binary as text.

This doesn't sound like a really good test.


Looking for a NUL byte is the standard test that is also widely used by other tools such as grep or diff.


Interesting, I'm going to try to implement that in my file parser.
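A minimal sketch of that check for a file parser, assuming an arbitrary 2048-byte sample and a caller-decides policy for unreadable files:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Read the first chunk of a file and treat it as binary if it
 * contains a NUL byte — the same heuristic grep and diff use. */
static bool looks_binary(const char *path)
{
    char buf[2048];
    FILE *f = fopen(path, "rb");
    if (!f)
        return false;   /* unreadable: let the caller decide */
    size_t n = fread(buf, 1, sizeof buf, f);
    fclose(f);
    return memchr(buf, 0, n) != NULL;
}
```

One pass over at most 2048 bytes, once per file, is far cheaper than validating UTF-8 on every line.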


It may be nice, but it certainly is buggy:

  bool isatty = config->global->isatty;
  [...]
  if(!config)
    return failure;



Are you sure you can do without that null pointer check? I would have moved it to before the pointer dereference in the initialization of isatty. (If you fear, or know, that some compilers won't accept local declarations after statements, split the declaration and assignment of isatty, too.)


> memchr() returns the first occurrence of the byte 0 (your second argument), or NULL.

...What if the first byte is 0? That would cause the if condition to fail, and the output would be treated as plain text.


memchr returns a pointer, not an offset.
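To make that concrete: memchr yields a pointer to the matching byte, and a pointer to offset 0 of the buffer is perfectly non-NULL, so the check still fires when the very first byte is 0.

```c
#include <string.h>

/* memchr returns a pointer to the first matching byte, or NULL if
 * absent. A match at offset 0 is still a valid, non-NULL pointer, so
 * `if(memchr(...))` is true even for a leading NUL. */
static int contains_nul(const char *buf, size_t len)
{
    return memchr(buf, 0, len) != NULL;
}
```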


Ah, that explains it.


There is no other Unicode code point that will be encoded in UTF-8 with a zero byte anywhere within it.

From what I understand, this means that no wild UTF-8 string would have a NUL byte anywhere in it, no?


It means that a UTF-8 string can contain the "NUL" character, which is a single null byte. While valid UTF-8, it is still an unprintable character, making it "binary" for most intents and purposes.


It's worth noting that, while NUL is valid UTF-8, there's a lot of software which will fail strangely when presented with it, since UTF-8 is so often used to interoperate with NUL-terminated C strings. If you write such a string to a C string then it will be silently truncated.

I ran into this once with an Objective-C program that created filenames from strings found in files. When presented with a string containing NUL, the code ran, but didn't really work. I'd get "foo" and then ask it to append ".txt" and the result would come back as "foo" still! And it depends on the context in which you use it. Using it as a filename truncated, because you ultimately go through the POSIX-level calls that use C strings. Printing it in the debugger truncated, as evidently it used C strings at some point in that process. Displaying it in a text field in the UI worked perfectly fine, though, as apparently that path never uses C strings and the NUL character is just an invisible zero-space character in the middle.
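The truncation described above is easy to reproduce in plain C: the buffer is valid UTF-8 with an embedded NUL, but any C-string API stops reading at the first NUL, so everything after it is silently dropped.

```c
#include <string.h>

/* Anything that treats the buffer as a C string (strlen, strcpy,
 * open(2)'s path argument, ...) sees only the bytes before the first
 * NUL; the rest is silently lost. */
static size_t c_string_visible_length(const char *s)
{
    return strlen(s);   /* stops at the first NUL byte */
}
```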


I ran into this when writing a Docker API client in Nim via its Unix socket! Was very confused until I remembered that.


It means that UTF-8 text will only contain NUL bytes in the same situations that ASCII text does. That is, usually only in binary files.


> what if your output is a normal UTF-8 string but contains a byte 0?

It's not really a valid character to print to a terminal (most terminals ignore it), nor is it particularly valid in a "text file".

As for the rest of your questions: if the bytes in the file are randomly distributed, which is common with compressed binary files, the chance that there is no zero byte in the first 2000 bytes is about 0.04%, which seems low enough.
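That figure can be sanity-checked directly: under the uniform-randomness assumption, the probability that none of the first 2000 bytes is 0x00 is (255/256)^2000, roughly 0.04%.

```c
/* Probability that n independent, uniformly random bytes contain no
 * 0x00 byte at all: (255/256)^n, via repeated multiplication. */
static double p_no_nul_byte(int n)
{
    double p = 1.0;
    for (int i = 0; i < n; i++)
        p *= 255.0 / 256.0;
    return p;
}
```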


> if the bytes in the file are randomly distributed

This is rarely true, I believe, but binary formats do often use the 0 byte, since it's such a "practical" value to use (going with instinct here). So I'm guessing this test is "good enough".


I would also agree that the code looks great and quite "idiomatic C", but I'd prefer using 2048 and putting the memchr() in the same if with another &&.


What's the advantage of 2048 over 2000?


I think putting the memchr() in the same if would cause unnecessary calls to the memchr() function (and scanning of 2000 bytes) even when one of the other conditions is false! That would work great in a language like Ruby, not C.


C && is short-circuit-evaluated. Something like 'if (y && (k = x/y))' won't crash if y is 0.
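A compilable version of that guard, as a hypothetical helper:

```c
/* Demonstrates the short-circuit guard: with &&, the right-hand
 * operand is evaluated only when the left is nonzero, so x/y never
 * executes when y == 0 and the fallback value survives untouched. */
static int guarded_div(int x, int y, int fallback)
{
    int k = fallback;
    if (y && (k = x / y))
        return k;   /* y nonzero and quotient nonzero */
    return k;       /* division skipped (y == 0), or quotient was 0 */
}
```

The same rule is why curl's chain of conditions never reaches memchr() unless all the cheap checks before it pass.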


I think this is a bad idea. Trying to make programs intelligent and have them behave in unexpected ways usually is.

It will make curl fail if a 0 byte is outputted within the text. There are many reasons why this will happen. Software errors, UTF-Encoding, special use cases...

In fact, I run into this problem with grep from time to time and it is super annoying:

    > grep stuff somefile.txt
    Binary file somefile.txt matches

So this change makes the code base more complex and the behavior fuzzy.

I prefer elegant software that behaves predictably.


A NUL byte can't print in a terminal or show in an editor, so treating it as binary is a sensible thing to do when you are asking cURL to print it to your terminal. It is no longer text if a NUL byte is present (unless you use UTF-16/32, in which case, WHY?!).

I assume that you also inserted a NUL byte into your somefile.txt to make grep treat it as binary.

cURL behaves predictably now as well as before. It will warn you if you try to print unprintable data to your terminal, but can be told to ignore it, and won't affect output redirects.


Many legacy and contemporary programs change their output behaviour based on if they're outputting to a tty or to a pipe. Off the top of my head, ls (as old as you can get) and git (new) do not colourise their output by default for pipes, but do colourise for terminals. This is just curl trying to be a good citizen on the terminal, for which there is plenty of precedent.


This drives me slightly nuts, but I get why people do it. I really wish for some kind of smarter negotiation between pipes.

I often end up doing things like: watch --color juju status --color or "ls -c | less -r", etc.


Not just colour. The actual formatting and layout of ls output changes, too.


And it's sometimes helpful, but sometimes a big pain because you just wanted to pipe it into less.


> Software errors

Which could mess up your terminal, so curl is doing the right thing.

> UTF-Encoding

On Unix, terminals would be completely broken if they used UTF-16 or UTF-32. UTF-8 doesn't use null bytes for anything except NUL itself.

> special use cases

Such as?

> In fact, I run into this problem with grep from time to time and it is super annoying:

That's not applicable here, because grep only prints parts of the file, while curl prints out the whole file.


Everyone working with binary in the terminal should be told about stty sane. Very useful.


There is also reset(1).

https://linux.die.net/man/1/reset


My coworkers and I used curl, and the Linux kernel, as part of interviews. The candidate was asked to download the source to the Linux kernel during the interview; occasionally they would curl it to the tty. We didn't ding them for this, as I do it all the time myself. In fact, if anything, it provided a great opportunity to see whether they knew about reset, or understood the details of their mistake (e.g., why did it have so many odd effects on the terminal? why did the title change?). Surprisingly, the binary output would screw up tmux beyond what reset could repair; there would be characters printed outside the "terminal"'s area, for example. We suspect tmux has a few bugs because of this.


Fuzzing tmux's terminal would probably lead to some interesting bugs.


I have to agree. curling binary to shell is awkward, but nothing that can't be fixed with a `reset`.

Changing behavior of something that is as popular as curl to protect edge cases is strange, to say the least.



I am sympathetic to your comment because it feels like we are babying users. To me, this is no different than when FreeBSD and GNU modified the rm command to prevent people from shooting themselves in the foot.


I thought they changed it to prevent buggy scripts from executing rm with empty variables? That's not necessarily the fault of the user.


> To me, this is no different than when FreeBSD and GNU modified the rm command to prevent people from shooting themselves in the foot.

Which was the right change. It prevents data loss for negligible added complexity.


How can a program determine whether its output is being piped to a file/program, or the terminal?

Edit: It would appear that the answer is `isatty()`
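A minimal sketch of the usual pattern (hypothetical helper, mirroring what ls and git do): decide per file descriptor with isatty(), and only emit colour codes when that descriptor really is a terminal.

```c
#include <stdio.h>
#include <unistd.h>

/* Wrap `text` in green ANSI codes only when `fd` is a terminal;
 * isatty() returns 1 for a tty and 0 for a pipe, file, or bad fd. */
static const char *maybe_colour(const char *text, int fd)
{
    static char buf[256];
    if (isatty(fd))
        snprintf(buf, sizeof buf, "\033[32m%s\033[0m", text);
    else
        snprintf(buf, sizeof buf, "%s", text);
    return buf;
}
```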


Came here to say this. I can imagine a lot of people instead of doing -o were just redirecting output... if this changes the behavior of that, then a lot of people are not going to be happy when a curl upgrade breaks scripts.


It won't because 'isatty' returns false for a pipe. So that should not break anything unless you plan on sending binary output to your terminal which is a bad idea anyway, but they allow you to override the default as per TFA.


You can trick isatty, but only if you're very, very set on doing so. It requires allocating a pseudo-TTY and attaching it.

cURL already uses isatty checks elsewhere, such as to determine whether progress info should be printed (it only shows when you're not outputting the data to the tty).


Sure, but that's exactly what a pseudo-TTY is supposed to do, pretend it is a TTY when actually it is a pipe so isatty will return true.

So that's not so much tricking isatty as much as using the system in the way it is intended. If you do that without the required background information then you have only yourself to blame.


A pipe is different from a pseudo-TTY. Pseudo-TTYs are how all graphical terminals work, but all shells use simple pipes for redirection and pipe composition. TTYs and pseudo-TTYs are supposed to be interactive, with a user sitting on the other end with a keyboard. Plenty of tools vary their output based on whether they're writing to a tty, e.g. ls and others do not colourise by default when not outputting to a tty, as colour control characters are usually not what you want dumped to a text file or piped to grep. I feel this change brings curl more in line with other terminal programs that try to be good citizens.


Pseudo-TTYs come in pairs: one end for the application to connect to, the other for whatever is consuming the output or producing input to the application. So, yes, of course they are not pipes, but they are similar in that you have two endpoints. Pipes and pseudo-TTYs have quite a bit in common; thinking of a pseudo-TTY pair as a pipe that knows how to deal with a bunch of terminal-specific ioctl calls is a pretty good approximation.


Considering it a pipe is a very rough approximation, though. The only purpose of PTYs is to provide access to a TTY with all its bells and whistles (kernel provided line editing, yay!) in a world without hard terminals.


The pseudo-TTY subsystem is designed to allow user sessions with virtual, rather than hard, terminals: telnet, ssh, terminal emulators. It provides everything you would expect from a tty. Cooked mode, baud rate control (however redundant), all that jazz.

Its purpose is not just to make isatty return true. If you spin up that giant subsystem to trick an application into a different behavior on a simple pipe, then I believe it is appropriate to call it a hack, trick, workaround or similar.


> You can trick isatty, but only if you're very, very set on doing so. It requires allocating a pseudo-TTY, and attaching it.

You can also LD_PRELOAD an isatty shim that returns true. This is occasionally useful when piping the output of tools which disable ANSI colorization with no switch to re-enable that.
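A sketch of such a shim (the file name and build invocation are assumptions for Linux/glibc):

```c
/* isatty_shim.c — LD_PRELOAD shim that makes every file descriptor
 * claim to be a terminal, re-enabling colour in tools that gate it
 * on isatty(). Assumed build and use:
 *
 *   cc -shared -fPIC -o isatty_shim.so isatty_shim.c
 *   LD_PRELOAD=./isatty_shim.so some-tool | less -R
 */
int isatty(int fd)
{
    (void)fd;   /* lie unconditionally */
    return 1;
}
```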


"script" is also a quick and dirty hack. It's a small tool meant to record a shell session, but if you set the output to /dev/null, it is effectively just a pty wrapper. Quicker than writing a shim.


Never thought of that, that is a neat trick, thank you.


And note that isatty() works on a file descriptor, and that you might be outputting to both stderr and stdout, and both, neither, or some weird combination might be a tty. (i.e., you need to check for the file descriptor you're going to output to.)

Some programs will check if stdout is a tty prior to writing to stderr, for example, which means that if you're doing,

  foo | bar
foo's stderr will still go to the tty, but may lose its colouring. `babel`, a JS transpiler, for instance, loses the colour in its error messages when you pipe its output.


Hooray! I'm gonna go ahead and claim credit for the feature request (though I never attempted to implement it :)

https://curl.haxx.se/mail/archive-2014-02/0032.html


There are a number of download URLs that are "one-time" use. Shouldn't this at least download the output to a /tmp file?


If you're relying on the previous behaviour of curl on your one-time URL, you're already in trouble, because it's been printed to a terminal and the real contents are probably irretrievable.


No. This would open an enormous can of worms, but I'll just pick one: security.

And if you're the kind of person that needs this, the solution is a one-line alias.


I have wondered if there are security hazards to piping binary to your terminal.

I guess what really bad things can happen (and no I'm not talking about blindly piping to a shell to eval)?

Sure it messes up the terminal of which I usually type clear and/or reset. If it's really bad I usually just kill the terminal.


Many terminal emulators are not very good at dealing with untrusted input.

http://www.openwall.com/lists/oss-security/2017/05/01/13


There used to be some terminal emulators that supported "screen dump", where the scrollback buffer would be written to an arbitrary named file.

echo -e "\ec\n\e]55;/tmp/ouch.php\a"

I believe they've all removed that functionality by now, but there might be some older copies around here and there.


IIRC, another angle that has been proposed is to switch between the primary and secondary terminal buffers. This way, one could create a file that looks harmless when piped to the terminal but contains a hidden, malicious payload. The victim would, after manual inspection, finally pipe the file to a shell, where the payload would do something evil.


> The victim would, after manual inspection, finally pipe the file to a shell

Why would anyone do this? If I see some unknown, interesting file, I might run cat, head, tail, less or vim on it. If it's binary then maybe I'll use xxd. But it wouldn't even occur for me to pipe it to a shell.


As people tend to re-use shell commands, editing them as they go, I can totally see someone doing like curl XX | less and then curl XX | sh or something. If the download is relatively speedy.

You could argue this is one of the reasons less doesn't interpret control codes by default: it would let applications hide stuff like that by redrawing the screen.


You could use the script command. (Which is also very helpful for running screen under su.)


Checking for the NUL byte seems like a very arbitrary test. Shouldn’t it be using isprint_l()? Isn’t that, in fact, what isprint_l() is for?


Great feature. HTTPie, a CLI HTTP client I created, has always had a similar one: No binary data in terminal output, and on top of that, you get to see the response headers by default as well.

https://httpie.org/doc#binary-data


Why not check for ESC (0x1b, ^[) bytes instead? Without those, you can't mess up your terminal that badly.


Now you can annoy curl users by putting a null byte at the beginning of the document.


There are some valid use cases here, I hope this will not affect users of wopr (https://github.com/yaronn/wopr).


Is there going to be a ~/.curlrc option so I can turn this off? Else I suppose I can add -o- to my curl alias.


I'll probably just add "| cat" in the rare case when I want unfiltered output; then I don't need to memorize any options. That's what I do with ls when I temporarily don't want colorized output.


-o- seems like a terrible idea in a curl alias. Why do you want cURL to print binary data to your terminal? The check disables if you're redirecting output (> or |).


You could simply write a wrapper that reads an options file.

In general, it's a bad idea to have hidden state such as a configuration file. Imagine, for example, scripts that expect the standard curl behaviour, but your configuration file changes that.

(Of course setting shell aliases has the same problem...)


Shell aliases do not have that problem as they're generally used with interactive shells only. Non-interactive shells tend to read a different set of configuration files for that reason.


what situation would you want this?


I'd prefer they stop curl from detecting a pipe on STDOUT and assuming that the user must want download progress garbage. Why would I want it in a pipe any more than I'd want it standalone? Inconsistent behaviour violates POLS.


The GNU Coding Standards address this problem:

> "Please don't make the behavior of a command-line program depend on the type of output device it gets as standard output or standard input. [...] There is an exception for programs whose output in certain cases is binary data. Sending such output to a terminal is useless and can cause trouble. If such a program normally sends its output to stdout, it should detect, in these cases, when the output is a terminal and give an error message instead. The -f option should override this exception, thus permitting the output to go to the terminal."¹

1. https://www.gnu.org/prep/standards/standards.html#User-Inter...


I don't know about the version of curl you use, but mine doesn't give download progress in binary format, and doesn't contain any content of the target you're curling.

    $ curl news.ycombinator.com > /dev/null
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   178    0   178    0     0    369      0 --:--:-- --:--:-- --:--:--   370
You have to add an extra arg to get rid of that progress meter.


The behavior doesn't exactly switch on pipe detection though. The progress bar consistently shows when outputting to a file.

  $ curl news.ycombinator.com -o /dev/null
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
  100   178    0   178    0     0    948      0 --:--:-- --:--:-- --:--:--   951
I mean, it's effectively doing tty detection, but I think it's fair to argue a semantic difference, and it's very similar to how ls strips colors and disables listing format when piped, which is mentioned by the next paragraph in the coding standards.

> Compatibility requires certain programs to depend on the type of output device. It would be disastrous if ls or sh did not do so in the way all users expect. In some of these cases, we supplement the program with a preferred alternate version that does not depend on the output device type. For example, we provide a dir program much like ls except that its default output format is always multi-column format.


It's perfectly plausible that the user is piping it to a tool which can parse whatever binary format the file is using. Outputting it to the terminal, however, will just result in a mixture of unprintable characters and ANSI escape sequences.


should have used `-O -` so the option is consistent with wget


`-O` is already used. `curl -O <URL>` is roughly equivalent to `wget <URL>`.


The fact that cURL is hosted on haxx.se always throws me. It makes me immediately associate haxx === hacking and I feel uncomfortable. This is honestly why I use other tools instead, even though it's more an unconscious decision than a conscious one.


Yet you post on a site called Hacker News.


Ummm. You nailed me there.


And people wonder why integrity of the "news" media is important xD


I would never hire someone who purposely doesn't use industry-standard software that's purpose-built for the task just because the site name "makes them uncomfortable."


cURL is a good utility but in my view calling it industry standard is a stretch. Personally I like httpie and postman.



