1. It is not sensible to check more than a small chunk of data, as doing so would carry significant performance penalties: high resource consumption (memory or temporary files) and a blocked pipeline (curl streams data as it is received). Imagine `curl http://somesite/10GBoftext | grep "rarely-occurring-prefix" | do-something-with-found-data`. With a full scan, curl would use 10GB of memory, checking every byte before sending anything to grep and do-something. Without a full scan, do-something processes data live, and curl uses negligible resources.
2. Then the check yields a false negative, which is not a problem.
3. Then your UTF-8 string is unprintable, and the check will yield a true positive. The UTF-8/ASCII NUL character is not printable, despite being valid.
If one only assumes ASCII/UTF-8/Shift-JIS/similar, then a blob containing a null byte is guaranteed to be unprintable, while a blob not containing a null byte may be printable. That's good enough for a warning, telling you that you're doing something that you might not have intended.
Given that UTF-8 has become standard, it means that you will never realistically get a false warning, but may still get bonkers output. You can always overrule if you have a fetish for UTF-32.
> You could just check the first X bytes. Also, I'm guessing curl doesn't print out to the terminal if the data is more than 2000 bytes anyway?
Of course it does. You can curl the concatenated content of the library of congress to your terminal if you want to.
> the binary will be printed on the screen, that's a problem
No. Because printing the binary to screen is the current behaviour in all cases, the goal of this change is to reduce the incidence of it for quality of life.
> How is it that the zero byte is part of UTF-8 then?
Flash News: unprintable characters are part of unicode. NUL is one of them.
> How is it that the zero byte is part of UTF-8 then?
It is a valid code point, just not a printable character. Unicode encodes every character that is or was in common use, not just the printable characters; this includes the control characters at the beginning of the ASCII table.
The announcement says that "curl will inspect the beginning of each download", and I think that comparison just turns off the check after at least 2000 bytes have already been output (see a few lines below the change you quoted, where outs->bytes is incremented by the amount of bytes that were output).
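Piecing those details together, the logic might look roughly like this sketch. The 2000-byte threshold comes from the discussion; the function and parameter names (`looks_binary`, `bytes_written`, standing in for `outs->bytes`) are hypothetical, not curl's actual identifiers:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BINARY_CHECK_BYTES 2000

/* Hypothetical sketch of the described check: decide whether a chunk
 * about to be written to the terminal should trigger the warning.
 * Only the first ~2000 output bytes are inspected; once bytes_written
 * passes that threshold the check is effectively switched off. */
static bool looks_binary(const unsigned char *buf, size_t len,
                         size_t bytes_written, bool is_terminal)
{
  if(!is_terminal)
    return false;                       /* redirected output: never warn */
  if(bytes_written >= BINARY_CHECK_BYTES)
    return false;                       /* past the inspected prefix */
  return memchr(buf, 0, len) != NULL;   /* NUL byte => treat as binary */
}
```

Note how the redirection and byte-count tests come first, so the scan only ever touches the beginning of the download.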
what if your output is binary but doesn’t contain a byte 0?
I guess curl will incorrectly recognize the binary as text.
what if your output is a normal UTF-8 string but contains a byte 0?
I guess curl will incorrectly recognize the text as binary, and you can use `-o -` to override that and output to the terminal anyway.
Are you sure you can do without that null pointer check? I would have moved it to before the pointer dereference in the initialization of isatty. (If you fear or know that some compilers won't accept local declarations after statements in functions, split the declaration and assignment of isatty, too.)
It means that a UTF-8 string can contain the "NUL" character, which is a single null byte. While valid UTF-8, it is still an unprintable character, making it "binary" for most intents and purposes.
It's worth noting that, while NUL is valid UTF-8, there's a lot of software which will fail strangely when presented with it, since UTF-8 is so often used to interoperate with NUL-terminated C strings. If you write such a string to a C string then it will be silently truncated.
I ran into this once with an Objective-C program that created filenames from strings found in files. When presented with a string containing NUL, the code ran, but didn't really work. I'd get "foo" and then ask it to append ".txt" and the result would come back as "foo" still! And it depends on the context in which you use it. Using it as a filename truncated, because you ultimately go through the POSIX-level calls that use C strings. Printing it in the debugger truncated, as evidently it used C strings at some point in that process. Displaying it in a text field in the UI worked perfectly fine, though, as apparently that path never uses C strings and the NUL character is just an invisible zero-space character in the middle.
> what if your output is a normal UTF-8 string but contains a byte 0?
It's not really a valid character to print to a terminal (most terminals ignore it), nor is it particularly valid in a "text file".
As for the rest of your questions: if the bytes in the file are randomly distributed, which is common with compressed binary files, the chance that there is no zero byte in the first 2000 bytes is about 0.04%, which seems low enough.
> if the bytes in the file are randomly distributed
I believe this is rarely true, but binary formats often do use the 0 byte because it is a practical value to reach for (going on instinct here). So I'm guessing this test is "good enough".
I would agree as well that the code looks great and quite "idiomatic C"; but I'd prefer using 2048 and put the memchr() in the same if with another &&.
Note that && short-circuits in C just as it does in Ruby, so as long as the memchr() comes last in the condition it only runs when the cheaper checks have already passed. Put it first, though, and you pay for the 2000-byte scan even when one of the other conditions is false.
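The short-circuit behaviour is easy to verify with a small sketch (`should_warn` is a hypothetical stand-in for curl's combined condition, not its real code):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static int scans;   /* counts how often the expensive scan actually runs */

static bool has_nul(const char *buf, size_t len)
{
  scans++;
  return memchr(buf, 0, len) != NULL;
}

/* && short-circuits: has_nul() runs only when the cheap tty test
 * has already passed, so a redirected stream never pays for the scan. */
static bool should_warn(bool is_terminal, const char *buf, size_t len)
{
  return is_terminal && has_nul(buf, len);
}
```

With `is_terminal` false, `has_nul` is never called at all; with it true, the scan runs once per chunk.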
I think this is a bad idea. Trying to make programs intelligent and have them behave in unexpected ways usually is.
It will make curl fail if a 0 byte is output within the text. There are many reasons why this can happen: software errors, UTF encodings, special use cases...
In fact, I run into this problem with grep from time to time and it is super annoying:
A NUL byte can't print in a terminal or show in an editor, so treating it as binary is a sensible thing to do when you are asking cURL to print it to your terminal. It is no longer text if a NUL byte is present (unless you use UTF-16/32, in which case, WHY?!).
I assume that you also inserted a NUL byte into your somefile.txt to make grep treat it as binary.
cURL behaves predictably now as well as before. It will warn you if you try to print unprintable data to your terminal, but can be told to ignore it, and won't affect output redirects.
Many legacy and contemporary programs change their output behaviour based on if they're outputting to a tty or to a pipe. Off the top of my head, ls (as old as you can get) and git (new) do not colourise their output by default for pipes, but do colourise for terminals. This is just curl trying to be a good citizen on the terminal, for which there is plenty of precedent.
My coworkers and I used curl, and the Linux kernel, as part of interviews. The candidate was asked to download the Linux kernel source during the interview; occasionally they would curl it to the tty. (We didn't ding them for this, as I do it all the time myself… In fact, if anything, it provides a great opportunity to see whether they know about reset, or understand the details of their mistake: why did that have so many odd effects on the terminal? why did the title change? etc.) Surprisingly, the binary output would sometimes screw up tmux beyond what reset could repair. (There would be characters printed outside the "terminal"'s area, for example. We suspect tmux has a few bugs because of this.)
I am sympathetic to your comment because it feels like we are babying users. To me, this is no different than when FreeBSD and GNU modified the rm command to prevent people from shooting themselves in the foot.
Came here to say this. I can imagine a lot of people instead of doing -o were just redirecting output... if this changes the behavior of that, then a lot of people are not going to be happy when a curl upgrade breaks scripts.
It won't because 'isatty' returns false for a pipe. So that should not break anything unless you plan on sending binary output to your terminal which is a bad idea anyway, but they allow you to override the default as per TFA.
Sure, but that's exactly what a pseudo-TTY is supposed to do, pretend it is a TTY when actually it is a pipe so isatty will return true.
So that's not so much tricking isatty as much as using the system in the way it is intended. If you do that without the required background information then you have only yourself to blame.
A pipe is different from a pseudo-TTY. Pseudo-TTYs are how all graphical terminals work, but all shells use simple pipes on redirection and pipe-composition. TTYs and pseudo-TTYs are supposed to be interactive, with a user sitting on the other end with a keyboard. Plenty of tools make their output different based on if they're outputting to a tty or not, e.g. ls and others do not colourise by default if not outputting to a tty, as colour control characters are usually not what you want dumped to a text file or piped to grep. I feel this change brings curl more in-line with other terminal programs that try to be good citizens.
Pseudo-TTYs come in pairs, one end for the application to connect to, the other for whatever is consuming the output or producing input to the application. So, yes, of course they are not pipes, but they are similar in that you have two end-points. Pipes and pseudo-TTYs have quite a bit in common; thinking of a pseudo-TTY pair as a pipe that knows how to deal with a bunch of terminal-specific ioctl calls is a pretty good approximation.
Considering it a pipe is a very rough approximation, though. The only purpose of PTYs is to provide access to a TTY with all its bells and whistles (kernel provided line editing, yay!) in a world without hard terminals.
The pseudo-TTY subsystem is designed to allow user sessions with virtual, rather than hard, terminals: telnet, ssh, terminal emulators. It provides everything you would expect from a tty. Cooked mode, baud-rate control (though redundant), all that jazz.
Its purpose is not just to make isatty return true. If you spin up that giant subsystem to trick an application into a different behavior on a simple pipe, then I believe it is appropriate to call it a hack, trick, workaround or similar.
> You can trick isatty, but only if you're very, very set on doing so. It requires allocating a pseudo-TTY, and attaching it.
You can also LD_PRELOAD an isatty shim that returns true. This is occasionally useful when piping the output of tools which disable ANSI colorization with no switch to re-enable that.
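A minimal shim might look like this. It is a sketch: it unconditionally lies about every descriptor, whereas a more careful shim would forward descriptors it doesn't care about to the real isatty via dlsym(RTLD_NEXT).

```c
/* isatty_shim.c -- build as a shared object:
 *   cc -shared -fPIC -o isatty_shim.so isatty_shim.c
 * then run:
 *   LD_PRELOAD=./isatty_shim.so some-tool | less -R
 * Every isatty() call in the target now reports a terminal, so tools
 * that key colour output off isatty keep their ANSI colours when piped. */
int isatty(int fd)
{
  (void)fd;    /* lie for every descriptor */
  return 1;
}
```

`less -R` then renders the surviving colour codes instead of showing them raw.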
"script" is also a quick and dirty hack. It's a small tool meant to record a shell session, but if you set the output to /dev/null, it is effectively just a pty wrapper. Quicker than writing a shim.
And note that isatty() works on a file descriptor, and that you might be outputting to both stderr and stdout, and both, neither, or some weird combination might be a tty. (i.e., you need to check for the file descriptor you're going to output to.)
Some programs will check if stdout is a tty prior to writing to stderr, for example, which means that if you're doing,
foo | bar
foo's stderr will go to the tty, but would lose coloring, for example.
For example, `babel`, a JS transpiler, loses the color in its error messages when you pipe its output.
If you're relying on the previous behaviour of curl on your one-time URL, you're already in trouble, because it's been printed to a terminal and the real contents are probably irretrievable.
IIRC, another angle that has been proposed is to switch between the primary and secondary terminal buffers. This way, one could create a file that looks harmless when piped to the terminal but contains a hidden, malicious payload. The victim would, after manual inspection, finally pipe the file to a shell, where the payload would do something evil.
> The victim would, after manual inspection, finally pipe the file to a shell
Why would anyone do this? If I see some unknown, interesting file, I might run cat, head, tail, less or vim on it. If it's binary then maybe I'll use xxd. But it wouldn't even occur for me to pipe it to a shell.
As people tend to re-use shell commands, editing them as they go, I can totally see someone doing like curl XX | less and then curl XX | sh or something. If the download is relatively speedy.
You could argue this is one of the reasons less doesn't interpret control codes by default: it would let applications hide stuff like that by redrawing the screen.
Great feature. HTTPie, a CLI HTTP client I created, has always had a similar one: No binary data in terminal output, and on top of that, you get to see the response headers by default as well.
I'll probably just add "| cat" on in the rare case when I want unfiltered output, then I don't need to memorize any options. That's what I do with ls when I temporarily don't want colorized output.
-o- seems like a terrible idea in a curl alias. Why do you want cURL to print binary data to your terminal? The check disables if you're redirecting output (> or |).
You could simply write a wrapper that reads an options file.
In general, it's a bad idea to have a hidden state such as a configuration file. Imagine, for examples, scripts that expect the standard curl behaviour, but your configuration file changes that.
(Of course setting shell aliases has the same problem...)
Shell aliases do not have that problem as they're generally used with interactive shells only. Non-interactive shells tend to read a different set of configuration files for that reason.
I'd prefer they stop curl from detecting a pipe on STDOUT and assuming that the user must want download progress garbage. Why would I want it in a pipe any more than I'd want it standalone? Inconsistent behaviour violates POLS.
“please don’t make the behavior of a command-line program depend on the type of output device it gets as standard output or standard input.
[…]
There is an exception for programs whose output in certain cases is binary data. Sending such output to a terminal is useless and can cause trouble. If such a program normally sends its output to stdout, it should detect, in these cases, when the output is a terminal and give an error message instead. The -f option should override this exception, thus permitting the output to go to the terminal. ”¹
I don't know about the version of curl you use, but mine doesn't give download progress in binary format, and doesn't contain any content of the target you're curling.
$ curl news.ycombinator.com > /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 178 0 178 0 0 369 0 --:--:-- --:--:-- --:--:-- 370
You have to add an extra arg to get rid of that progress meter.
The behavior doesn't exactly switch on pipe detection though. The progress bar consistently shows when outputting to a file.
$ curl news.ycombinator.com -o /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 178 0 178 0 0 948 0 --:--:-- --:--:-- --:--:-- 951
I mean, it's effectively doing tty detection, but I think it's fair to argue a semantic difference, and it's very similar to how ls strips colors and disables listing format when piped, which is mentioned by the next paragraph in the coding standards.
Compatibility requires certain programs to depend on the type of output device. It would be disastrous if ls or sh did not do so in the way all users expect. In some of these cases, we supplement the program with a preferred alternate version that does not depend on the output device type. For example, we provide a dir program much like ls except that its default output format is always multi-column format.
It's perfectly plausible that the user is piping it to a tool which can parse whatever binary format the file is using. Outputting it to the terminal, however, will just result in a mixture of unprintable characters and ANSI escape sequences.
the fact that cURL is hosted on haxx.se always throws me. It makes me immediately associate haxx === hacking and I feel uncomfortable. This is honestly why I use other tools instead even though it's more an unconscious decision than a conscious one.
I would never hire someone who purposely doesn't use industry-standard software that's purpose-built for the task just because the site name "makes them uncomfortable."
So a few things:
* what if your output is more than 2000 bytes?
* what if your output is binary but doesn’t contain a byte 0?
* what if your output is a normal UTF-8 string but contains a byte 0? ( see https://stackoverflow.com/questions/6907297/can-utf-8-contai... )
This is interesting to me, as I am developing a tool that parses files in search of bugs and I need to ignore binary files. What I am doing right now: when checking a line, if it is not a valid UTF-8 string, I skip the file. It's not really nice, as I am doing this verification for every line of every file...