This is interesting to me as I am developing a tool that parses files in search of bugs, and I need to ignore binary files. What I am doing right now: when checking a line, if it is not a valid UTF-8 string I skip the file. It's not really nice, as I am doing this verification for every line of every file...
1. It is not sensible to check more than a small chunk of data, as it would result in significant performance penalties, both in high resource consumption (memory or temporary files) and in a blocked pipeline (cURL streams data as it is received). Imagine `curl http://somesite/10GBoftext | grep "rarely-occurring-prefix" | do-something-with-found-data`: with a full scan, curl would use 10GB of memory, checking every byte before sending anything to grep and do-something. Without a full scan, do-something processes data live, and curl uses negligible resources.
2. Then the check yields a false negative, which is not a problem.
3. Then your UTF-8 string is unprintable, and the check will yield a true positive. The UTF-8/ASCII NUL character is not printable, despite being valid.
If one only assumes ASCII/UTF-8/Shift-JIS/similar, then a blob containing a null byte is guaranteed to be unprintable, while a blob not containing a null byte may be printable. That's good enough for a warning, telling you that you're doing something that you might not have intended.
Given that UTF-8 has become the standard, you will realistically never get a false warning, though you may still get bonkers output. You can always overrule it if you have a fetish for UTF-32.
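To make that concrete, the whole heuristic boils down to something like this (a minimal sketch, not curl's actual code, and the helper name is made up):

```c
#include <stdbool.h>
#include <string.h>

/* Sketch of the heuristic: treat a chunk as unprintable "binary" if it
   contains a NUL byte; a chunk without one may still be printable. */
static bool looks_binary(const char *buf, size_t len)
{
    return memchr(buf, '\0', len) != NULL;
}
```

A false "text" answer is possible (binary data with no NUL in the chunk), but a false "binary" answer realistically is not, which is exactly points 2 and 3 above.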
> You could just check the first X bytes. Also, I'm guessing curl doesn't print out to the terminal if the data is more than 2000 bytes anyway?
Of course it does. You can curl the concatenated content of the Library of Congress to your terminal if you want to.
> the binary will be printed on the screen, that's a problem
No. Printing binary to the screen is the current behaviour in all cases; the goal of this change is to reduce how often that happens, as a quality-of-life improvement.
> How is it that the zero byte is part of UTF-8 then?
News flash: unprintable characters are part of Unicode, and NUL is one of them.
> How is it that the zero byte is part of UTF-8 then?
It is a valid code point, just not a printable character. Unicode encodes every character that is or was in common use, not just the printable characters; this includes the control characters at the beginning of the ASCII table.
The announcement says that "curl will inspect the beginning of each download", and I think that comparison just turns off the check after at least 2000 bytes have already been output (see a few lines below the change you quoted, where outs->bytes is incremented by the number of bytes that were output).
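If I'm reading it right, the shape of the logic is roughly this (hypothetical names, not the actual curl source):

```c
#include <stdbool.h>
#include <string.h>

#define CHECK_LIMIT 2000   /* only inspect the beginning of the download */

struct outstate {
    size_t bytes;   /* bytes already output, like the outs->bytes above */
    bool warned;    /* whether the binary warning has fired */
};

/* Called with each received chunk before it is written out. */
static void check_chunk(struct outstate *st, const char *buf, size_t len)
{
    if (!st->warned && st->bytes < CHECK_LIMIT) {
        /* never look past the first CHECK_LIMIT bytes in total */
        size_t room = CHECK_LIMIT - st->bytes;
        size_t n = len < room ? len : room;
        if (memchr(buf, '\0', n))
            st->warned = true;   /* here curl would emit the warning */
    }
    st->bytes += len;   /* the increment mentioned above */
}
```

So the check naturally switches itself off once 2000 bytes have gone through, which matches the announcement's wording.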
> what if your output is binary but doesn’t contain a byte 0?
I guess curl will incorrectly recognize the binary as text.
> what if your output is a normal UTF-8 string but contains a byte 0?
I guess curl will incorrectly recognize the text as binary, and you can use `-o -` to override that and output to the terminal anyway.
Are you sure you can do without that null pointer check? I would have moved it to before the pointer dereference in the initialization of isatty (and if you fear or know that some compilers won't accept local declarations after statements in functions, split the declaration and assignment of isatty too).
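Something like this shape, I mean (hypothetical code, just to illustrate the ordering; the struct is a stand-in for the real one):

```c
#include <stdio.h>
#include <unistd.h>   /* isatty, fileno */

struct outstruct { FILE *stream; };   /* stand-in for the real struct */

static int write_out(struct outstruct *outs)
{
    int out_is_tty;   /* declaration split from its assignment, for
                         compilers that reject declarations after
                         statements */

    if (!outs)        /* NULL check first... */
        return 1;

    out_is_tty = isatty(fileno(outs->stream));   /* ...dereference after */

    return out_is_tty ? 0 : 2;
}
```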
It means that a UTF-8 string can contain the "NUL" character, which is a single null byte. While valid UTF-8, it is still an unprintable character, making it "binary" for most intents and purposes.
It's worth noting that, while NUL is valid UTF-8, there's a lot of software which will fail strangely when presented with it, since UTF-8 is so often used to interoperate with NUL-terminated C strings. If you write such a string to a C string then it will be silently truncated.
I ran into this once with an Objective-C program that created filenames from strings found in files. When presented with a string containing NUL, the code ran, but didn't really work: I'd get "foo", ask it to append ".txt", and the result would come back as "foo" still! And it depended on the context in which the string was used. Using it as a filename truncated it, because you ultimately go through the POSIX-level calls that use C strings. Printing it in the debugger truncated it too, as evidently that path used C strings at some point. Displaying it in a text field in the UI worked perfectly fine, though, as apparently that path never uses C strings, and the NUL was just an invisible zero-width character in the middle.
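The same failure mode is easy to reproduce in plain C (a toy demo, not the Objective-C code from the story above):

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* 8 bytes of perfectly valid UTF-8 with an embedded NUL */
    const char data[] = {'f', 'o', 'o', '\0', '.', 't', 'x', 't'};

    /* Anything treating this as a NUL-terminated C string silently
       truncates at the NUL: this prints "foo" and strlen reports 3. */
    printf("as C string: \"%s\", strlen=%zu, real size=%zu\n",
           data, strlen(data), sizeof data);

    /* A length-aware write passes all 8 bytes through (terminals
       typically just swallow the NUL). */
    fwrite(data, 1, sizeof data, stdout);
    putchar('\n');
    return 0;
}
```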
> what if your output is a normal UTF-8 string but contains a byte 0?
It's not really a valid character to print to a terminal (most terminals ignore it), nor is it particularly valid in a "text file".
As for the rest of your questions: if the bytes in the file are randomly distributed, which is common with compressed binary files, the chance that there is no zero byte in the first 2000 bytes is (255/256)^2000, about 0.04%, which seems low enough.
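For what it's worth, that figure is easy to check (assuming independent, uniformly random bytes):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* probability that 2000 uniformly random bytes contain no 0x00 */
    double p = pow(255.0 / 256.0, 2000.0);
    printf("%.4f%%\n", 100.0 * p);   /* prints about 0.04% */
    return 0;
}
```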
> if the bytes in the file are randomly distributed
I believe this is rarely true, but binary formats do tend to use the 0 byte a lot, because it is such a "practical" value (going on instinct here). So I'm guessing this test is "good enough".
I would agree as well that the code looks great and quite "idiomatic C", but I'd prefer using 2048 and putting the memchr() in the same if with another &&.
Putting the memchr() in the same if is actually fine, as long as it is the last operand: && short-circuits in C (just as it does in Ruby), so memchr(), and its scan of up to 2000 bytes, only runs when all of the earlier, cheaper conditions are true; see the sketch below.
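That is, the merged condition costs nothing extra when memchr() sits at the end (illustrative names, not the real ones):

```c
#include <stdbool.h>
#include <string.h>

/* C's && evaluates left to right and short-circuits, so the expensive
   memchr() scan only runs when every cheaper test before it passed. */
static bool should_warn(bool is_tty, bool output_forced,
                        const char *buf, size_t len)
{
    return is_tty && !output_forced && memchr(buf, '\0', len) != NULL;
}
```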
So a few things:
* what if your output is more than 2000 bytes?
* what if your output is binary but doesn’t contain a byte 0?
* what if your output is a normal UTF-8 string but contains a byte 0? ( see https://stackoverflow.com/questions/6907297/can-utf-8-contai... )