This is interesting to me as I am developing a tool that parses files in search of bugs, and I need to ignore binary files. What I am doing right now: when checking a line, if it is not a valid UTF-8 string I skip the file. It's not really nice, as I am doing this verification for every line of every file...
1. It is not sensible to check more than a small chunk of data, as it would result in significant performance penalties, both in high resource consumption (memory or temporary files) and in a blocked pipeline (cURL streams data as it is received). Imagine `curl http://somesite/10GBoftext | grep "rarely-occurring-prefix" | do-something-with-found-data`: with a full scan, curl would use 10GB of memory, checking every byte before sending anything to grep and do-something. Without a full scan, do-something processes data live, and curl uses negligible resources.
2. Then the check yields a false negative, which is not a problem.
3. Then your UTF-8 string is unprintable, and the check will yield a true positive. The UTF-8/ASCII NUL character is not printable, despite being valid.
If one only assumes ASCII/UTF-8/Shift-JIS/similar, then a blob containing a null byte is guaranteed to be unprintable, while a blob not containing a null byte may be printable. That's good enough for a warning, telling you that you're doing something that you might not have intended.
Given that UTF-8 has become the standard, you will realistically never get a false warning, though you may still get bonkers output. You can always overrule it if you have a fetish for UTF-32.
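To make that concrete, the whole heuristic boils down to something like this (a minimal sketch, not curl's actual code, and the helper name is made up):

```c
#include <stdbool.h>
#include <string.h>

/* Sketch of the heuristic: treat a chunk as unprintable "binary" if it
   contains a NUL byte; a chunk without one may still be printable. */
static bool looks_binary(const char *buf, size_t len)
{
    return memchr(buf, '\0', len) != NULL;
}
```

A false "text" answer is possible (binary data with no NUL in the chunk), but a false "binary" answer realistically is not, which is exactly points 2 and 3 above.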
> You could just check the first X bytes. Also, I'm guessing curl doesn't print out to the terminal if the data is more than 2000 bytes anyway?
Of course it does. You can curl the concatenated content of the Library of Congress to your terminal if you want to.
> the binary will be printed on the screen, that's a problem
No. Printing binary to the screen is the current behaviour in all cases; the goal of this change is to reduce how often that happens, as a quality-of-life improvement.
> How is it that the zero byte is part of UTF-8 then?
News flash: unprintable characters are part of Unicode, and NUL is one of them.
> How is it that the zero byte is part of UTF-8 then?
It is a valid code point, just not a printable character. Unicode encodes every character that is or was in common use, not just the printable characters; this includes the control characters at the beginning of the ASCII table.
The announcement says that "curl will inspect the beginning of each download", and I think that comparison just turns off the check after at least 2000 bytes have already been output (see a few lines below the change you quoted, where outs->bytes is incremented by the number of bytes that were output).
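If I'm reading it right, the shape of the logic is roughly this (hypothetical names, not the actual curl source):

```c
#include <stdbool.h>
#include <string.h>

#define CHECK_LIMIT 2000   /* only inspect the beginning of the download */

struct outstate {
    size_t bytes;   /* bytes already output, like the outs->bytes above */
    bool warned;    /* whether the binary warning has fired */
};

/* Called with each received chunk before it is written out. */
static void check_chunk(struct outstate *st, const char *buf, size_t len)
{
    if (!st->warned && st->bytes < CHECK_LIMIT) {
        /* never look past the first CHECK_LIMIT bytes in total */
        size_t room = CHECK_LIMIT - st->bytes;
        size_t n = len < room ? len : room;
        if (memchr(buf, '\0', n))
            st->warned = true;   /* here curl would emit the warning */
    }
    st->bytes += len;   /* the increment mentioned above */
}
```

So the check naturally switches itself off once 2000 bytes have gone through, which matches the announcement's wording.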
> what if your output is binary but doesn’t contain a byte 0?
I guess curl will incorrectly recognize the binary as text.
> what if your output is a normal UTF-8 string but contains a byte 0?
I guess curl will incorrectly recognize the text as binary, and you can use `-o -` to override that and output to the terminal anyway.
Are you sure you can do without that null pointer check? I would have moved it to before the pointer dereference in the initialization of isatty (and if you fear or know that some compilers won't accept local declarations after statements in functions, split the declaration and assignment of isatty too).
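Something like this shape, I mean (hypothetical code, just to illustrate the ordering; the struct is a stand-in for the real one):

```c
#include <stdio.h>
#include <unistd.h>   /* isatty, fileno */

struct outstruct { FILE *stream; };   /* stand-in for the real struct */

static int write_out(struct outstruct *outs)
{
    int out_is_tty;   /* declaration split from its assignment, for
                         compilers that reject declarations after
                         statements */

    if (!outs)        /* NULL check first... */
        return 1;

    out_is_tty = isatty(fileno(outs->stream));   /* ...dereference after */

    return out_is_tty ? 0 : 2;
}
```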
It means that a UTF-8 string can contain the "NUL" character, which is a single null byte. While valid UTF-8, it is still an unprintable character, making it "binary" for most intents and purposes.
It's worth noting that, while NUL is valid UTF-8, there's a lot of software which will fail strangely when presented with it, since UTF-8 is so often used to interoperate with NUL-terminated C strings. If you write such a string to a C string then it will be silently truncated.
I ran into this once with an Objective-C program that created filenames from strings found in files. When presented with a string containing NUL, the code ran, but didn't really work: I'd get "foo", ask it to append ".txt", and the result would come back as "foo" still! And it depended on the context in which the string was used. Using it as a filename truncated it, because you ultimately go through the POSIX-level calls that use C strings. Printing it in the debugger truncated it too, as evidently that path used C strings at some point. Displaying it in a text field in the UI worked perfectly fine, though, as apparently that path never uses C strings, and the NUL was just an invisible zero-width character in the middle.
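The same failure mode is easy to reproduce in plain C (a toy demo, not the Objective-C code from the story above):

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* 8 bytes of perfectly valid UTF-8 with an embedded NUL */
    const char data[] = {'f', 'o', 'o', '\0', '.', 't', 'x', 't'};

    /* Anything treating this as a NUL-terminated C string silently
       truncates at the NUL: this prints "foo" and strlen reports 3. */
    printf("as C string: \"%s\", strlen=%zu, real size=%zu\n",
           data, strlen(data), sizeof data);

    /* A length-aware write passes all 8 bytes through (terminals
       typically just swallow the NUL). */
    fwrite(data, 1, sizeof data, stdout);
    putchar('\n');
    return 0;
}
```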
> what if your output is a normal UTF-8 string but contains a byte 0?
It's not really a valid character to print to a terminal (most terminals ignore it), nor is it particularly valid in a "text file".
As for the rest of your questions: if the bytes in the file are randomly distributed, which is common with compressed binary files, the chance that there is no zero byte in the first 2000 bytes is (255/256)^2000, about 0.04%, which seems low enough.
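For what it's worth, that figure is easy to check (assuming independent, uniformly random bytes):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* probability that 2000 uniformly random bytes contain no 0x00 */
    double p = pow(255.0 / 256.0, 2000.0);
    printf("%.4f%%\n", 100.0 * p);   /* prints about 0.04% */
    return 0;
}
```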
> if the bytes in the file are randomly distributed
I believe this is rarely true, but binary formats do tend to use the 0 byte a lot, because it is such a "practical" value (going on instinct here). So I'm guessing this test is "good enough".
I would agree as well that the code looks great and quite "idiomatic C", but I'd prefer using 2048 and putting the memchr() in the same if with another &&.
Putting the memchr() in the same if is actually fine, as long as it is the last operand: && short-circuits in C (just as it does in Ruby), so memchr(), and its scan of up to 2000 bytes, only runs when all of the earlier, cheaper conditions are true; see the sketch below.
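That is, the merged condition costs nothing extra when memchr() sits at the end (illustrative names, not the real ones):

```c
#include <stdbool.h>
#include <string.h>

/* C's && evaluates left to right and short-circuits, so the expensive
   memchr() scan only runs when every cheaper test before it passed. */
static bool should_warn(bool is_tty, bool output_forced,
                        const char *buf, size_t len)
{
    return is_tty && !output_forced && memchr(buf, '\0', len) != NULL;
}
```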
So a few things:
* what if your output is more than 2000 bytes?
* what if your output is binary but doesn’t contain a byte 0?
* what if your output is a normal UTF-8 string but contains a byte 0? ( see https://stackoverflow.com/questions/6907297/can-utf-8-contai... )