
1. It is not sensible to check more than a small chunk of data, as it would result in significant performance penalties, both in high resource consumption (memory or temporary files) and in a blocked pipeline (cURL streams data as it is received). Imagine `curl http://somesite/10GBoftext | grep "rarely-occurring-prefix" | do-something-with-found-data`: with a full scan, curl would use 10GB of memory, checking every byte before it sent anything to grep and do-something. Without a full scan, do-something processes data live, and curl uses negligible resources.

2. Then the check yields a false negative, which is not a problem.

3. Then your UTF-8 string is unprintable, and the check will yield a true positive. The UTF-8/ASCII NUL character is not printable, despite being valid.

If one only assumes ASCII/UTF-8/Shift-JIS/similar, then a blob containing a null byte is guaranteed to be unprintable, while a blob not containing a null byte may be printable. That's good enough for a warning, telling you that you're doing something that you might not have intended.

Given that UTF-8 has become the standard, you will realistically never get a false warning, though you may still get bonkers output. You can always override the check if you have a fetish for UTF-32.
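To make the heuristic concrete, here is a minimal Python sketch of "look for a NUL byte in the first chunk only". It is not curl's actual implementation; the names `looks_binary` and `sample_is_binary` are made up for illustration, and the 2000-byte sample size is just the number floated in this thread, not a value taken from curl.

```python
import io


def looks_binary(chunk: bytes) -> bool:
    """Heuristic only: a NUL byte in the sample suggests non-text data."""
    return b"\x00" in chunk


def sample_is_binary(stream, sample_size: int = 2000) -> bool:
    """Inspect at most sample_size bytes (an assumed value), never the whole body."""
    head = stream.read(sample_size)
    return looks_binary(head)


if __name__ == "__main__":
    # UTF-8 text with non-ASCII characters contains no NUL bytes.
    print(sample_is_binary(io.BytesIO("héllo wörld".encode("utf-8"))))
    # UTF-32 pads ordinary characters with NUL bytes, so it trips the check.
    print(sample_is_binary(io.BytesIO("hello".encode("utf-32"))))
    # A NUL that appears only after the sampled prefix goes unnoticed.
    print(sample_is_binary(io.BytesIO(b"a" * 2000 + b"\x00")))
```

The last case is the false negative from point 2: the check stays cheap precisely because it gives up after the first chunk.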




> 1. It is not sensible to check more than a small chunk of data

You could just check the first X bytes. Also, I'm guessing curl doesn't print out to the terminal if the data is more than 2000 bytes anyway?

> 2. Then the check yields a false negative, which is not a problem.

The binary will be printed on the screen; that's a problem.

> 3. Then your UTF-8 string is unprintable, and the check will yield a true positive.

How is it that the zero byte is part of UTF-8 then?


>> 1. It is not sensible to check more than a small chunk of data

> You could just check the first X bytes. Also, I'm guessing curl doesn't print out to the terminal if the data is more than 2000 bytes anyway?

Why wouldn't it? 2000 bytes is just 25 lines by 80 characters.


right, it sounded bad for some reason!


> You could just check the first X bytes. Also, I'm guessing curl doesn't print out to the terminal if the data is more than 2000 bytes anyway?

Of course it does. You can curl the concatenated content of the library of congress to your terminal if you want to.

> the binary will be printed on the screen, that's a problem

No. Because printing the binary to screen is the current behaviour in all cases, the goal of this change is to reduce the incidence of it for quality of life.

> How is it that the zero byte is part of UTF-8 then?

News flash: unprintable characters are part of Unicode. NUL is one of them.


Thanks for your non-answers :)


> How is it that the zero byte is part of UTF-8 then?

It is a valid code point, just not a printable character. Unicode encodes every character that is or was in common use, not just the printable characters; this includes the control characters at the beginning of the ASCII table.
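A small Python sketch, using only the standard library, shows the distinction between "valid" and "printable". The helper name `describe` is hypothetical; it decodes bytes as UTF-8 and reports each code point, its Unicode general category, and whether Python considers it printable.

```python
import unicodedata


def describe(data: bytes):
    """Decode as UTF-8 and list (code point, category, printable?) per character."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isprintable())
        for ch in text
    ]


if __name__ == "__main__":
    # The zero byte decodes without error, but falls in category Cc (control).
    for row in describe(b"\x00A"):
        print(row)
```

The zero byte decodes cleanly to U+0000, so it is valid UTF-8, but its category is `Cc` (control) and `isprintable()` is false for it.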


With regard to your example (1), the output of your command is not a TTY, so the null check is never performed.
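As a rough illustration of that gating (a sketch, not curl's code; `should_check_for_binary` is a made-up name), the check would only run when output goes to a terminal:

```python
import io
import sys


def should_check_for_binary(out=sys.stdout) -> bool:
    """Only bother with the NUL check when output is an interactive terminal."""
    return out.isatty()


if __name__ == "__main__":
    # Reports True when run directly in a terminal, False when piped,
    # e.g. `python this_script.py | cat`.
    print(should_check_for_binary(), file=sys.stderr)
    # An in-memory buffer stands in for a pipe here: it is never a TTY.
    print(should_check_for_binary(io.StringIO()), file=sys.stderr)
```

In a pipeline like the one in (1), stdout is a pipe, so the check is skipped and data streams through untouched.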



