Hacker News
Building an invalid string in Python 2.x (gist.github.com)
141 points by bmease on Nov 21, 2013 | hide | past | favorite | 34 comments



I'm not a Python geek, but I found the C implementation of Unicode strings in CPython a really interesting read:

http://hg.python.org/cpython/file/tip/Objects/unicodeobject....

CPython supports several internal representations from one to four bytes per character to optimize for space and performance. There's also a nifty sort of Bloom filter for quick discrimination of strings that might contain characters of interest.
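
The bitmask trick can be sketched in a few lines of Python. This is an illustration of the idea only, not CPython's actual code (which does this in C, in unicodeobject.c): hash each character of a small set of "interesting" characters into a fixed-width mask, so most non-members can be rejected with a single AND.

```python
# Sketch of a character Bloom filter: one bit per (ord(ch) mod 64).
# A clear bit means "definitely not in the set"; a set bit means "maybe".
BLOOM_WIDTH = 64

def bloom_mask(chars):
    mask = 0
    for ch in chars:
        mask |= 1 << (ord(ch) & (BLOOM_WIDTH - 1))
    return mask

def maybe_contains(mask, ch):
    # False positives are possible; false negatives are not.
    return bool(mask & (1 << (ord(ch) & (BLOOM_WIDTH - 1))))

linebreaks = bloom_mask("\n\r\x1c\x1d\x1e\x85")
print(maybe_contains(linebreaks, "\n"))  # True
print(maybe_contains(linebreaks, "a"))   # False: 'a' is cheaply ruled out
```

The payoff is in tight scanning loops: most characters fail the mask test immediately and never reach the slower exact membership check.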


That's new in Python 3.3, which doesn't fall foul of that bug. This is PEP 393 (http://www.python.org/dev/peps/pep-0393/) if you want more reading about it.
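
The space savings from PEP 393 are easy to observe from Python 3.3+ itself; exact sizes vary by build, but the ordering below always holds:

```python
import sys

# PEP 393: each string uses the narrowest representation that fits
# its widest character (1, 2, or 4 bytes per character).
latin  = "a" * 100           # all code points < 256
bmp    = "\u0100" * 100      # needs 2 bytes per character
astral = "\U00010000" * 100  # needs 4 bytes per character

print(sys.getsizeof(latin), sys.getsizeof(bmp), sys.getsizeof(astral))
```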


When working as an AIX kernel programmer in 1985, I set registers to a unique value so it would be easy to spot code that tried to use an uninitialized value. My choice: 0xdeadbeef. Good to see that constant is still in use.


Whenever I find myself having to change a MAC address, I end up using DEADBEEFCAFE.

I hope I never forget to change them back and end up having to debug two different machines with the same MAC (which has actually happened to me in the wild: two machines came out of the factory with the same MAC. Talk about bad luck and shitty quality control.)


I've seen duplicate MACs twice in the last few years, on two different lines of embedded/consumer electronics boards from two different factories. There was a kind of Abbott & Costello routine that went on the first time, when a Taiwanese colleague with limited English reported the problem to me.


This is a technique useful in many different places. Using enums/defines that don't start at 0 in C/C++, for instance, helps debugging when you're dealing with possible memory corruption. Likewise, making sure related enums don't overlap in values helps disambiguate logic errors and other potential bugs when those enums are used in data structures.
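
The same tip carries over to Python. A minimal sketch (the enum names here are made up for illustration): start related enums at distinct, nonzero bases, so a zeroed field or a value from the wrong enum stands out immediately.

```python
from enum import IntEnum

# Distinct, nonzero value ranges: 0x100-0x1FF for one enum, 0x200-0x2FF
# for the other. A raw integer now identifies which enum it came from.
class ParserState(IntEnum):
    IDLE = 0x100
    READING = 0x101
    DONE = 0x102

class SocketState(IntEnum):
    OPEN = 0x200
    CLOSED = 0x201

# Zeroed memory can't masquerade as a legal state:
try:
    ParserState(0)
except ValueError:
    print("0 is not a ParserState")

# And the non-overlapping ranges keep the two enums disambiguated:
assert int(ParserState.IDLE) != int(SocketState.OPEN)
```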


There's a fair few of those out in the wild.

The magic number that identifies a java .class file is 0xCAFEBABE.
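
You can check the magic for yourself; the snippet below uses an in-memory header rather than a real .class file (the version bytes are a made-up example):

```python
import struct

# The first four bytes of a Java .class file are the big-endian magic.
header = bytes.fromhex("cafebabe00000034")  # magic + minor/major version (illustrative)
(magic,) = struct.unpack_from(">I", header)
print(hex(magic))  # 0xcafebabe
```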



Here's another list. :P

    < /usr/share/dict/words perl -ne 'print if m/^[abcdefilzsbtgo]*$/ && m/^........$/;' | perl -ne 'print if !(m/i/ && m/l/);' | tr 'ilzstgo' '1125790' | tr '[:lower:]' '[:upper:]' | perl -ne 'print "0x$_"' | column


Thanks. I found I needed a bit of help for rapid scanning, so:

    < /usr/share/dict/words perl -ne 'next unless m/^[abcdefilzsbtgo]*$/ && m/^........$/; next unless !(m/i/ && m/l/); chomp; print "$_ "; $_ =~ tr/ilzstgo/1125790/; print "0x$_\n"' | column

This adds the original word to the left of the hex. It makes the list much more scannable.

Tweaking the dots to count 10 shows that there are apparently 33 choices for WEP codes, if you are so inclined. (Which of course you shouldn't be, but, well...) And, alas, there do not appear to be any 64-bit constants according to my dictionary, though there are enough 32-bit choices to have some fun with phrases ("collated catcalls", "sadistic sabotage", "fattiest feedbags", "besotted ascetics", etc.). And that's just the even 8-8 phrases; 9-7 has almost as many ("godless geodesics", "falsest statistic", 0x7a55e11edb00b1e5).


    < /usr/share/dict/words perl -lne 'print "$_ 0x",tr/ilzstgo/1125790/r if m/^[abcdefilzsbtgo]{8}$/ && !(m/i/ && m/l/)' | column

Thanks to dhart for an introduction to the "< /file program" idiom that puts the file first without resorting to cat, and for column for that matter. Neat. In return I'll offer perl's new /r flag, which "returns" the result of tr// or s/// rather than some count.


There are a surprising number of those which could potentially be useful, like 0x0B501373, 0xCA11AB1E, and 0x5E77AB1E.


I'd end up wanting to spell it 0x0B501337.


And no code review jury would convict you. Now, if you start padding your arrays with 0xA5B35705, OSHA might get involved.


I'm not seeing any meaningful exploits coming from this. You can maybe send a request that will fail but I can't see any sort of injection taking place.


It managed to insert invalid Unicode into a SQLite database, causing a subsequent SELECT to fail. That's at least a DoS attack.


Yes, but only if you're decoding user input as UTF-7, which would be insane.


What if you were scraping a webpage and it reported its encoding as UTF-7?


...which is, in fact, exactly how this bug was exposed.


Modern browsers no longer support UTF-7, after a number of XSS attacks that relied on inserting UTF-7-encoded script elements and tricking the browser into sniffing the document as UTF-7.

The only place UTF-7 is still widely used is in email clients.


> only if you're decoding user input as UTF-7

Hmm, may I ask what guarantees that UTF-8 won't produce U+DEADBEEF, or something remotely like it?

Edit:

    '\xfb\x9b\xbb\xaf'.decode('utf8')

    UnicodeDecodeError: 'utf8' codec can't decode byte 0xfb in position 0: invalid start byte


When UTF-8 was first defined, they didn't know how big the Unicode range was going to be, so they defined it as a 1-6 byte encoding that could encode any 32-bit codepoint.

When Unicode was deemed to end at U+10FFFF (because that's the largest value that UTF-16 can encode), UTF-8 was revised to be a 1-4 byte encoding that ends in the same place.

Python clearly implements UTF-8 in a way that uses at most four bytes per codepoint (why support five- and six-byte sequences if they'll never be used?). I think what we're seeing in '\xfb\x9b\xbb\xaf' is the first four bytes of an old-style five-byte sequence (0xFB has the five-byte lead pattern 111110xx).
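
For concreteness, here's a rough sketch of the lead-byte rules under the original 1-6 byte scheme (RFC 2279). This function is illustrative, not CPython's actual decoder:

```python
def old_sequence_length(lead):
    # Sequence length implied by a lead byte under the original UTF-8 spec.
    if lead < 0x80: return 1   # 0xxxxxxx: ASCII
    if lead < 0xC0: return 0   # 10xxxxxx: continuation byte, not a lead
    if lead < 0xE0: return 2   # 110xxxxx
    if lead < 0xF0: return 3   # 1110xxxx
    if lead < 0xF8: return 4   # 11110xxx: modern UTF-8 stops here
    if lead < 0xFC: return 5   # 111110xx: original spec only
    if lead < 0xFE: return 6   # 1111110x: original spec only
    return 0                   # 0xFE/0xFF were never valid

print(old_sequence_length(0xFB))  # 5

# Today's decoders reject the old long forms at the very first byte:
try:
    b"\xfb\x9b\xbb\xaf".decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # ... invalid start byte
```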


It's a bug in the UTF-7 decoder that yields an invalid codepoint (outside the Unicode codespace) and isn't checked anywhere.


That's really cool! Character encoding issues are something we wrestle with all the time, and it is surprisingly hard to reason about all the ways supposedly "string" data are handled in the course of a typical workflow. I cringe; I hadn't even considered bugs in the encoding and decoding process itself.


Here's the bug (utf-7 decoder) so you don't have to log in to GitHub: http://bugs.python.org/issue19279


This reminds me of Gödel's incompleteness theorem - which I'll poorly present as: Any system that is sufficiently complex and complete will contain legal assertions that will disprove or destroy the system. (Those that do not are not complete).

http://en.wikipedia.org/wiki/G%C3%B6del's_incompleteness_the... http://www.amazon.com/G%C3%B6del-Escher-Bach-Eternal-Golden/...


Neither throwing an exception nor having a perfectly-deterministic buggy behavior is what Gödel was referring to. This shouldn't remind you of anything related to the incompleteness theorem, because it's completely unrelated.


Not completely unrelated: https://en.wikipedia.org/wiki/G%C3%B6del_numbering but what he/she was talking about is unrelated.


I don't want to be condescending, but that isn't what the theorem says. (I'm not even sure it's true.) Incompleteness means there is a true statement that cannot be proved true inside the system.


And the contrapositive is that any sufficiently complex complete system is inconsistent.


Oh man, brokentone was right this whole time. Sorry brokentone!


I'm only able to flirt with the edges of understanding this deep philosophy/math, so I can see correlations, but I can't defend them.


> ... will contain legal assertions...

Well, the comments state that to make this happen he had to "[exploit] a bug in the UTF-7 decoder". So, not legal assertions.


It parses, thus it is legal.



