Rot8000 – Rot13 for the Unicode generation

lelf · on Nov 2, 2013

It's broken.

Λ̊1 → ⊻∪ά → Λ̊⋌

𝄞 → 뤔뷾 → 駴點

Edit: anyway, even with a correct (a+b)%n it's a plain bad idea.

Unicode is not the English alphabet. Everything outside the Basic Multilingual Plane is broken automatically. And even in the BMP there's going to be a bag of glitches, starting with hanging combining characters and ending with ‘oops, someone normalised our string and it's now different’ (for the site, not for the user / Unicode).
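
A minimal Python sketch of the two failure modes named above. It assumes nothing about how the site is implemented; it only shows why code outside the BMP and normalisation are hazards for any per-code-unit transform.

```python
import unicodedata

# Outside the BMP: U+1D11E (𝄞) needs two UTF-16 code units (a surrogate pair),
# so a transform that shifts each 16-bit unit on its own tears the pair apart.
clef = "\U0001D11E"
raw = clef.encode("utf-16-be")
code_units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
print([hex(u) for u in code_units])  # ['0xd834', '0xdd1e']

# Inside the BMP: normalisation can change the code points of "the same" text.
decomposed = "A\u030A"                               # A + COMBINING RING ABOVE
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00C5
print(len(decomposed), len(composed))                # 2 1
```

If ciphertext is normalised anywhere between encoding and decoding, the decoder sees different code points than the encoder produced.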

bouk · on Nov 2, 2013

Pretty sure it's meant as a joke

rottytooth · on Nov 2, 2013

It is meant as a joke -- but also planning to fix these issues ... apparently this was the right place to bring it to find all the situations where it doesn't work correctly :)

derefr · on Nov 2, 2013

Rather than rotating through the entire BMP, I would suggest instead using Unicode's localized collations, and just rotating every character that's part of a fully-orderable "alphabet" set through that set according to those orders. (This means, for example, rotating Japanese hiragana, but not kanji.)

rottytooth · on Nov 7, 2013

CJK support is fixed now.

mischanix · on Nov 2, 2013

Not reciprocal for CJK input, e.g. "한글" takes 5 iterations to reach stability. I believe this has to do with the UTF-16 encoding of code points ≥ 0x10000.

lelf · on Nov 2, 2013

한글 is in the basic plane. It's U+D55C U+AE00.

mischanix · on Nov 2, 2013

I was considering the fact that when it adds 0x8000 or whatever it's doing it's hitting 0x1.... codepoints and doing weird things with those because of the encoding. Here's a trace of 한글 through this 'rot8000', though:

    한글: 0xd55c 0xae00
    똼軠: 0xb63c 0x8ee0
    霜激: 0x971c 0x6fc0
    矼傠: 0x77fc 0x50a0
    壜ㆀ: 0x58dc 0x3180
    㦼በ: 0x39bc 0x1260
    ᪜ㆀ: 0x1a9c 0x3180
    㦼በ: (repeating)
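
For what it's worth, the step size can be read straight off the trace. This Python sketch only does arithmetic on the numbers posted above; it says nothing about what the site's code actually does internally.

```python
# Code points from the trace above, one tuple per iteration.
trace = [
    (0xD55C, 0xAE00),
    (0xB63C, 0x8EE0),
    (0x971C, 0x6FC0),
    (0x77FC, 0x50A0),
    (0x58DC, 0x3180),
    (0x39BC, 0x1260),
    (0x1A9C, 0x3180),
    (0x39BC, 0x1260),  # the "(repeating)" row
]

# Difference between consecutive iterations, per character.
deltas = [(b0 - a0, b1 - a1) for (a0, a1), (b0, b1) in zip(trace, trace[1:])]
for d in deltas:
    print(d)

# Every step moves by the same magnitude: 7968 == 8000 - 32 == 0x1F20.
step_sizes = {abs(d) for pair in deltas for d in pair}
print(step_sizes)  # {7968}
```

So in this trace each character drops by 7968 until it gets low, then bounces up and down by 7968.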

So... yeah. Weirdness all around. Might have better luck doing this with some carefully crafted xor pad for each codepoint so that it's likely to hit a printable character but impossible to hit a character in the 0xD800..0xDFFF range (and similar ranges)... trying to "wrap" in unicode would require reinterpreting the codepoints to some continuous numeric representation.
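
A rough Python sketch of the "continuous numeric representation" idea, under stated assumptions: only the surrogate block 0xD800..0xDFFF is skipped, non-BMP characters pass through unchanged, and the names (`to_index`, `from_index`, `rotate`) are made up for illustration. It does not skip control characters, unassigned code points, or combining marks.

```python
SURROGATE_START = 0xD800
SURROGATE_END = 0xDFFF
SURROGATE_COUNT = SURROGATE_END - SURROGATE_START + 1  # 2048
USABLE = 0x10000 - SURROGATE_COUNT                     # BMP minus surrogates
HALF = USABLE // 2                                     # rotating by half is self-inverse


def to_index(cp: int) -> int:
    """Map a non-surrogate BMP code point to a gap-free index."""
    return cp if cp < SURROGATE_START else cp - SURROGATE_COUNT


def from_index(i: int) -> int:
    """Inverse of to_index: re-open the gap where the surrogates live."""
    return i if i < SURROGATE_START else i + SURROGATE_COUNT


def rotate_char(ch: str) -> str:
    cp = ord(ch)
    # Assumption: leave non-BMP characters and lone surrogates untouched.
    if cp > 0xFFFF or SURROGATE_START <= cp <= SURROGATE_END:
        return ch
    return chr(from_index((to_index(cp) + HALF) % USABLE))


def rotate(text: str) -> str:
    return "".join(rotate_char(c) for c in text)


print(rotate(rotate("한글")) == "한글")  # True
```

Because the rotation happens in the gap-free index space, no output can land in the surrogate range, and applying it twice returns the input.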

rottytooth · on Nov 2, 2013

Something might be off in the math -- there are some work-arounds to skip control characters that might be off when starting in this range

aculver · on Nov 2, 2013

Inputting "こんにちは。元気ですか？" caused an application error:

    [ArgumentException: Error serializing value 'ᄳᅳᅋᅁᅏტ㈣䳷ᅇᄹᄫ�' of type 'System.String.']

After realizing it was "？" that was breaking everything, I ended up with this round trip:

"こんにちは。元気ですか。" → "ᄳᅳᅋᅁᅏტ㈣䳷ᅇᄹᄫტ" → "こんにちは。ጃ⷗ですか。"

It's broken. I suspect Unicode requires more careful manipulation than OP anticipated. :-)

peterwaller · on Nov 2, 2013

Copy-pasting the contents of rot8000.com/info in and hitting cypher twice ends up scrambling the contents quite a bit..

  It also bypasses 32 control characters, technically making it rot7968, sometimes with an additional offset.

->

  It also bypasses ⋍2 control characters, technically making it rot⋏⋬68, sometimes with an additional offset.

rottytooth · on Nov 2, 2013

hmm, I'm not seeing this result

rottytooth · on Nov 7, 2013

I put in a fix for CJK and the result is: nearly everything that's not CJK now rotates into it and back out; CJK is a huge section of the Basic Multilingual Plane. The fix invalidates rotations done with rot8000 before the fix, unfortunately.

njharman · on Nov 3, 2013

I just realized that 13 was probably chosen for rot13 because that's half the number of letters in the English alphabet.

I miss "obvious" stuff like that all the time.

jloughry · on Nov 2, 2013

Why not call it Rot8192 or Rot0x7777 ?

throwaway0094 · on Nov 2, 2013

rot13 is (X + (26/2)) mod 26; this is (X + (2^16)/2) mod 2^16. (The BMP is the first 2^16 code points of Unicode.)

Edit: silly formatting dropped my math punctuation.
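
The two formulas above, written out as a small Python sketch. `rot_half` and `rot13_upper` are illustrative names, and this is the naive formula only: it ignores surrogates and control characters.

```python
def rot_half(x: int, n: int) -> int:
    """Shift x by half the alphabet size n, wrapping around."""
    return (x + n // 2) % n


def rot13_upper(text: str) -> str:
    """rot13 for A-Z only; everything else passes through."""
    return "".join(
        chr(ord("A") + rot_half(ord(c) - ord("A"), 26)) if "A" <= c <= "Z" else c
        for c in text
    )


print(rot13_upper("HELLO"))          # URYYB
print(hex(rot_half(0x0041, 2**16)))  # 0x8041

# Shifting by exactly half the alphabet is its own inverse.
print(all(rot_half(rot_half(x, 2**16), 2**16) == x for x in range(2**16)))  # True
```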

CUViper · on Nov 2, 2013

I suspect they also really wanted the rot8 -> "rotate" joke.

AFAICS, it's actually using decimal 8000, not 2^16/2 = 0x8000, so I don't really understand how this is reversible at all unless they're just subtracting it back.

What we really need is rot88000h for the full U+0..U+10FFFF range. :)
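
A quick Python check of both points, as a sketch: a modular shift by decimal 8000 is not its own inverse over 2^16 values, while 0x88000 is exactly half of the 0x110000 code space. The `shift` helper is hypothetical, and a naive full-range rotation would still land on surrogates.

```python
BMP_SIZE = 0x10000
CODE_SPACE = 0x110000  # U+0000..U+10FFFF


def shift(x: int, k: int, n: int) -> int:
    """Plain modular shift of x by k within n values."""
    return (x + k) % n


x = 0x0041  # 'A'

# Decimal 8000 applied twice does not come back to the start.
twice_decimal = shift(shift(x, 8000, BMP_SIZE), 8000, BMP_SIZE)
print(hex(twice_decimal))  # 0x3ec1

# Half of the full code space is 0x88000, and that shift is self-inverse.
print(hex(CODE_SPACE // 2))  # 0x88000
twice_full = shift(shift(x, 0x88000, CODE_SPACE), 0x88000, CODE_SPACE)
print(twice_full == x)  # True

# Caveat: without special handling, some inputs map onto surrogates.
print(hex(shift(0x95800, 0x88000, CODE_SPACE)))  # 0xd800
```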

rottytooth · on Nov 7, 2013

It's using 0x8000, which is half of 0x10000 (the size of the Basic Multilingual Plane). It doesn't extend out of the BMP.