
Since the tail of the line has a known format, I guess we are rescued by the fact that the last 0x3B is the semicolon, as the rest is just a decimal number. We can't know that the first 0x3B byte is the semicolon, since the place names are only guaranteed not to contain 0x3B but can contain 0x013B. So a parser should start from the rear of the line, read the number up to the semicolon, and then treat the place name as byte soup. Had two places shared a line, this challenge would have required real UTF-8 parsing and been much harder.



It's easier than you think... UTF-8 guarantees that all bytes of a multi-byte character have the high bit set. 0x3B (semicolon) does not have the high bit set. Therefore 0x3B is guaranteed to be your separator.

The same logic applies to newline - therefore, you can jump into the middle of the file anywhere and guarantee to be able to synchronize.
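To make that concrete, here is a minimal sketch, assuming the input is valid UTF-8 and each line looks like `name;number\n`. The class and method names (`LineSplitSketch`, `indexOfByte`, `nextLineStart`, `placeName`) are made up for illustration and are not from any particular solution:

```java
import java.nio.charset.StandardCharsets;

final class LineSplitSketch {
    // Plain byte scan for an ASCII target. In valid UTF-8 every byte of a
    // multi-byte character is >= 0x80, so an ASCII byte can never match
    // inside one.
    static int indexOfByte(byte[] buf, int start, int end, byte target) {
        for (int i = start; i < end; i++) {
            if (buf[i] == target) {
                return i;
            }
        }
        return -1;
    }

    // From an arbitrary offset (possibly in the middle of a character),
    // return the index where the next complete line begins.
    static int nextLineStart(byte[] buf, int offset) {
        int nl = indexOfByte(buf, offset, buf.length, (byte) '\n');
        return nl < 0 ? buf.length : nl + 1;
    }

    // Decode the place name of the line occupying [lineStart, lineEnd).
    // Assumes the line contains a ';' separator.
    static String placeName(byte[] buf, int lineStart, int lineEnd) {
        int sep = indexOfByte(buf, lineStart, lineEnd, (byte) ';');
        return new String(buf, lineStart, sep - lineStart, StandardCharsets.UTF_8);
    }
}
```

Note that U+013B encodes as the two bytes 0xC4 0xBB, so a name containing it never produces a stray 0x3B byte.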


I'm not sure scanning backwards helps. Running in reverse, you still need to look for a newline while scanning over a UTF-8 string, which might plausibly contain a newline byte.

I'm no UTF-8 guru, but I think it might be possible to use this as a sort of springboard for skipping over multi-byte codepoints, since as far as I understand the upper bits of the first byte encode the length:

    int lead = val & 0xFF; // treat the signed Java byte as unsigned

    if ((lead & 0xF8) == 0xF0) { // 11110xxx: 4 byte codepoint
      // ignore 3
    }
    else if ((lead & 0xF0) == 0xE0) { // 1110xxxx: 3 byte codepoint
      // ignore 2
    }
    else if ((lead & 0xE0) == 0xC0) { // 110xxxxx: 2 byte codepoint (0xC0 to 0xDF)
      // ignore 1
    }
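
For what it's worth, a self-contained sketch of that idea might look like the following. It assumes valid UTF-8 input and does no validation (for example, it does not reject overlong encodings); `Utf8LengthSketch`, `sequenceLength` and `countCodePoints` are hypothetical names chosen for this example. Each lead-byte pattern needs its own mask, since the 2-byte pattern 110xxxxx covers 0xC0 through 0xDF:

```java
final class Utf8LengthSketch {
    // Length in bytes of the sequence introduced by lead byte b:
    // 1 to 4 for a lead byte, 0 for a continuation byte (10xxxxxx)
    // or a byte that can never start a sequence (0xF8 to 0xFF).
    static int sequenceLength(byte b) {
        int v = b & 0xFF; // Java bytes are signed, so widen to 0..255 first
        if (v < 0x80) {
            return 1; // 0xxxxxxx: ASCII
        }
        if ((v & 0xE0) == 0xC0) {
            return 2; // 110xxxxx
        }
        if ((v & 0xF0) == 0xE0) {
            return 3; // 1110xxxx
        }
        if ((v & 0xF8) == 0xF0) {
            return 4; // 11110xxx
        }
        return 0;
    }

    // Count codepoints by hopping from lead byte to lead byte.
    // Steps a single byte on unexpected input so the loop always advances.
    static int countCodePoints(byte[] buf) {
        int count = 0;
        int i = 0;
        while (i < buf.length) {
            int len = sequenceLength(buf[i]);
            i += (len == 0) ? 1 : len;
            count++;
        }
        return count;
    }
}
```

For finding a semicolon or newline this skipping is not strictly needed, because continuation bytes never collide with ASCII; it only matters if you want to step codepoint by codepoint.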


Yes, sorry, I'm so used to this sort of thing with the work I've been doing lately I forgot to lay it out like that. Thank you for filling it in for me.



