Since the tail of the line has a known format I guess we are rescued by the fact...

londons_explore · on Jan 4, 2024

It's easier than you think... UTF-8 guarantees that every byte of a multi-byte character has the high bit set. 0x3B (semicolon) does not have the high bit set. Therefore a 0x3B byte can never be part of a multi-byte character, and it is guaranteed to be your separator.

The same logic applies to newline (0x0A), so you can jump into the middle of the file anywhere and be guaranteed to resynchronize at the next line boundary.
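
To make that concrete, here is a minimal sketch of the resynchronization idea, assuming the file is already in a byte array and each line looks like `name;value\n`. The class and method names (`Utf8Sync`, `nextLineStart`, `indexOfSeparator`) are made up for illustration and are not from anyone's actual solution.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: resynchronize after jumping to an arbitrary byte offset.
class Utf8Sync {
    // Index of the first byte after the next '\n' at or after `from`,
    // or data.length if no newline follows.
    static int nextLineStart(byte[] data, int from) {
        for (int i = from; i < data.length; i++) {
            // Bytes of multi-byte characters are >= 0x80, so they never equal 0x0A.
            if (data[i] == '\n') {
                return i + 1;
            }
        }
        return data.length;
    }

    // Index of the first ';' (0x3B) at or after `from`, or -1 if there is none.
    static int indexOfSeparator(byte[] data, int from) {
        for (int i = from; i < data.length; i++) {
            if (data[i] == ';') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source ASCII-only: "Zürich" and "São Paulo".
        byte[] data = "Z\u00fcrich;12.3\nS\u00e3o Paulo;25.1\n"
                .getBytes(StandardCharsets.UTF_8);
        int start = nextLineStart(data, 2); // offset 2 is inside the two-byte 'ü'
        int sep = indexOfSeparator(data, start);
        System.out.println(new String(data, start, sep - start, StandardCharsets.UTF_8));
    }
}
```

Landing inside a multi-byte character is harmless here: the scan only ever compares against ASCII byte values, which cannot appear inside such a character.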

marginalia_nu · on Jan 4, 2024

I'm not sure this helps when scanning backwards. Running in reverse, you still need to look for a newline while scanning over a UTF-8 string which might plausibly contain a newline byte.

I'm no UTF-8 guru, but I think it might be possible to use this as a sort of springboard for skipping over multi-byte codepoints, since as far as I understand the upper bits of the first byte encode the length:

    int utfByte1 = val & 0xFF; // read the lead byte as unsigned

    if ((utfByte1 & 0xF8) == 0xF0) { // 4 byte codepoint (11110xxx)
      // ignore 3
    }
    else if ((utfByte1 & 0xF0) == 0xE0) { // 3 byte codepoint (1110xxxx)
      // ignore 2
    }
    else if ((utfByte1 & 0xE0) == 0xC0) { // 2 byte codepoint (110xxxxx, i.e. 0xC0-0xDF)
      // ignore 1
    }
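
For reference, a self-contained sketch of that lead-byte check, under the assumption that the input is well-formed UTF-8. `Utf8Lengths`, `sequenceLength` and `countCodePoints` are hypothetical names. Note that a 2-byte lead byte is 110xxxxx, which covers 0xC0 through 0xDF, so each length needs its own mask rather than a single comparison on the top nibble.

```java
// Hypothetical helper: read the sequence length from a UTF-8 lead byte.
class Utf8Lengths {
    // Number of bytes in the sequence that starts with b.
    // Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
    static int sequenceLength(byte b) {
        int v = b & 0xFF;                    // treat the byte as unsigned
        if ((v & 0x80) == 0x00) return 1;    // 0xxxxxxx: ASCII
        if ((v & 0xE0) == 0xC0) return 2;    // 110xxxxx
        if ((v & 0xF0) == 0xE0) return 3;    // 1110xxxx
        if ((v & 0xF8) == 0xF0) return 4;    // 11110xxx
        return 0;
    }

    // Counts codepoints by hopping from lead byte to lead byte.
    // This does not validate the input; stray bytes are skipped one at a time.
    static int countCodePoints(byte[] data) {
        int count = 0;
        int i = 0;
        while (i < data.length) {
            int len = sequenceLength(data[i]);
            if (len == 0) {
                i++;          // not a lead byte: step over it without counting
            } else {
                count++;
                i += len;     // skip the lead byte and its continuation bytes
            }
        }
        return count;
    }
}
```

Whether this hop is faster than a plain byte-by-byte compare would need measuring; it is only meant to show the bit patterns involved.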

jerf · on Jan 4, 2024

Yes, sorry, I'm so used to this sort of thing with the work I've been doing lately I forgot to lay it out like that. Thank you for filling it in for me.