Hacker News

Could you expand on this? What's a memoized token stream?



A token stream that preserves token identities. Say you have a token stream of [t0, t1] and add a new token t2 between t0 and t1 to get [t0, t2, t1]. What you want is to be able to identify t0 and t1 in the new stream as being the same tokens as in the old stream. If you simply re-lex to get new tokens, that won't happen, and if you use character offsets as your token identities, t1 in the new and old stream will have different identities.
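A minimal sketch of the idea, using a hypothetical toy `Tok` class: token identity is object identity, not character offset, so an insertion doesn't disturb the surrounding tokens.

```python
# Hypothetical sketch: tokens are objects, so identity survives edits.
class Tok:
    def __init__(self, kind, text):
        self.kind = kind
        self.text = text

    def __repr__(self):
        return f"{self.kind}({self.text!r})"

t0, t1 = Tok("ident", "foo"), Tok("ident", "bar")
stream = [t0, t1]

# The user types a new token between them.
t2 = Tok("op", "+")
stream = [t0, t2, t1]

# t0 and t1 are the *same* tokens as before, even though t1's character
# offset changed; a naive full re-lex would hand back fresh objects instead.
assert stream[0] is t0 and stream[2] is t1
```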

Incremental lexing is pretty easy: just convert the tokens around the damage back into text, re-lex that text, and then work from the front and back of the new partial stream, replacing new tokens with old existing tokens whenever their lexical categories match (e.g. an old identifier's text can be updated with a new identifier's contents; that's OK because parsing only depends on the token being an identifier). You might not win any performance awards, but those reused old tokens make very good keys into various data structures for incremental parsing, incremental type checking, and even incremental re-execution (if you're into live programming).
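A rough sketch of that front-and-back reuse step, with a hypothetical toy lexer (the regex, `Token` class, and function names are all made up for illustration):

```python
import re

# Toy lexer: identifiers, numbers, and a few operators.
TOKEN_RE = re.compile(r"\s*(?:(?P<ident>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<op>[+\-*/=]))")

class Token:
    def __init__(self, kind, text):
        self.kind = kind
        self.text = text

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m or m.end() == pos:
            break
        kind = m.lastgroup
        tokens.append(Token(kind, m.group(kind)))
        pos = m.end()
    return tokens

def relex_reusing(old_tokens, new_text):
    """Re-lex new_text, then splice old token objects back in from the
    front and back of the stream wherever the lexical category matches."""
    new_tokens = lex(new_text)
    # Match from the front.
    i = 0
    while (i < min(len(old_tokens), len(new_tokens))
           and old_tokens[i].kind == new_tokens[i].kind):
        old_tokens[i].text = new_tokens[i].text  # update contents, keep identity
        new_tokens[i] = old_tokens[i]
        i += 1
    # Match from the back, without crossing the front pointer.
    j = 0
    while (j < min(len(old_tokens), len(new_tokens)) - i
           and old_tokens[-1 - j].kind == new_tokens[-1 - j].kind):
        old_tokens[-1 - j].text = new_tokens[-1 - j].text
        new_tokens[-1 - j] = old_tokens[-1 - j]
        j += 1
    return new_tokens

# Typing "+" between two identifiers: both old tokens survive.
old = lex("a b")
t0, t1 = old
new = relex_reusing(old, "a + b")
assert new[0] is t0 and new[2] is t1 and new[1].kind == "op"
```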


> convert the tokens around the damage back into text, re-lex that text, and then work from front and back of the new partial stream

I haven't tried it and I'm no parsing expert, but it seems that when a rule is specified "forwards" (i.e. beginning-to-end), it can be used for lexing (or parsing) forwards, but there would be problems using it to lex (or parse) backwards. Hence the first "L" (left-to-right) in "LL" and "LR" parsing.

Maybe an algorithm could run both directions in parallel and build two structures/memoizations, one for the forward parse and one for the backward, hence a dual LL/RL or LR/RR parse, and consult both structures to repair the "damage". However, there could be another problem: if the lookahead for some lex or parse rule is 1, couldn't that translate to an arbitrarily large lookbehind when lexing or parsing backwards? An LL(1) grammar might require an equivalent RL(k) grammar for some very large k. Of course, for top-down parsing you'd also need to avoid right recursion as well as left recursion.

I haven't tried any of this but these are the things that cross my mind whenever I think of incremental parsing!


Thanks. To figure out what's damaged, it seems like you have to do a diff somewhere? It sounds like this is done at the character level rather than the token level?


No diff: just re-lex and damage the tree on the boundaries of where the user typed. My techniques (see also glitch) usually just assume a change has happened, since the reprocessing is incremental anyway and won't cascade if nothing has really changed (when reprocessing a tree, the parent is damaged if the end token changed; I guess you could call that a diff).
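One way to sketch that cascade rule (hypothetical `Node`/`reprocess` names, not anyone's real API): after a node is reprocessed, its parent is only damaged if the node's end token actually changed, so the work stops where nothing really changed.

```python
# Hypothetical sketch: damage cascades upward only on real change.
class Node:
    def __init__(self, parent=None, end_token=None):
        self.parent = parent
        self.end_token = end_token
        self.damaged = False

def reprocess(node, new_end_token):
    """Reprocess a damaged node; damage the parent only if the node's
    end token changed, so unchanged regions don't cascade."""
    node.damaged = False
    if new_end_token is not node.end_token:
        node.end_token = new_end_token
        if node.parent is not None:
            node.parent.damaged = True  # cascade only on real change
```

Since tokens keep their identities across edits (as above), an `is` comparison is enough to tell whether the end token really changed.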



