I would recommend that the author read up on NFAs and DFAs -- they are a formalism better suited to lexers than tries.
At a high level, if compiling a lexer for a run-of-the-mill language like JavaScript takes 5 minutes and 2.5GB of RAM, you are most likely doing it wrong. By "doing it wrong," I mean that there is almost certainly a better approach that is far superior in every measurable aspect (CPU, memory, code size, etc).
I don't fully understand what kind of algorithm the author was using, so I can't comment in detail on it, but in general lexers are better thought of as finite automata (NFAs and DFAs) than tries. The two are related, but unlike tries NFAs and DFAs can have cycles, which are essential for representing a lexing state machine efficiently.
Another observation: it's not too terribly surprising that you could beat preg_match_all() with the latter being given a huge regular expression. Most regex engines are backtracking (RE2 being a notable counterexample), which means that a regular expression with high-cardinality alternation (ie. A|B|C|D|E|F|G ...etc) is one of their worst cases. This isn't what they are designed for. A backtracking regex engine will basically try each alternative in sequence until that alternative no longer matches, then back up and try the next one. NFA/DFA based parsers will be much faster for this case.
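A rough PHP sketch of the contrast (the keyword list and input here are made up, not from the article): the first call hands the engine one giant alternation, the second does what a DFA-style lexer effectively does, one identifier rule plus a table check.

    // Made-up illustration of the bad case: one giant alternation that a
    // backtracking engine must try branch by branch at every position.
    $keywords = ['function', 'for', 'foreach', 'return', 'require']; // imagine hundreds
    $source   = 'for (let i = 0; i < n; i++) { return f(i); }';

    $pattern = '/\b(?:' . implode('|', array_map('preg_quote', $keywords)) . ')\b/';
    preg_match_all($pattern, $source, $matches);   // walks the alternatives one by one

    // Versus a single identifier rule plus an O(1) table lookup, which is
    // closer to what a DFA-style lexer ends up doing anyway.
    $isKeyword = array_fill_keys($keywords, true);
    preg_match_all('/[A-Za-z_][A-Za-z0-9_]*/', $source, $m);
    foreach ($m[0] as $word) {
        $type = isset($isKeyword[$word]) ? 'keyword' : 'identifier';
        // ... emit a token of $type
    }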
The right tool for this job, as another commenter mentioned, is Ragel. It's basically designed for exactly this. It doesn't appear to support PHP though...
> I would recommend that the author read up on NFAs and DFAs -- they are a formalism better suited to lexers than tries.
Author here. The actual regex implementation uses an NFA. It started out using a trie, but moved away from that.
The majority of what I wanted to get across here was the use of a minimal structure (single-character).
The next step was using a maximal radix implementation (as long of a prefix as possible). Then finally, throwing all of it away and going straight to parsing using a state machine.
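For readers who want to picture that first "minimal structure" stage, here's a rough single-character trie sketch in PHP (illustrative only, not the library's actual code): each node is an array keyed by one character, with '#' marking the end of a keyword (keywords are assumed not to contain '#').

    function trie_insert(array $node, string $word): array {
        if ($word === '') {
            $node['#'] = true;
            return $node;
        }
        $node[$word[0]] = trie_insert($node[$word[0]] ?? [], substr($word, 1));
        return $node;
    }

    // Walk the trie from $pos and return the length of the longest keyword
    // that starts there (0 if none).
    function trie_longest(array $trie, string $input, int $pos): int {
        $node = $trie; $len = 0; $best = 0; $n = strlen($input);
        while ($pos + $len < $n && isset($node[$input[$pos + $len]])) {
            $node = $node[$input[$pos + $len]];
            $len++;
            if (isset($node['#'])) $best = $len;
        }
        return $best;
    }

    $trie = [];
    foreach (['if', 'in', 'instanceof'] as $kw) {
        $trie = trie_insert($trie, $kw);
    }
    echo trie_longest($trie, 'instanceof x', 0);   // 10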
I may be mixing this language up with another text-manipulation language starting with 'R' (not being coy—I genuinely don't remember its name), but I seem to remember being uncertain of the reach of this project.
The Wikipedia page links two publications by (I suppose) the creator of Ragel, and two publications that do not seem to be specifically about Ragel; and the product page mentions nothing in the way of publications or references to uses in industry. Is Ragel actually widely used?
Tries are amazing data structures, simple and extraordinarily fast -- O(m) lookups. But they also eat memory at an extraordinary rate. They're a classic speed-vs.-memory trade-off.
However, most people use naive Tries, just adding elements down a branch until they exhaust the string they're inserting.
One easy optimization to make with Tries is to set a maximum branch length, based on some statistical analysis of lexeme usage (for example, make 90% of your lookups reachable under that length). Any lexeme longer than that simply gets hung off the end of the branch in a more space-efficient data structure (like a hash table).
Your lookup is then still O(m) for anything under the maximum branch length, and longer lexemes are still just O(m)+O(n) or whatever.
But your memory usage will shrink dramatically. And you can tune it further by fiddling with the branch length and choosing, say, 80% reachable without hashing.
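A rough PHP sketch of that cutoff idea (MAX_DEPTH and the marker keys are mine, purely illustrative; children are keyed by single characters, so the multi-character '#end' / '#tail' keys can't collide with them):

    define('MAX_DEPTH', 4);   // tune from your lexeme statistics

    function capped_insert(array $node, string $word, int $depth = 0): array {
        if ($word === '') {
            $node['#end'] = true;
            return $node;
        }
        if ($depth >= MAX_DEPTH) {
            // hang the rest of the lexeme off the branch in a plain hash
            $node['#tail'] = $node['#tail'] ?? [];
            $node['#tail'][$word] = true;
            return $node;
        }
        $ch = $word[0];
        $node[$ch] = capped_insert($node[$ch] ?? [], substr($word, 1), $depth + 1);
        return $node;
    }

    function capped_contains(array $node, string $word, int $depth = 0): bool {
        if ($word === '') {
            return isset($node['#end']);
        }
        if ($depth >= MAX_DEPTH) {
            return isset($node['#tail'][$word]);
        }
        $ch = $word[0];
        return isset($node[$ch]) && capped_contains($node[$ch], $word, $depth + 1);
    }

    $t = [];
    foreach (['if', 'implements', 'instanceof'] as $kw) {
        $t = capped_insert($t, $kw);
    }
    var_dump(capped_contains($t, 'instanceof'));   // bool(true)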
Indeed there are a lot of good optimization options for tries.
At a little bit of a speed penalty you can use a bitmap for navigation instead of pointers, which obliterates the memory footprint. In some use cases the result is actually smaller than the equivalent Bloom filter with a sane FP rate (though generally not smaller than a Golomb-compressed set or cuckoo filter).
A compromise between the speed and memory bloat of pointers is to use Radix/Patricia tries, which are a solid go-to option. In some cases they'll be slower than naive tries, but in others they'll be faster.
The fastest and most memory-efficient implementation I've seen, though, boiled each pattern in the trie down to its most unique dword or qword, and only attempted a full match after an initial hit. Locality of reference can work wonders here -- we got a 20x speedup over an already very fast naive trie this way.
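That prefilter idea looks roughly like this in PHP (not the parent's implementation, just a sketch: a real version would compare raw 32-bit words rather than substring hash keys, and this one simply skips patterns shorter than 4 bytes):

    // For each pattern, pick the 4-byte window that is rarest across all
    // patterns; only attempt a full match when that window shows up.
    function build_prefilter(array $patterns): array {
        $counts = [];
        foreach ($patterns as $p) {
            for ($i = 0; $i + 4 <= strlen($p); $i++) {
                $w = substr($p, $i, 4);
                $counts[$w] = ($counts[$w] ?? 0) + 1;
            }
        }
        $filter = [];
        foreach ($patterns as $p) {
            $best = null; $bestCount = PHP_INT_MAX; $bestPos = 0;
            for ($i = 0; $i + 4 <= strlen($p); $i++) {
                $w = substr($p, $i, 4);
                if ($counts[$w] < $bestCount) {
                    $best = $w; $bestCount = $counts[$w]; $bestPos = $i;
                }
            }
            if ($best !== null) {
                $filter[$best][] = [$p, $bestPos];
            }
        }
        return $filter;
    }

    function scan(string $input, array $filter): array {
        $hits = [];
        for ($i = 0; $i + 4 <= strlen($input); $i++) {
            $w = substr($input, $i, 4);
            if (!isset($filter[$w])) continue;            // cheap "dword" check
            foreach ($filter[$w] as [$p, $off]) {         // full match only after a hit
                $start = $i - $off;
                if ($start >= 0 && substr($input, $start, strlen($p)) === $p) {
                    $hits[$p] = true;
                }
            }
        }
        return array_keys($hits);
    }

    $filter = build_prefilter(['instanceof', 'implements', 'interface']);
    print_r(scan('class Foo implements Bar {}', $filter));   // [implements]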
> boiled each pattern in the trie down to its most unique dword or qword, and only attempted a full match after an initial hit
Oh that's interesting - like a very fast cmp before bothering to check the rest of the branches.
When you say pattern, are you talking about specific letters/morphemes or subsequences? And were you pulling the [dq]words from the initial pattern or generating them some other way (through some kind of hash or encoding function)?
Sorry, should've given more context. In this case the trie was for Aho-Corasick, so we built it from the most unique subsequences. Since there was no need to add or remove patterns at runtime, the subsequences were precomputed and the trie built server-side, then sent serialized down to the application.
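Not the actual pipeline, but the precompute-and-ship shape is roughly this in PHP (JSON is just one possible wire format; assume $trie is the finished structure of nested arrays with string keys):

    // Build side, run offline / server-side: build once, serialize, ship.
    file_put_contents('patterns.trie.json', json_encode($trie));

    // Application side: load the prebuilt structure instead of paying the
    // construction (and subsequence-selection) cost at startup.
    $trie = json_decode(file_get_contents('patterns.trie.json'), true);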
On a different occasion I tried a few kinds of entropy coding, but their value was pretty limited, and obviously not applicable if you need to operate on a stream with Aho. In that case I only needed memory-optimized set membership testing, though, so ended up going with a Golomb-compressed set from pgs. 7-8 of [0].
They can eat memory at extraordinary rates, but I've successfully used them to reduce memory usage as well. If you need to store a mess of strings that tend to have a lot of duplication toward the front - URIs, for example - then a trie might fare pretty well in that department.
A DAWG can be even better, at the expense of considerable setup cost.
Though a naive Trie can be converted to a DAWG incrementally and on-the-fly relatively easily. If you're careful, you can even do this on another thread in the background.
One thing about tries that's really nice is that you can transparently combine nodes to turn it into a DAWG.
Especially as you can do it on-the-fly. Memory usage is getting excessive? Stop and do a suffix combination pass until it's decent again. Otherwise? Don't bother.
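A rough PHP sketch of such a suffix-merging pass (the flat node-table layout is mine, not from any particular implementation): each node is ['end' => bool, 'kids' => [char => nodeId]], and equivalent subtrees collapse onto one canonical id. A real version would also drop the now-unreachable nodes.

    function minimize(array $nodes, int $root): array {
        $registry = [];   // signature -> canonical node id
        $remap    = [];   // old id -> canonical id

        $canon = function (int $id) use (&$canon, &$nodes, &$registry, &$remap): int {
            if (isset($remap[$id])) {
                return $remap[$id];
            }
            $kids = [];
            foreach ($nodes[$id]['kids'] as $ch => $child) {
                $kids[$ch] = $canon($child);
            }
            ksort($kids);
            $sig = ($nodes[$id]['end'] ? '1' : '0') . '|' . json_encode($kids);
            if (!isset($registry[$sig])) {
                $nodes[$id]['kids'] = $kids;
                $registry[$sig] = $id;
            }
            return $remap[$id] = $registry[$sig];
        };

        $newRoot = $canon($root);
        return [$nodes, $newRoot];
    }

    // "tap" and "top" share the suffix "p": after minimize() their two leaf
    // nodes collapse to one, and then their parents collapse in turn.
    $nodes = [
        0 => ['end' => false, 'kids' => ['t' => 1]],
        1 => ['end' => false, 'kids' => ['a' => 2, 'o' => 3]],
        2 => ['end' => false, 'kids' => ['p' => 4]],
        3 => ['end' => false, 'kids' => ['p' => 5]],
        4 => ['end' => true,  'kids' => []],
        5 => ['end' => true,  'kids' => []],
    ];
    [$nodes, $root] = minimize($nodes, 0);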
[From all the bad things I hear about PHP, the code is very readable without any previous experience - nice].
Here are some things a lexer for a programming language might have to deal with:
1. Comments (some even do nested - which means regular expressions are out for that).
2. Continuation lines.
3. Includes (if done at the lexical level).
4. Filename/line/column number for nice error messages (can really hurt with branch mispredictions).
5. Evaluation of literals: decimal/hex/octal/binary integers, floats, strings (with escapes), etc. (a small sketch of the string case follows this list).
6. Identifiers.
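For item 5, a rough PHP sketch of the string-literal case (the escape set here is a tiny made-up subset of what a real language needs):

    // $pos points at the opening double quote.
    function lex_string(string $src, int $pos): array {
        $escapes = ['n' => "\n", 't' => "\t", '"' => '"', '\\' => '\\'];
        $out = '';
        $i = $pos + 1;                        // skip the opening quote
        $n = strlen($src);
        while ($i < $n && $src[$i] !== '"') {
            if ($src[$i] === '\\' && $i + 1 < $n) {
                $c = $src[$i + 1];
                if (!isset($escapes[$c])) {
                    throw new RuntimeException("bad escape \\$c at offset $i");
                }
                $out .= $escapes[$c];          // unescape on the fly
                $i += 2;
            } else {
                $out .= $src[$i];
                $i += 1;
            }
        }
        if ($i >= $n) {
            throw new RuntimeException('unterminated string literal');
        }
        return [$out, $i + 1];                 // value + position after the closing quote
    }

    [$value, $next] = lex_string('"a\\tb" rest', 0);
    echo $value;   // a<TAB>b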
So matching keywords is mostly the straightforward part. However, I have found that matching many keywords is the perfect (and in my experience so far, the only) use case for a perfect-hashing tool like gperf - it would normally be much faster than any pointer-chasing trie. gperf mostly eliminated keyword matching from the profile of any lexer I've done.
Some languages allow escapes before everything else. Looking at you, Java. So either you need to do a pass beforehand to unescape them, or unescape characters (and in the process do error handling / etc!) on-the-fly.
A lexer for a language with a lot of keywords leads to a large representation as an automaton, as the author has experienced. One way to deal with this is to recognise all identifiers, including keywords, with a single rule (something like "[a-z_][a-zA-Z0-9_]*") and to use a hash table of keywords to check whether a match is a keyword (and which one) or an identifier.
Edit: fixed the regexp to allow for single-char identifiers.
> One way to deal with this is to recognise all identifiers, including keywords, with a single rule (something like "[a-z_][a-zA-Z0-9_]+") and to use a hash table of keywords to check whether a match is a keyword or an identifier (and which one).
A lot of production parsers I've seen do this with two state machines. They'll have a top-level state machine (usually hand-implemented using a giant switch statement) that handles all of the main different token types—numbers, strings, punctuators, etc. That has a single type for all identifiers, including keywords.
Then, after it determines a token is an identifier, it runs another tiny, optimized state machine to see if it's a reserved word or not.
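A rough PHP sketch of that two-stage shape (not any particular production lexer's code; the token types and keyword list are made up): a top-level loop that dispatches on the current character and lexes all identifiers with one rule, then a second cheap check that decides keyword vs. plain identifier.

    $keywords = array_fill_keys(['if', 'else', 'while', 'return', 'function'], true);

    function lex(string $src, array $keywords): array {
        $tokens = [];
        $i = 0;
        $n = strlen($src);
        while ($i < $n) {
            $c = $src[$i];
            if (ctype_space($c)) {
                $i++;
            } elseif (ctype_alpha($c) || $c === '_') {
                $start = $i;
                while ($i < $n && (ctype_alnum($src[$i]) || $src[$i] === '_')) {
                    $i++;
                }
                $word = substr($src, $start, $i - $start);
                // second stage: reserved word or plain identifier?
                $tokens[] = [isset($keywords[$word]) ? 'KEYWORD' : 'IDENT', $word];
            } elseif (ctype_digit($c)) {
                $start = $i;
                while ($i < $n && ctype_digit($src[$i])) {
                    $i++;
                }
                $tokens[] = ['NUMBER', substr($src, $start, $i - $start)];
            } else {
                $tokens[] = ['PUNCT', $c];
                $i++;
            }
        }
        return $tokens;
    }

    print_r(lex('if (x1 > 9) return x1;', $keywords));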
So I get that this optimized giant trie might be faster than a regex. But what about a normal lexer, either handwritten or generated? Shouldn't that be faster still than the trie? I mean, that giant amount of memory alone must cause lots of performance issues...
Yes. For a given definition of "word", anyway, but "word" is fiddly enough that we can assume they mean something sensible.
A stricter definition would be "lexers work at the lexical level" for NLP (in which case it's nearly tautologically obvious), or "lexers work at the token level" for programming languages.
(Creds: lexing / wordbreaking is the topic of my master's research in linguistics. Tries are a good starting point for that, but it turns out that they are a special case of a simple morphological analyzer, and you need more complicated ones to do a good job on natural languages.)
A lexer is a fold as well - there are usually fewer elements in the output list than in the input list. A mapping function implies a 1:1 correspondence, because the recursion is abstracted out into map() and the map function can't accumulate any state during the computation. (At least in traditional FP technology; the Mapper in MapReduce isn't actually a true map function, it's more like an unfold.)
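A toy PHP sketch of the "lexer as fold" view (whitespace-separated words only, purely illustrative): fold over the characters, carrying <partial word, tokens emitted so far>, and emit fewer elements than you consume.

    function lex_fold(string $input): array {
        $step = function (array $acc, string $ch): array {
            [$buf, $tokens] = $acc;
            if (ctype_alnum($ch)) {
                return [$buf . $ch, $tokens];      // keep accumulating state
            }
            if ($buf !== '') {
                $tokens[] = $buf;                  // boundary: emit one token
            }
            return ['', $tokens];
        };
        [$buf, $tokens] = array_reduce(str_split($input), $step, ['', []]);
        if ($buf !== '') {
            $tokens[] = $buf;
        }
        return $tokens;
    }

    print_r(lex_fold('int main ( void )'));   // 17 characters in, 3 tokens out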
A lexer is for a regular language (in the formal sense), whereas a parser is for a context-sensitive (or context-free) language.
Or, for a less formal description: a lexer's transformation is one that can be expressed as a function of <<input symbol or end of file>, position in file, state> -> <<some finite number of output symbols (including zero)>, <state or done>>, where state is finite. A lexer can work without backtracking or arbitrary memory usage, whereas a parser potentially requires either arbitrary (but finite) backtracking or arbitrary (but finite) memory usage.
The boundary gets a little fuzzy, though. For instance, constants, at least for any language that allows arbitrary-sized integers.
I guess only if you consider punctuation to be "words". A more accurate description would be "lexers work at the atom level", IOW breaking the input into atoms - the smallest non-breakable lexical units that are used in a grammar.