using System.Collections.Generic;

struct Node<TKey, TValue>
{
    uint bitmap;  // one bit per occupied slot at this level
    KeyValuePair<TKey, TValue>[] entries;
    Node<TKey, TValue>[] children;
    // Node = Leaf of bitmap * entries | Internal of bitmap * children
    // We inline that sum into a struct where
    //   Leaf     -> entries  != null
    //   Internal -> children != null
}
Now consider the case where leaves are somewhat sparse. That means the children arrays will have few entries, so instead of allocating a bunch of one-element children arrays, which in turn point at a bunch of one-element value arrays, just inline the value arrays in the parent. Then propagate that same idea for internal nodes all the way back to the root.
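To make that concrete, here's roughly what lookup looks like over this layout, as a method you'd add to the struct above. This is a sketch from memory with assumed 5-bit hash chunks, not the actual Sasa code:

// Needs using System.Numerics for BitOperations.PopCount (.NET Core 3+).
// Sketch only: assumes 5-bit hash chunks per level and the usual
// popcount trick to compress the bitmap into a dense array index.
public bool ContainsKey(TKey key, int hash, int shift = 0)
{
    uint bit = 1u << ((hash >> shift) & 0x1F);            // slot at this level
    if ((bitmap & bit) == 0)
        return false;                                      // slot is empty
    int i = BitOperations.PopCount(bitmap & (bit - 1));    // dense index
    return entries != null
        ? EqualityComparer<TKey>.Default.Equals(entries[i].Key, key) // leaf
        : children[i].ContainsKey(key, hash, shift + 5);             // internal
}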
Also, because these are all structs, the JIT should specialize the code generated for each level so it should be pretty fast. IIRC, the speed difference between the original and the inlined representation was noticeable but not crazy, but the memory use difference was considerable. This was all years ago though, first implementation was 2013.
I belatedly realized this might require some explanation too. Sorry, too many years of experiments and microoptimizations.
This is an unboxed representation of a sum type. Rather than relying on the CLR's inheritance to model sums I lay it out explicitly at the cost of a little space. It may seem wasteful since it looks like I'm adding 8 unused bytes to each node, but the CLR object header is 16 bytes so you're actually saving 8 bytes per node by avoiding classes and using structs that have no header! You can really see the difference in the working set if you do a full GC.Collect and check the allocated bytes after constructing some large tries.
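For contrast, here's roughly the boxed encoding this replaces (a sketch of the conventional class-based sum, not code from either library):

// Conventional boxed sum: a heap object per node, each carrying its
// own 16-byte CLR header, plus a virtual call to discriminate the case.
abstract class BoxedNode<TKey, TValue>
{
    public uint bitmap;
    public abstract bool Contains(TKey key, int hash, int shift);
}
sealed class Leaf<TKey, TValue> : BoxedNode<TKey, TValue>
{
    public KeyValuePair<TKey, TValue>[] entries;
    public override bool Contains(TKey key, int hash, int shift)
        => /* scan entries for key */ false;
}
sealed class Internal<TKey, TValue> : BoxedNode<TKey, TValue>
{
    public BoxedNode<TKey, TValue>[] children;
    public override bool Contains(TKey key, int hash, int shift)
        => /* popcount index, recurse into children */ false;
}

Every Leaf/Internal instance pays its own header, which is exactly the overhead the struct layout sheds.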
Also, because I'm removing a heap object per node, this representation incurs one less indirection per node accessed, so I'm doing half as many loads as an equivalent class-based representation for every operation. This is a large constant factor difference, and probably partly explains why Sasa's ContainsKey is faster than yours.
Finally, no trie operations incur any virtual dispatch; it's replaced by simple if-checks, which are generally more predictable. This probably has less impact these days since branch predictors are pretty good now.
I think if you replaced your class-based representation that uses the superior CHAMP model with this unboxed representation your hash map would be the uncontested champion. You might even challenge the mutable dictionary for performance!
> You might even challenge the mutable dictionary for performance!
Don't tell me that, I'm an optimisation nut, I might actually do it! ;)
I woke up this morning with a clear understanding of what your optimisation is, so I guess I just needed to load up my brain with the details! It's definitely a super interesting approach and one that I'm surprised I've not come across before.
I'm definitely a bit wary of changing my current implementation, mostly because it's been battle hardened, but I'm sure I won't be able to resist having a go at some point. Thanks for the tip!
Do it! Do it! Do it! If only for a couple of tests to see how much it changes the results. ;-)
Unboxing sums is a nice optimization, but then you can't naively use switch pattern matching to deconstruct them.
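For example, against the class-based sketch above you could write:

// Pattern matching works on the boxed sum (hypothetical helper):
static class NodePatterns
{
    public static string Describe<TKey, TValue>(BoxedNode<TKey, TValue> node) =>
        node switch
        {
            Leaf<TKey, TValue> l     => $"leaf with {l.entries.Length} entries",
            Internal<TKey, TValue> n => $"internal with {n.children.Length} children",
            _                        => "empty",
        };
}
// The unboxed struct has no subtypes to match on, so its "pattern
// match" is the entries != null test from the lookup sketch.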
I have a bunch of other things in Sasa and other libraries you might find useful. I'm not actively working on most of them anymore except for bug fixes. I learned a lot but didn't end up using a lot of these features as much as I'd hoped.
For instance, being able to create open instance delegates in a way that automatically works around the CLR's restriction on such delegates to virtual methods. Some of the concurrency primitives are also interesting, as they implement an efficient atomic read/write protocol for arbitrarily sized types using only volatile reads/writes (i.e. avoiding torn reads), and a sort of LL/SC using only volatile read/write and a single interlocked inc/dec. Also, I added a kind system to CLR reflection to make working with it much easier [2].
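The torn-read protocol is in the same spirit as a seqlock. A minimal single-writer sketch of that general shape (illustrative only, not the library's actual code):

using System.Threading;

// Seqlock-style box: readers never observe a torn T, the single
// writer pays two interlocked increments per write. Sketch only.
sealed class VersionedBox<T> where T : struct
{
    int version;   // odd while a write is in flight
    T value;

    public void Write(T v)                  // single writer assumed
    {
        Interlocked.Increment(ref version); // odd: full fence, readers retry
        value = v;                          // may tear; readers will reject it
        Interlocked.Increment(ref version); // even again: publish
    }

    public T Read()
    {
        while (true)
        {
            int v1 = Volatile.Read(ref version);
            T result = value;
            Thread.MemoryBarrier();         // keep the copy before the re-check
            int v2 = Volatile.Read(ref version);
            if (v1 == v2 && (v1 & 1) == 0)
                return result;              // stable and not mid-write
        }
    }
}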
It seems we're thinking along the same lines for numeric types. I reproduced the Haskell numeric hierarchy [1], but I put that on hold because I was thinking a [Deriving] attribute would eliminate a lot of redundancy.
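The encoding is the usual interface-as-type-class trick with struct instances, roughly like this (hypothetical signatures, not the repo's exact API):

// Type-class-style numerics: Num<A> is a dictionary of operations,
// instances are empty structs so the JIT can specialize call sites,
// just like the trie nodes above.
public interface Num<A>
{
    A Add(A x, A y);
    A Mul(A x, A y);
    A FromInt(int i);
}

public struct NumInt : Num<int>
{
    public int Add(int x, int y) => x + y;
    public int Mul(int x, int y) => x * y;
    public int FromInt(int i) => i;
}

public static class NumOps
{
    // Generic code abstracts over the instance dictionary:
    public static A Square<A, N>(A x) where N : struct, Num<A>
        => default(N).Mul(x, x);
}
// e.g. NumOps.Square<int, NumInt>(5) == 25

A [Deriving] attribute would then generate the NumInt-style instance boilerplate.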
Just FYI, clicking Num<A> on the main GitHub markdown page doesn't jump to its anchor in the markup.
Lots more to see if you're interested! I've played with parser combinators but never liked how they turned out, and settled on a simpler approach that was pretty interesting.