
That’s fair - I’ll try to find time over the weekend to write out some of the equations for the kernel that loads the weights out of the index and does the adaptor ops. It’s inspired by the cross attention in RETRO, but there are some differences for training stability and for use as an adaptor rather than training from scratch.
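
(Not the actual kernel - just to make the idea concrete, here's a minimal PyTorch sketch of a RETRO-style cross-attention adaptor over entries pulled from an index. The zero-initialized gate is my illustrative guess at the kind of stability trick involved, and the retrieval interface is assumed to hand back dense vectors - none of the names here are from the paper.)

    import torch
    import torch.nn as nn

    class RetrievalCrossAttentionAdaptor(nn.Module):
        """Illustrative cross-attention adaptor over retrieved entries.

        Assumptions (not from the paper): retrieved entries arrive as a
        (batch, n_retrieved, d_model) tensor loaded from an external
        index, and a zero-initialized gate keeps the frozen base model's
        behavior unchanged at the start of training.
        """

        def __init__(self, d_model: int, n_heads: int):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)
            # Zero-init gate: the adaptor is a no-op at step 0, so the
            # pretrained model's outputs are preserved early in training.
            self.gate = nn.Parameter(torch.zeros(1))

        def forward(self, hidden: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
            # hidden:    (batch, seq_len, d_model) from the frozen base model
            # retrieved: (batch, n_retrieved, d_model) loaded from the index
            attn_out, _ = self.attn(self.norm(hidden), retrieved, retrieved)
            return hidden + torch.tanh(self.gate) * attn_out

    # Usage sketch: attend over 8 retrieved entries per example.
    adaptor = RetrievalCrossAttentionAdaptor(d_model=512, n_heads=8)
    hidden = torch.randn(2, 16, 512)
    retrieved = torch.randn(2, 8, 512)
    out = adaptor(hidden, retrieved)  # (2, 16, 512)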

I consider that paper an early draft - hot off the press, so to speak - it needs review and editing before we would submit it to a conference. I tend to prefer a few rounds of open review before a final submission these days anyway, so I appreciate the feedback.

I think the main idea should be reproducible - you can repeat the randomization and generalization tests with any LLM and get similar training curves and eval results - it just wouldn’t be efficient.
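
(Again not from the paper, just to show why the randomization tests are reproducible with any LLM: if the facts are synthesized from random strings, the base model cannot already know them, so any recall has to come from the adapted model. A sketch of generating such facts:)

    import random
    import string

    def random_fact(rng: random.Random) -> tuple[str, str]:
        """Pair a made-up entity with a made-up attribute value.

        Random strings guarantee the base model cannot already know the
        answer, so recall must come from the adaptor/index, not pretraining.
        """
        entity = "".join(rng.choices(string.ascii_lowercase, k=8))
        value = "".join(rng.choices(string.digits, k=6))
        return f"The access code for {entity} is {value}.", value

    rng = random.Random(0)
    facts = [random_fact(rng) for _ in range(1000)]
    train_texts = [text for text, _ in facts]
    # Train (or index) the model on train_texts, then query each fact and
    # check whether the generated continuation contains the stored value.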

We have tried it on about 5 real customer use cases with different facts, with good success. Obviously we can’t publish customer data for others to reproduce, which is why we focused on the randomization tests in the paper.

There are also some hyperparameters missing from the appendix, which we will add eventually.



