My understanding is that the attention in all transformer layers is "causal" - that is, the output of a transformer layer for token N depends only on tokens 0 through N.
This means every attention layer can reuse the outputs it previously computed for the same prompt prefix, so it only needs to compute from scratch starting at the first token where the new prompt diverges from the cached one.
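As a sanity check on that claim, here's a minimal NumPy sketch (my own toy example, not anyone's actual implementation) showing that causal single-head attention gives identical outputs for the prefix positions whether or not later tokens are present, which is exactly what makes caching the prefix safe:

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a (seq_len, d_model) input."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask out future positions so token i only attends to tokens 0..i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

prefix = rng.normal(size=(5, d))                      # shared prompt prefix
full = np.vstack([prefix, rng.normal(size=(3, d))])   # prefix + new tokens

out_prefix = causal_attention(prefix, Wq, Wk, Wv)
out_full = causal_attention(full, Wq, Wk, Wv)

# Outputs for the first 5 positions are identical, so they can be cached and reused.
assert np.allclose(out_prefix, out_full[:5])
```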
I had the same question... my guess is you can do a layer-by-layer cache, i.e. a cache in the first layer, then another independent cache in the second layer, and so on.
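Something like the following rough sketch of the "independent cache per layer" idea (the class and names are my own invention, not any particular framework's API): each layer keeps its own K/V arrays, and a forward pass over new tokens only appends to that layer's cache while attending against everything already stored.

```python
import numpy as np

class LayerKVCache:
    """Per-layer key/value cache; one instance per transformer layer."""
    def __init__(self):
        self.keys = None    # (cached_len, d_head)
        self.values = None  # (cached_len, d_head)

    def append(self, new_k, new_v):
        # Append K/V for the newly processed tokens to this layer's cache.
        self.keys = new_k if self.keys is None else np.vstack([self.keys, new_k])
        self.values = new_v if self.values is None else np.vstack([self.values, new_v])
        return self.keys, self.values

# One independent cache per transformer layer.
num_layers = 4
caches = [LayerKVCache() for _ in range(num_layers)]

# When a new prompt shares a prefix with a cached one, only the tokens after
# the shared prefix need to pass through each layer; their K/V get appended
# to that layer's cache, and attention is computed against the full cache.
```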