During training, you have to store a dot product of Q and V that has dimension N_ctx^2.

That's quadratic scaling, no?

Presumably you mean a dot product of Q and K, and no, you do not have to store this: https://arxiv.org/abs/2205.14135
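
For concreteness, here is a rough NumPy sketch of the naive computation and the (n_ctx, n_ctx) score matrix it materializes. Names and shapes are illustrative, not from any particular library:

    import numpy as np

    def naive_attention(Q, K, V):
        """Q, K, V: (n_ctx, d) arrays. Materializes a full (n_ctx, n_ctx)
        score matrix, which is where the quadratic memory comes from."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                   # (n_ctx, n_ctx)
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        return weights @ V                              # (n_ctx, d)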

I mean, sure, you can work around it, but from your own link:

>since the time and memory complexity of self-attention are quadratic in sequence length

Except that in practice this is no longer true, and hasn't been for more than a year. It's not just a workaround, either: FlashAttention is both faster at runtime and uses less memory.
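
To make that concrete, here is a rough NumPy sketch of the core trick: process K/V in blocks with an online softmax, so the full (n_ctx, n_ctx) score matrix is never materialized. This is only the algorithmic skeleton with illustrative names and an arbitrary block size, not the fused kernel from the paper:

    import numpy as np

    def tiled_attention(Q, K, V, block=128):
        """Same output as the naive version, but peak extra memory is
        O(n_ctx * block) rather than O(n_ctx^2)."""
        n, d = Q.shape
        out = np.zeros((n, d))
        m = np.full(n, -np.inf)        # running row-wise max of scores
        l = np.zeros(n)                # running softmax denominator
        for j in range(0, n, block):
            Kj, Vj = K[j:j + block], V[j:j + block]
            s = Q @ Kj.T / np.sqrt(d)              # one (n_ctx, block) tile at a time
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)              # rescale earlier accumulators
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=-1)
            out = out * scale[:, None] + p @ Vj
            m = m_new
        return out / l[:, None]

The FLOP count is still quadratic in n_ctx, which is the sense in which the abstract's sentence remains true; what goes away is the quadratic memory and most of the slow HBM traffic, which is why the fused kernel ends up faster in wall-clock time as well as more memory-efficient.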