During training, you have to store a dot product of Q and V that has dimension N_ctx^2.

That's quadratic scaling, no?

Presumably you mean a dot product of Q and K, and no, you do not have to store this: https://arxiv.org/abs/2205.14135
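
For concreteness, here is a rough NumPy sketch of the naive computation and the (n_ctx, n_ctx) score matrix it materializes. Names and shapes are illustrative, not from any particular library:

    import numpy as np

    def naive_attention(Q, K, V):
        """Q, K, V: (n_ctx, d) arrays. Materializes a full (n_ctx, n_ctx)
        score matrix, which is where the quadratic memory comes from."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                   # (n_ctx, n_ctx)
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        return weights @ V                              # (n_ctx, d)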

I mean, sure, you can work around it, but from your own link:

>since the time and memory complexity of self-attention are quadratic in sequence length

Except that in practice this is no longer true, and hasn't been for more than a year. It's not just a workaround, either: FlashAttention is both faster at runtime and uses less memory.
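
To make that concrete, here is a rough NumPy sketch of the core trick: process K/V in blocks with an online softmax, so the full (n_ctx, n_ctx) score matrix is never materialized. This is only the algorithmic skeleton with illustrative names and an arbitrary block size, not the fused kernel from the paper:

    import numpy as np

    def tiled_attention(Q, K, V, block=128):
        """Same output as the naive version, but peak extra memory is
        O(n_ctx * block) rather than O(n_ctx^2)."""
        n, d = Q.shape
        out = np.zeros((n, d))
        m = np.full(n, -np.inf)        # running row-wise max of scores
        l = np.zeros(n)                # running softmax denominator
        for j in range(0, n, block):
            Kj, Vj = K[j:j + block], V[j:j + block]
            s = Q @ Kj.T / np.sqrt(d)              # one (n_ctx, block) tile at a time
            m_new = np.maximum(m, s.max(axis=-1))
            scale = np.exp(m - m_new)              # rescale earlier accumulators
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=-1)
            out = out * scale[:, None] + p @ Vj
            m = m_new
        return out / l[:, None]

The FLOP count is still quadratic in n_ctx, which is the sense in which the abstract's sentence remains true; what goes away is the quadratic memory and most of the slow HBM traffic, which is why the fused kernel ends up faster in wall-clock time as well as more memory-efficient.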