Thanks for the insight, also for mentioning convolutional LSTM, I wasn't aware such a thing existed.
> Attention is basically a memory module, so if you don't need that it's just a waste of compute resources.
But aren't CNNs also a kind of memory module (i.e., they memorize what leopard skin looks like)? I guess attention is a more sophisticated kind of memory, "more dynamic" so to speak.
Anyway, I'm glad to hear that a transformer architecture isn't totally stupid for my task. I'll look up the literature; there seems to be a bit on this topic.
Yeah, in some sense any layer is a "memory module". More specifically, attention solves the problem of directly correlating two items in a sequence that are very far apart. I'd generally caution against reaching for attention prematurely, though: its cost scales quadratically with sequence length, so you can burn a lot of time and compute without knowing whether it'll help. Stacking conv layers or using recurrence is an easy middle step that, if it helps, can guide you on whether attention could provide even more gains.
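To make the comparison concrete, here's a minimal sketch (assuming PyTorch; the sequence length, feature dimension, and layer choices are made up for illustration) contrasting the two options: a stack of dilated 1-D convolutions, whose receptive field grows with depth, versus a single self-attention layer, which relates any two positions directly but at quadratic cost in sequence length.

```python
import torch
import torch.nn as nn

# Made-up shapes: (batch, time, features)
seq_len, dim = 1024, 64
x = torch.randn(1, seq_len, dim)

# Option 1: stacked dilated convolutions.
# With kernel size 3 and dilation doubling each layer, the receptive
# field grows roughly exponentially with depth while cost stays linear
# in sequence length.
conv_stack = nn.Sequential(
    *[nn.Conv1d(dim, dim, kernel_size=3, padding=2**i, dilation=2**i)
      for i in range(6)]
)
# Conv1d expects (batch, channels, time), so transpose in and out.
y_conv = conv_stack(x.transpose(1, 2)).transpose(1, 2)

# Option 2: a single multi-head self-attention layer.
# Every position attends to every other position directly, so arbitrarily
# distant items are correlated in one step, at O(seq_len^2) cost.
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
y_attn, _ = attn(x, x, x)

print(y_conv.shape, y_attn.shape)  # both torch.Size([1, 1024, 64])
```

The practical takeaway is the middle-step idea above: if the dilated conv stack already captures the long-range structure you care about, attention may not buy you much for its extra cost.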