the recurrent model needs a mechanism to replay past context; there's no need to go quadratic to access all of it. it could replay multiple times to get effects similar to attention.
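rough toy sketch of what i mean by replay (not any particular architecture -- the pass structure, weight names and shapes here are made up for illustration):

```python
import numpy as np

def rnn_pass(tokens, h0, W_h, W_x):
    # one left-to-right recurrent pass: O(T) sequential state updates
    h = h0
    states = []
    for x in tokens:
        h = np.tanh(W_h @ h + W_x @ x)
        states.append(h)
    return np.stack(states), h

def multi_pass_rnn(tokens, n_passes, W_h, W_x):
    # replay the same context n_passes times; each pass starts from the
    # summary state of the previous one, so later passes can "re-read"
    # early tokens in light of what came after. total cost is
    # O(n_passes * T): still linear in sequence length, never quadratic.
    h = np.zeros(W_h.shape[0])
    states = None
    for _ in range(n_passes):
        states, h = rnn_pass(tokens, h, W_h, W_x)
    return states
```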
the hardware lottery, well... imo it's really about leveraging fully parallel training to learn how to use a memory. attention is quadratic, but it can be computed in parallel: it's an end-to-end learned memory. getting that kind of pattern into RNNs won't be easy, but it's going to be crucial before we boil the ocean.
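roughly the trade-off in numpy terms (toy sketch; causal masking and everything else omitted):

```python
import numpy as np

def attention(Q, K, V):
    # scores form a (T, T) matrix: quadratic in sequence length T,
    # but each output row depends only on Q[t] and the whole K/V,
    # so all T rows are computed at once (one big matmul on a GPU)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # (T, T)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V                                # (T, d)

def rnn(X, W_h, W_x):
    # linear in T, but step t needs the state from step t-1:
    # the loop is inherently sequential during training
    h = np.zeros(W_h.shape[0])
    out = []
    for x in X:                                       # cannot parallelize over t
        h = np.tanh(W_h @ h + W_x @ x)
        out.append(h)
    return np.stack(out)
```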
RWKV already solves the parallel-compute problem on GPUs with the changes it has made - so it is a recurrent model that can scale to thousands of GPUs without issue.
Arguably the same holds for other recurrent architectures (state-space models, etc.) with very different design implementations. The issue with old recurrent designs was the way the LSTM was designed, not the recurrent nature itself.
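toy sketch of why a purely linear recurrence h_t = a_t * h_{t-1} + b_t can be trained in parallel (this is the scan structure RWKV / state-space models lean on, not their exact gating; the tanh wrapped around the state in an LSTM is what blocks this trick):

```python
import numpy as np

def sequential_linear_recurrence(a, b):
    # h_t = a_t * h_{t-1} + b_t, computed step by step: O(T) sequential
    h = np.zeros_like(b[0])
    out = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.stack(out)

def parallel_linear_recurrence(a, b):
    # same recurrence unrolled in closed form:
    #   h_t = sum_{j<=t} (prod_{j<i<=t} a_i) * b_j = A_t * sum_{j<=t} b_j / A_j
    # everything reduces to cumulative products/sums, which map onto
    # parallel scan primitives on a GPU -- no step-by-step dependency.
    # (numerically naive; it just shows the structure)
    A = np.cumprod(a, axis=0)                 # A_t = prod_{i<=t} a_i
    return A * np.cumsum(b / A, axis=0)
```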