RWKV already solves the parallel-compute problem on GPUs, based on the design changes it has made - so it is a recurrent model that can scale to thousands of GPUs with no issue.
Arguably the same holds for other recurrent architectures (State Space models, etc.) with very different design implementations. The issue with old recurrent designs was just the way the LSTM was designed, not the recurrent nature itself.
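To make that concrete, here is a minimal sketch (not RWKV's actual WKV kernel - it uses a fixed scalar decay rather than learned per-channel decays) of why a linear recurrence can be evaluated in parallel across the whole sequence, while an LSTM step cannot:

```python
import numpy as np

# Simplified linear recurrence of the kind RWKV-style models use:
#   s_t = decay * s_{t-1} + v_t
# There is no nonlinearity between timesteps (unlike an LSTM's gated
# tanh/sigmoid step), so the sequence unrolls into a closed form that
# can be computed with one parallel operation.

T = 8
decay = 0.9                      # fixed here; RWKV learns decays per channel
v = np.random.randn(T)           # per-step inputs

# Sequential (inference-style) evaluation: O(T) dependent steps.
s_seq = np.zeros(T)
s = 0.0
for t in range(T):
    s = decay * s + v[t]
    s_seq[t] = s

# Parallel (training-style) evaluation: s_t = sum_{i<=t} decay**(t-i) * v_i,
# expressed as a single matrix product with no step-to-step dependency.
idx = np.arange(T)
A = np.tril(decay ** (idx[:, None] - idx[None, :]))  # lower-triangular decay matrix
s_par = A @ v

assert np.allclose(s_seq, s_par)
```

The parallel form is what lets training saturate many GPUs like a Transformer, while the sequential form gives cheap constant-memory inference; the LSTM's gated nonlinear step admits no such closed form, which is the design issue, not recurrence itself.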