the recurrent model needs a mechanism to replay past context; there's no need to go quadratic to access all of it. it could replay multiple times to get effects similar to attention.
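rough toy sketch of what i mean by replay (not any particular architecture -- the pass structure, weight names and shapes here are made up for illustration):

```python
import numpy as np

def rnn_pass(tokens, h0, W_h, W_x):
    # one left-to-right recurrent pass: O(T) sequential state updates
    h = h0
    states = []
    for x in tokens:
        h = np.tanh(W_h @ h + W_x @ x)
        states.append(h)
    return np.stack(states), h

def multi_pass_rnn(tokens, n_passes, W_h, W_x):
    # replay the same context n_passes times; each pass starts from the
    # summary state of the previous one, so later passes can "re-read"
    # early tokens in light of what came after. total cost is
    # O(n_passes * T): still linear in sequence length, never quadratic.
    h = np.zeros(W_h.shape[0])
    states = None
    for _ in range(n_passes):
        states, h = rnn_pass(tokens, h, W_h, W_x)
    return states
```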
the hardware lottery, well... imo it's really about leveraging fully parallel training to learn how to use a memory. attention is quadratic, but it can be computed in parallel: it's an end-to-end learned memory. getting that kind of pattern into RNNs won't be easy, but it's going to be crucial before we boil the ocean.
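roughly the trade-off in numpy terms (toy sketch; causal masking and everything else omitted):

```python
import numpy as np

def attention(Q, K, V):
    # scores form a (T, T) matrix: quadratic in sequence length T,
    # but each output row depends only on Q[t] and the whole K/V,
    # so all T rows are computed at once (one big matmul on a GPU)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # (T, T)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V                                # (T, d)

def rnn(X, W_h, W_x):
    # linear in T, but step t needs the state from step t-1:
    # the loop is inherently sequential during training
    h = np.zeros(W_h.shape[0])
    out = []
    for x in X:                                       # cannot parallelize over t
        h = np.tanh(W_h @ h + W_x @ x)
        out.append(h)
    return np.stack(out)
```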
RWKV already solves the parallel-compute problem on GPUs with the changes it has made - so it is a recurrent model that can scale to thousands of GPUs without issue.
Arguably the same holds for other recurrent architectures (state-space models, etc.) with very different design implementations. The issue with old recurrent designs was the way the LSTM was designed, not the recurrent nature itself.
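toy sketch of why a purely linear recurrence h_t = a_t * h_{t-1} + b_t can be trained in parallel (this is the scan structure RWKV / state-space models lean on, not their exact gating; the tanh wrapped around the state in an LSTM is what blocks this trick):

```python
import numpy as np

def sequential_linear_recurrence(a, b):
    # h_t = a_t * h_{t-1} + b_t, computed step by step: O(T) sequential
    h = np.zeros_like(b[0])
    out = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.stack(out)

def parallel_linear_recurrence(a, b):
    # same recurrence unrolled in closed form:
    #   h_t = sum_{j<=t} (prod_{j<i<=t} a_i) * b_j = A_t * sum_{j<=t} b_j / A_j
    # everything reduces to cumulative products/sums, which map onto
    # parallel scan primitives on a GPU -- no step-by-step dependency.
    # (numerically naive; it just shows the structure)
    A = np.cumprod(a, axis=0)                 # A_t = prod_{i<=t} a_i
    return A * np.cumsum(b / A, axis=0)
```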