Idea for a killer app for recurrent models: low-latency, low-memory LLM / TTS coupling. Start synthesizing speech as soon as new tokens are generated: while the LLM is cranking out token t, the TTS is already working on token t-1, so it never has to wait. By the time the LLM finishes, the TTS is nearly finished too. And with the two models colocated, you save a network call as well.
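A minimal sketch of that pipelining idea, assuming a hypothetical per-token LLM decoder and per-chunk TTS synthesizer (the sleeps stand in for real model latencies; this is not any particular library's API):

```python
import asyncio

async def llm_producer(queue: asyncio.Queue) -> None:
    # Pretend LLM: pushes tokens into the queue as they are decoded.
    for token in ["Hello", " there", ",", " how", " are", " you", "?"]:
        await asyncio.sleep(0.05)   # decoding latency per token (stand-in)
        await queue.put(token)
    await queue.put(None)           # end-of-stream sentinel

async def tts_consumer(queue: asyncio.Queue) -> None:
    # TTS works on token t-1 while the LLM is still decoding token t.
    while True:
        token = await queue.get()
        if token is None:
            break
        await asyncio.sleep(0.03)   # synthesis latency per chunk (stand-in)
        print(f"audio chunk for {token!r} ready")

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    # Both run concurrently in the same process: no network hop between them.
    await asyncio.gather(llm_producer(queue), tts_consumer(queue))

asyncio.run(main())
```

Total wall-clock time is roughly max(LLM time, TTS time) plus one token of lag, instead of their sum.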
Recurrent models with constant hidden state are naturally suited to streaming data, potentially opening the door to unexplored new use cases.
This is actually the hypothesis behind Cartesia (the state space models team), and hence their deep focus on voice models specifically: taking full advantage of recurrent models' constant-time compute to get low latencies.
The RWKV team's focus, however, is still first on the multilingual text space, with the multimodal space to come later.
On a side note, and that's what led me to the link above: I wonder if it would be possible to chain N streaming LLMs in an agent workflow and get a final output stream almost instantaneously, without waiting for the first N-1 LLMs to complete their replies.
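A toy sketch of that chaining idea, assuming each stage can start emitting output before its input stream is complete (hypothetical stage functions; real agent frameworks differ):

```python
from typing import Iterator

def stage(name: str, upstream: Iterator[str]) -> Iterator[str]:
    # Each stage transforms tokens as they arrive, instead of waiting
    # for the full upstream reply.
    for token in upstream:
        yield f"{name}({token})"

def source() -> Iterator[str]:
    for token in ["t1", "t2", "t3"]:
        yield token

pipeline: Iterator[str] = source()
for i in range(3):          # chain N = 3 streaming stages
    pipeline = stage(f"llm{i}", pipeline)

# The first output token appears after roughly N per-token latencies,
# not after N full generations.
for out in pipeline:
    print(out)
```

Whether this works in practice depends on whether each step's task can actually be done token-by-token rather than needing the whole upstream reply first.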
True, but there the memory requirements grow with sequence length, whereas for recurrent models the memory requirement is constant. This is why I qualified with "low memory".