
Idea for a killer app for recurrent models: low-latency, low-memory LLM / TTS coupling. Start decoding / generating speech as soon as new tokens are generated. While the LLM is cranking out token t, the TTS is already working on token t-1; it doesn't have to wait. Then when the LLM is finished, the TTS is nearly finished too. With the two models colocated, you also save a network call.

Recurrent models with constant hidden state are naturally suited to streaming data, potentially opening the door to unexplored new use cases.
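A minimal sketch of the pipelining idea, assuming hypothetical streaming interfaces for both models (llm_stream and tts_stream are placeholder stand-ins, not real APIs):

    # Sketch of LLM -> TTS pipelining: the TTS starts on token t-1
    # while the LLM is still producing token t.
    import queue
    import threading

    def llm_stream(prompt):
        # Placeholder: yield tokens one at a time, as a real streaming LLM would.
        for tok in ("Hello", " there", ",", " this", " is", " streamed", "."):
            yield tok

    def tts_stream(token_queue):
        # Placeholder: consume tokens as they arrive and emit audio chunks.
        while True:
            tok = token_queue.get()
            if tok is None:          # sentinel: LLM is done
                break
            print(f"TTS synthesizing audio for {tok!r}")

    tokens = queue.Queue()
    tts_thread = threading.Thread(target=tts_stream, args=(tokens,))
    tts_thread.start()

    for tok in llm_stream("Say hello"):
        tokens.put(tok)              # hand each token off immediately
    tokens.put(None)                 # signal end of generation
    tts_thread.join()

The point of the queue is that no stage ever blocks on the full output of the previous one, which is exactly what constant-state recurrent models make cheap.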




New multimodal models take raw speech input and provide raw speech output, no tts in the middle.


A relatively detailed description of such systems: https://arxiv.org/abs/2410.00037


Seems like the future - so much meaning and context is lost otherwise.


Very cool. Logical next step. Would be interested to know what the dataset looks like.


Youtube. Youtube is the dataset.


This is actually the hypothesis behind Cartesia (the state space team), and hence their deep focus on voice models specifically: taking full advantage of recurrent models' constant-time compute for low latencies.

The RWKV team's focus, however, is still first on the multi-lingual text space, then on the multi-modal space in the future.


Karan from Cartesia explains SSMs+voice really well: https://www.youtube.com/watch?v=U9DPRZ0lSIQ

it's one of those retrospectively obvious/genius insights that i wish i had understood when i first met him


This can currently already be done using a streaming capable LLM with a streaming input/output TTS model.


Any LLM is "streaming capable".


https://github.com/mit-han-lab/streaming-llm

On a side note, and that's what led me to the link above, I wonder if it would be possible to chain N streaming LLMs in an agent workflow and get a final output stream almost instantaneously, without waiting for the first N-1 LLMs to complete their replies.
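A hypothetical sketch of that chaining, using plain generators (the stage function is a placeholder, not any real agent framework). The caveat is that real LLMs normally need the complete prompt before decoding, so this assumes each stage can usefully act on partial input (e.g. translation or rewriting):

    # Each stage consumes its predecessor's tokens as they arrive,
    # instead of waiting for the full reply.
    def stage(name, upstream):
        for tok in upstream:
            # Placeholder transform; a real stage would feed `tok` to a model.
            yield f"{name}({tok})"

    def source():
        for tok in ("a", "b", "c"):
            yield tok

    pipeline = source()
    for i in range(3):               # chain N=3 stages
        pipeline = stage(f"llm{i}", pipeline)

    for tok in pipeline:
        print(tok)                   # tokens appear as soon as the chain produces them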


Any autoregressive model can do what you are describing. Transformers generate one token at a time too, not all at once.


True but memory requirements grow with sequence length. For recurrent models the memory requirement is constant. This is why I qualified with "low memory".
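For rough intuition, a back-of-the-envelope comparison (illustrative numbers, not any specific model):

    # A transformer's KV cache grows linearly with sequence length,
    # while a recurrent model keeps a fixed-size hidden state.
    layers, heads, head_dim, bytes_per = 32, 32, 128, 2   # assumed fp16 model

    def kv_cache_bytes(seq_len):
        # keys + values, per layer, per head, per position
        return 2 * layers * heads * head_dim * seq_len * bytes_per

    for seq_len in (1_000, 10_000, 100_000):
        print(f"{seq_len:>7} tokens -> {kv_cache_bytes(seq_len) / 1e9:.1f} GB KV cache")
    # A recurrent model's state stays the same size regardless of seq_len.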


yes, but a transformer's per-token generation cost grows with context length (attention over the growing KV cache), while a state space model's per-token cost stays constant



