Hacker News new | past | comments | ask | show | jobs | submit login

New multimodal models take raw speech input and provide raw speech output, no tts in the middle.



A relatively detailed description of such systems: https://arxiv.org/abs/2410.00037


Seems like the future - so much meaning and context is lost otherwise.


Very cool. Logical next step. Would be interested to know what the dataset looks like.


Youtube. Youtube is the dataset.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: