New multimodal models take raw speech input and provide raw speech output, no tt... | Hacker News

Hacker News new | past | comments | ask | show | jobs | submit

login

computerex 24 days ago | parent | context | favorite | on: RWKV Language Model

New multimodal models take raw speech input and provide raw speech output, no tts in the middle.

benob 24 days ago | [–]

A relatively detailed description of such systems: https://arxiv.org/abs/2410.00037

Closi 24 days ago | | [–]

Seems like the future - so much meaning and context is lost otherwise.

intalentive 24 days ago | [–]

Very cool. Logical next step. Would be interested to know what the dataset looks like.

moffkalast 24 days ago | [–]

Youtube. Youtube is the dataset.

Consider applying for YC's Spring batch! Applications are open till Feb 11.
Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact