Hacker News new | past | comments | ask | show | jobs | submit login

Has there been progress towards making RWKV multimodal? Can be use projector layers to send images to RWKV?



There is work done for Vision RWKV, and audio RWKV, an example paper is here: https://arxiv.org/abs/2403.02308

Its the same principle as open transformer models where an adapter is used to generate the embedding

However currently the core team focus is in scaling the core text model, as this would be the key performance driver, before adapting multi-modal.

The tech is there, the base model needs to be better




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: