It's the same principle as open transformer models, where an adapter is used to generate the embeddings (rough sketch below).
However, the core team's focus right now is on scaling the core text model, since that is the key performance driver, before adding multi-modal support.
The tech is there; the base model needs to be better.
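
For context, here is a minimal sketch of the adapter idea, assuming a LLaVA-style setup where a small projection maps frozen vision-encoder features into the text model's embedding space. The class and dimensions here are hypothetical, just to illustrate the principle:

```python
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """Hypothetical adapter: projects vision-encoder features into the text model's embedding space."""
    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim)
        # returns soft "image tokens": (batch, num_patches, text_dim)
        return self.proj(vision_features)

# Example dimensions (assumed, not the actual model's):
adapter = VisionAdapter(vision_dim=1024, text_dim=4096)
image_tokens = adapter(torch.randn(1, 256, 1024))   # -> (1, 256, 4096)
text_tokens = torch.randn(1, 32, 4096)              # embedded prompt tokens
# The projected image tokens are concatenated with the text token embeddings
# and fed to the unchanged text model.
inputs_embeds = torch.cat([image_tokens, text_tokens], dim=1)
```

The key point is that the text model itself stays untouched; only the adapter needs training, which is why a stronger base text model carries most of the eventual multi-modal quality.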