This is not at all how this works. There's no separate model. Yes, there's a distinct tokenization step, if not for the video as a whole then for each frame. The whole video is ~1800 tokens because Gemini gets video as a series of images in context at 1 frame per second. Each image is about 258 tokens, because a token in image-transformer terms is literally a patch of the image.
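Back-of-envelope on those numbers (the 7-second clip length below is just an assumption picked to line up with the ~1800 figure):

```python
# Rough arithmetic only, based on the figures quoted above:
# ~258 tokens per frame, video sampled at 1 frame per second.
tokens_per_frame = 258
frames_per_second = 1
clip_seconds = 7  # assumed clip length, not from the API

total = clip_seconds * frames_per_second * tokens_per_frame
print(total)  # 1806 -> roughly the "~1800 tokens" for the whole video
```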
You can literally convert the tokens returned from a video to text. What do you even think tokens are?
Like, seriously: before you write another word on this, feel free to call the API and retrieve the tokens for a video or image. Then go through the supposedly magical process of converting those tokens back to their text form. It isn't some hyper-dimensional, inside-out spatial encoding that yields impossible compression.
The process is obvious and logical if you actually think it through.
>Each image is about 258 tokens
Because Google set that as the "budget" and truncates accordingly. Again, call the API with an image or video and then convert those tokens to text.
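If anyone actually wants to check the counts, here's a minimal sketch assuming the google-generativeai Python SDK; the API key, model name, and file path are placeholders:

```python
# Sketch only: count_tokens reports how many context-window tokens an input
# occupies. Nothing here is specific advice about which model to use.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

img = Image.open("frame.png")
resp = model.count_tokens(["What is in this image?", img])
print(resp.total_tokens)  # compare against the per-image figure discussed above

# A video uploaded with genai.upload_file(...) can be passed the same way
# once the file has finished processing.
```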
>You can literally convert the tokens returned from a video to text. What do you even think tokens are?
Tokens are patches of each image.
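To make "tokens are patches" concrete, here's a toy patchification; the image size and patch size are invented so the grid lands near the per-image count mentioned above, and they are not Gemini's actual preprocessing:

```python
import numpy as np

# Toy example: split an image into non-overlapping P x P patches, the way a
# ViT-style encoder turns an image into a sequence of patch "tokens".
# H, W and P are made-up numbers chosen so the grid works out to
# 16 x 16 = 256 patches; the real resolution, patch size, and any special
# tokens aren't stated in this thread.
H = W = 896
P = 56
img = np.random.rand(H, W, 3)

patches = (img.reshape(H // P, P, W // P, P, 3)
              .swapaxes(1, 2)
              .reshape(-1, P * P * 3))
print(patches.shape)  # (256, 9408): 256 patch "tokens", each a flattened pixel block
```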
It's amazing to me how people will confidently spout utter nonsense. It only takes a look at the technical report for the Gemini models to see that you're completely wrong.
>The visual encoding of Gemini models is inspired by our own foundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al., 2022), with the important distinction that the models are multimodal from the beginning and can natively output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b).
>It's amazing to me how people will confidently spout utter nonsense.
Ok.
You seem to be conflating some things, which became evident when you suddenly dropped the ViT paper as evidence. During the analysis of images, tiles and transformers (such as a ViT) are used. That is the model that processes the image to obtain useful information, such as for OCR (you might notice that word is used repeatedly in the Google paper).
But to actually use the image, context has to be drawn from it. This is pretty bog standard OCR, object detection and classification, sentiment analysis, etc. This yields tokens.
Have you called the API and generated tokens from an image yet? Try it. You'll find they aren't as magical and mysterious as you believe, and your quasi-understanding of a ViT is not relevant to the tokens retrieved from a multimodal LLM.
There is the notion of semantic image tokens, which is an internal property of the image-analysis engine (and, conversely, the generation engine), but that is not what we're talking about. If an image were somehow collapsed into a 16x16 array of integers and it could still tell you the words on books and the objects that appear, that would be amazing. Too amazing.
>But to actually use the image, context has to be drawn from it. This is pretty bog standard OCR, object detection and classification, sentiment analysis, etc. This yields tokens
None of that is necessary for an Autoregressive Transformer. You can train the transformer to predict text tokens given interleaved image and text input tokens in the context window.
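Here's a minimal PyTorch sketch of that idea; the shapes, the linear projection of patches into the text embedding space, and the tiny decoder are all invented for illustration and are not Gemini's architecture:

```python
import torch
import torch.nn as nn

# Minimal sketch: image patches are projected into the same embedding space as
# text tokens, the two sequences are concatenated in the context window, and a
# causal transformer is trained to predict the next *text* token. All sizes are
# made up; no OCR/object-detection stage appears anywhere.
VOCAB, D, N_PATCHES, PATCH_DIM, SEQ = 32000, 512, 256, 3 * 16 * 16, 64

text_embed = nn.Embedding(VOCAB, D)
patch_embed = nn.Linear(PATCH_DIM, D)        # "image tokens" = projected patches
decoder = nn.TransformerEncoder(             # encoder layers + causal mask ~= decoder-only
    nn.TransformerEncoderLayer(D, nhead=8, batch_first=True), num_layers=2
)
lm_head = nn.Linear(D, VOCAB)

patches = torch.randn(1, N_PATCHES, PATCH_DIM)   # one already-patchified image
text_ids = torch.randint(0, VOCAB, (1, SEQ))     # text that follows the image

x = torch.cat([patch_embed(patches), text_embed(text_ids)], dim=1)
mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
h = decoder(x, mask=mask)

# Next-token loss only on the text positions that follow the image.
logits = lm_head(h[:, N_PATCHES:-1, :])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), text_ids[:, 1:].reshape(-1)
)
loss.backward()
print(loss.item())
```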
Google has already told us how this works. Read the Flamingo or PaLI papers. You are wrong. Very wrong.
It's incredible that people will crucify LLMs for "hallucinating" but then there are humans like you running around.
https://arxiv.org/abs/2010.11929