A token is a single integer from a dictionary of a given model's vocabulary (e.g. GPT-4 has a vocab of ~100k different tokens, Gemma has ~256k).

You are discussing embeddings, which are a deeper, different element of models.

https://platform.openai.com/tokenizer
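The same thing in code, if you want to poke at it locally: a minimal sketch using OpenAI's tiktoken library (cl100k_base is the GPT-4 encoding; the sample string is arbitrary).

    import tiktoken

    # cl100k_base is the encoding GPT-4 uses; its vocabulary has ~100k entries.
    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.n_vocab)           # ~100k

    # Encoding turns text into a sequence of integers from that vocabulary.
    tokens = enc.encode("A token is a single integer.")
    print(tokens)                # a short list of ints
    print(enc.decode(tokens))    # round-trips back to the original string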

In the given example the video was condensed to a sequence of 258 tokens, and clearly it was a very minimalist, almost-entirely-OCR extraction from the video.




Yeah, but we're not talking about LLMs here but about vision transformers, which don't use the same kind of token vocabulary that LLMs do to produce embeddings from the input. Pixel data is much denser, per token, than a few characters of text.

I looked it up: the original ViT models directly projected, for example, 16x16 pixel patches into 768-dimensional "tokens". So a 224x224 image ended up as 14*14 = 196 "tokens", each of which is a 768-dimensional vector. The positional encoding is just added to this vector.

This blog-post has the specific numbers, which makes it a bit less abstract than in the original paper: https://amaarora.github.io/posts/2021-01-18-ViT.html
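If it helps, here is that arithmetic as a minimal PyTorch sketch (my own illustration of the ViT-Base numbers above, not code from the paper; the Conv2d-with-stride trick is just a common way to split and project patches in one step):

    import torch
    import torch.nn as nn

    patch_size, embed_dim = 16, 768
    img = torch.randn(1, 3, 224, 224)        # (batch, channels, height, width)

    # Kernel = stride = patch size: splits the image into 16x16 patches and
    # linearly projects each one to a 768-dimensional vector in one step.
    to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
    patches = to_patches(img)                     # (1, 768, 14, 14)
    patches = patches.flatten(2).transpose(1, 2)  # (1, 196, 768)

    # Learned positional encodings are simply added to each patch vector.
    pos_embed = nn.Parameter(torch.zeros(1, patches.shape[1], embed_dim))
    out = patches + pos_embed
    print(out.shape)                              # torch.Size([1, 196, 768])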


>Yeah but we're not talking about LLMs here but vision transformers

We ultimately are. Gemini is a multimodal model whose core is an LLM. This doesn't mean that everything flows through the same pathway -- different modalities have different paths -- but eventually there is a fusion step through which a common representation appears. It's where the worlds combine. The parlance for that is often tokens, though it obviously depends upon the architecture, and we simply don't have those details for Gemini (the paper is extremely superficial). The fact that it will ingest massive videos and then answer arbitrary queries on them post facto is a good clue, however.

>This blog-post has the specific numbers

It's a great link and an enjoyable read, and while the ViT plays a critical role in virtually all image analysis pipelines, including in Gemini where it is part of the OCR, object detection, etc., the numbers you are referring to do not map to tokens.

E.g. the 768 dimensions are nothing more than the underlying image data for the tile: 16x16 pixels x 3 channels. I'm unaware of any ViT resources that refer to those vectors (vectorized because that's the form GPUs like) as tokens. This system could lazily reuse the term, but the way processing happens in ViTs would make that a completely irrational overlap of terms.
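For concreteness, the arithmetic behind that claim is just flattening (a trivial sketch; the tile values are made up):

    import numpy as np

    # A 16x16 RGB tile flattens to exactly 16*16*3 = 768 numbers.
    tile = np.random.rand(16, 16, 3)
    print(tile.reshape(-1).shape)    # (768,)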

The role that a token plays in that description is the classifier -- basically the output that classifies each tile.

Ultimately the number of tokens that Google or OpenAI assign to processing an image or video is a billing artifact, because tokens are the unit by which usage is billed. However, you can ask these systems for the tokens representing an image, and they will be exactly what one would expect. Indeed, the brilliance of image (and thus video) analysis in these multimodal systems is not nearly as deep as first glances might suggest, and often they derive nothing more than the most obvious classifications, e.g. classifications made without knowing anything about what the user specifically wants. They are usually fantastic at things like OCR, which happens to be a very common need.

These systems obviously support different usage patterns. I can do simultaneous processing, where the image and the command work in concert and the image analysis deep-dives on specifically those elements that are wanted (but that would otherwise be ignored). Or I can do the classic flow of feeding a video or an image and then asking questions, where the dominant mode is to tokenize the video or images through the common pipeline (OCR, object detection, etc.), create a token narrative, and then answer the question from that narrative.



