If GPT-4-Vision supported function calling/structured data for guaranteed JSON output, that would be nice though.
There's shenanigans you can do with ffmpeg to output every-other-frame to halve the costs too. The OpenAI demo passes every 50th frame of a ~600 frame video (20s at 30fps).
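For example, here's a minimal sketch of that frame-skipping using ffmpeg's select filter (file names are placeholders; the every-50th stride matches the OpenAI demo, and dropping it to 2 gives you every-other-frame):

```python
import subprocess

# Keep one frame out of every 50: the select filter passes only frames
# whose index n is a multiple of 50, and -vsync vfr drops the rest
# instead of duplicating frames to preserve the original frame rate.
subprocess.run([
    "ffmpeg", "-i", "clip.mp4",
    "-vf", "select='not(mod(n,50))'",
    "-vsync", "vfr",
    "frame_%04d.jpg",
], check=True)
```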
EDIT: As noted in discussions below, Gemini 1.5 appears to take 1 frame every second as input.
So 3-4 mins at 1 FPS means you are using about 500 to 700 tokens per image, which suggests you are using `detail: high` with something like 1080p frames fed to gpt-4-vision-preview (unless you have another private endpoint).
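For what it's worth, that range falls out of dividing the model's context window across the sampled frames; a quick back-of-the-envelope check (assuming the full 128k-token window of gpt-4-vision-preview is spent on frames):

```python
# Assumption: the entire 128k-token context goes to frames sampled at 1 FPS.
CONTEXT = 128_000
for minutes in (3, 4):
    frames = minutes * 60  # 1 FPS
    print(f"{minutes} min -> {frames} frames -> ~{CONTEXT // frames} tokens/frame")
# 3 min -> 180 frames -> ~711 tokens/frame
# 4 min -> 240 frames -> ~533 tokens/frame
```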
Gemini 1.5 Pro uses about 258 tokens per frame (2.8M tokens for 10,856 frames).
The number of tokens used for videos - 1,841 for my 7s video, 6,049 for 22s - suggests to me that this is a much more efficient way of processing content than individual frames.
For structured data extraction I also like not having to run pseudo-OCR on hundreds of frames and then combine the results myself.
"Gemini 1.5 Pro can also reason across up to 1 hour of video. When you attach a video, Google AI Studio breaks it down into thousands of frames (without audio),..."
But it's very likely individual frames at 1 frame/s
"Figure 5 | When prompted with a 45 minute Buster Keaton movie “Sherlock Jr." (1924) (2,674 frames
at 1FPS, 684k tokens), Gemini 1.5 Pro retrieves and extracts textual information from a specific frame
in and provides the corresponding timestamp. At bottom right, the model identifies a scene in the
movie from a hand-drawn sketch."
Despite that being in their blog post, I'm skeptical. I tried uploading a single frame of the video as an image and it consumed 258 tokens. The 7s video was 1,841 tokens.
I think it's more complicated than just "split the video into frames and process those" - otherwise I would expect the token count for the video to be much higher than that.
UPDATE ... posted that before you edited your post to link to the Gemini 1.5 report.
684,000 (total tokens for the movie) / 2,674 (their frame count for that movie) = 256 tokens - which is about the same as my 258 tokens for a single image. So I think you're right - it really does just split the video into frames and process them as separate images.
The model is fed individual frames from the movie, BUT the movie is segmented into scenes. These scenes are held in context, 5-10 at a time depending on their length. If the video exceeds a specific length, or rather a threshold number of scenes, it creates an index and a summary. So yes, technically the model looks at individual frames, but there's a bit more tooling behind it.
> The model processes videos as non-contiguous image frames from the video. Audio isn't included. If you notice the model missing some content from the video, try making the video shorter so that the model captures a greater portion of the video content.
> Only information in the first 2 minutes is processed.
> Each video accounts for 1,032 tokens.
That last point is weird, because there is no way every video would be a fixed number of tokens, and I suspect it's a typo. The value is exactly 4x the number of tokens for an image input to Gemini (4 × 258 = 1,032), which may be a hint about the implementation.
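One way to probe the real accounting is to ask the API to count tokens for an uploaded clip. A minimal sketch using the google-generativeai Python SDK (the model name and file-handling calls reflect the 1.5 preview; treat them as illustrative):

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload via the File API and wait for server-side processing to finish.
video_file = genai.upload_file(path="clip.mp4")
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

model = genai.GenerativeModel("gemini-1.5-pro")
# Compare the reported total against 258 tokens x (video duration in seconds).
print(model.count_tokens([video_file]))
```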
Given how video is compressed (usually key frames plus a series of diffs), perhaps there's some internal optimization leveraging that (key frame: a bunch of tokens; diff frames: far fewer).
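Purely as an illustration of that key-frame idea (not a claim about Gemini's internals), you can make ffmpeg decode only the keyframes client-side:

```python
import subprocess

# -skip_frame nokey (a decoder option, so it goes before -i) discards
# everything except I-frames; the diff (P/B) frames are never decoded.
subprocess.run([
    "ffmpeg", "-skip_frame", "nokey", "-i", "clip.mp4",
    "-vsync", "vfr",
    "keyframe_%04d.jpg",
], check=True)
```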
It doesn’t appear to be using the sound from the video, but elsewhere in the report for Gemini 1.5 pro it mentions it can handle sound directly as an input, without first transcribing it to text (including a chart that makes the point it’s much more accurate than transcribing text with whisper and then querying it using GPT-4).
But I don’t think it went into detail about how exactly that works, and I’m not sure if the API/front end has a good way to handle that.