If GPT-4-Vision supported function calling/structured data for guaranteed JSON output, that would be nice though.
There's shenanigans you can do with ffmpeg to output every-other-frame to halve the costs too. The OpenAI demo passes every 50th frame of a ~600 frame video (20s at 30fps).
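For example, here's a minimal sketch of that frame-skipping using ffmpeg's select filter (file names are placeholders; the every-50th stride matches the OpenAI demo, and dropping it to 2 gives you every-other-frame):

```python
import subprocess

# Keep one frame out of every 50: the select filter passes only frames
# whose index n is a multiple of 50, and -vsync vfr drops the rest
# instead of duplicating frames to preserve the original frame rate.
subprocess.run([
    "ffmpeg", "-i", "clip.mp4",
    "-vf", "select='not(mod(n,50))'",
    "-vsync", "vfr",
    "frame_%04d.jpg",
], check=True)
```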
EDIT: As noted in discussions below, Gemini 1.5 appears to take 1 frame every second as input.
So 3-4 mins at 1 FPS means you are using about 500 to 700 tokens per image, which suggests you are using `detail: high` with something like 1080p frames fed to gpt-4-vision-preview (unless you have another private endpoint).
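For what it's worth, that range falls out of dividing the model's context window across the sampled frames; a quick back-of-the-envelope check (assuming the full 128k-token window of gpt-4-vision-preview is spent on frames):

```python
# Assumption: the entire 128k-token context goes to frames sampled at 1 FPS.
CONTEXT = 128_000
for minutes in (3, 4):
    frames = minutes * 60  # 1 FPS
    print(f"{minutes} min -> {frames} frames -> ~{CONTEXT // frames} tokens/frame")
# 3 min -> 180 frames -> ~711 tokens/frame
# 4 min -> 240 frames -> ~533 tokens/frame
```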
Gemini 1.5 Pro uses about 258 tokens per frame (2.8M tokens for 10,856 frames).
The number of tokens used for videos - 1,841 for my 7s video, 6,049 for 22s - suggests to me that this is a much more efficient way of processing content than individual frames.
For structured data extraction I also like not having to run pseudo-OCR on hundreds of frames and then combine the results myself.
"Gemini 1.5 Pro can also reason across up to 1 hour of video. When you attach a video, Google AI Studio breaks it down into thousands of frames (without audio),..."
But it's very likely individual frames at 1 frame/s
"Figure 5 | When prompted with a 45 minute Buster Keaton movie “Sherlock Jr." (1924) (2,674 frames
at 1FPS, 684k tokens), Gemini 1.5 Pro retrieves and extracts textual information from a specific frame
in and provides the corresponding timestamp. At bottom right, the model identifies a scene in the
movie from a hand-drawn sketch."
Despite that being in their blog post, I'm skeptical. I tried uploading a single frame of the video as an image and it consumed 258 tokens. The 7s video was 1,841 tokens.
I think it's more complicated than just "split the video into frames and process those" - otherwise I would expect the token count for the video to be much higher than that.
UPDATE ... posted that before you edited your post to link to the Gemini 1.5 report.
684,000 (total tokens for the movie) / 2,674 (their frame count for that movie) = 256 tokens - which is about the same as my 258 tokens for a single image. So I think you're right - it really does just split the video into frames and process them as separate images.
The model is fed individual frames from the movie, BUT the movie is segmented into scenes. These scenes are held in context, 5-10 at a time depending on their length. If the video exceeds a specific length, or rather a threshold number of scenes, it creates an index and a summary. So yes, technically the model looks at individual frames, but there's a bit more tooling behind it.
> The model processes videos as non-contiguous image frames from the video. Audio isn't included. If you notice the model missing some content from the video, try making the video shorter so that the model captures a greater portion of the video content.
> Only information in the first 2 minutes is processed.
> Each video accounts for 1,032 tokens.
That last point is weird, because there is no way every video would be a fixed number of tokens, and I suspect it's a typo. The value is exactly 4x the number of tokens for an image input to Gemini (4 × 258 = 1,032), which may be a hint about the implementation.
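One way to probe the real accounting is to ask the API to count tokens for an uploaded clip. A minimal sketch using the google-generativeai Python SDK (the model name and file-handling calls reflect the 1.5 preview; treat them as illustrative):

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload via the File API and wait for server-side processing to finish.
video_file = genai.upload_file(path="clip.mp4")
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

model = genai.GenerativeModel("gemini-1.5-pro")
# Compare the reported total against 258 tokens x (video duration in seconds).
print(model.count_tokens([video_file]))
```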
Given how video is compressed (usually key frames plus a series of diffs), perhaps there's some internal optimization leveraging that (key frame: a bunch of tokens; diff frames: far fewer).
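Purely as an illustration of that key-frame idea (not a claim about Gemini's internals), you can make ffmpeg decode only the keyframes client-side:

```python
import subprocess

# -skip_frame nokey (a decoder option, so it goes before -i) discards
# everything except I-frames; the diff (P/B) frames are never decoded.
subprocess.run([
    "ffmpeg", "-skip_frame", "nokey", "-i", "clip.mp4",
    "-vsync", "vfr",
    "keyframe_%04d.jpg",
], check=True)
```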
It doesn’t appear to be using the sound from the video, but elsewhere in the report for Gemini 1.5 pro it mentions it can handle sound directly as an input, without first transcribing it to text (including a chart that makes the point it’s much more accurate than transcribing text with whisper and then querying it using GPT-4).
But I don’t think it went into detail about how exactly that works, and I’m not sure if the API/front end has a good way to handle that.