Image tokens =/ Text tokens. Image tokens are patches of the image. Each image i...

llm_nerd · on Feb 22, 2024

Completely wrong.

Well, aside from the edited-in bit about OCR. Of course there isn't a separate run to do OCR, because that was literally the first step during image analysis. You know, before the conversion to simple tokens.

og_kalu · on Feb 22, 2024

There's no OCR run at all, first step or not.

And you have no idea what you're talking about.

llm_nerd · on Feb 22, 2024

You understand that OCR is the process of extracting text from images, right? You know, such as what Gemini does, and which they reference repeatedly in their paper. I have absolutely no idea why you repeatedly make some bizarre distinction about it being a "separate process".

Okay, it's been fun talking to you but feel free to have the last word. Good luck.

og_kalu · on Feb 22, 2024

The transformer (Gemini) predicts text with image and text in the context window. That's it.

OCR, object detection, etc. all come from the transformer predicting text. Read the Flamingo paper.
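
To make "image tokens are patches" concrete, here is a rough sketch of the idea, assuming ViT-style non-overlapping patches. It is not Gemini's or Flamingo's actual pipeline; `patchify` and `build_context` are hypothetical names, and a real model would project each patch to an embedding rather than keep raw pixel values. The point is only that no text is extracted up front: the patches go into the same sequence as the text tokens.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened, non-overlapping
    patch x patch blocks, in row-major order."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            patches.append(block)
    return patches


def build_context(image, text_token_ids, patch=2):
    """Hypothetical helper: put image patches and text tokens in one sequence.
    Nothing here reads text out of the image; any 'OCR' would have to come
    from the model predicting text given this mixed context."""
    image_tokens = [("img", p) for p in patchify(image, patch)]
    text_tokens = [("txt", t) for t in text_token_ids]
    return image_tokens + text_tokens


# Toy 4x4 "image" with pixel values 0..15, plus three made-up text token ids.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
context = build_context(image, [101, 102, 103], patch=2)
print(len(context))   # 4 image patches + 3 text tokens
print(context[0])     # first patch: top-left 2x2 block
```

With a 4x4 image and 2x2 patches you get four image tokens followed by the three text tokens, all in one context window.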