
Image tokens != text tokens.

Image tokens are patches of the image. Each image is divided into ~256 parts. Those parts are the tokens.
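
To make the "patches are tokens" point concrete, here is a minimal sketch of splitting an image into 256 patches. The 16x16 grid, the `patchify` name, and the toy 32x32 image are all my assumptions for illustration; this is not Gemini's actual code, and real models additionally pass each flattened patch through a learned projection to get an embedding.

```python
def patchify(image, grid=16):
    """Split an H x W image (a list of rows) into grid*grid non-overlapping patches.

    Returns flattened patches in row-major order.
    Hypothetical helper: assumes H and W are divisible by `grid`.
    """
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image size must be divisible by grid")
    ph, pw = height // grid, width // grid  # patch height and width in pixels
    patches = []
    for gy in range(grid):
        for gx in range(grid):
            # Collect the pixels of one patch, row by row.
            patch = [
                image[gy * ph + y][gx * pw + x]
                for y in range(ph)
                for x in range(pw)
            ]
            patches.append(patch)
    return patches


# Toy 32x32 grayscale "image": pixel value = row * 32 + col.
image = [[r * 32 + c for c in range(32)] for r in range(32)]
tokens = patchify(image, grid=16)
print(len(tokens), len(tokens[0]))  # number of patches, pixels per patch
```

Each entry of `tokens` is one patch, i.e. one image token before embedding. Nothing in this step reads or extracts text.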

There's no separate OCR pass.




Completely wrong.

Well, aside from the edited-in bit about OCR. Of course there isn't a separate OCR pass, because that was literally the first step during image analysis. You know, before the conversion to simple tokens.


There's no OCR pass at all, first step or not.

And you have no idea what you're talking about.


You understand that OCR is the process of extracting text from images, right? You know, which is what Gemini does, and what they reference repeatedly in their paper. I have absolutely no idea why you keep making some bizarre distinction about it being a "separate process".

Okay, it's been fun talking to you, but feel free to have the last word. Good luck.


The transformer (Gemini) predicts text with image and text in the context window. That's it.

OCR, object detection, etc. all come from the transformer predicting text. Read the Flamingo paper.



