
Fantastic work is emerging in this field, and with the new release of the schnell model in the FLUX series we will have the downstream captioning datasets we need to produce a new SOTA vision model, which has been the last straggler among the various open LLM augmentations. Most vision models are still built on ancient CLIP/BLIP captioning, and even with something like LLaVA or the remarkable Phi-LLaVA we are still held back by the pretrained vision components, which have been needing love for some months now.

Tesseract plus an LLM is a good pipe; it's likely what produced schnell's captions, and soon we will see the reverse of this configuration, with OCR used for testing and checking while the LLM does the bulk of transcription via vision modality adaptation. The fun part is that multilingual models will be able to read and translate, opening up new work for scholars searching through digitized works. I have already had success in this area with no development at all, and once we get our next SOTA vision models I am expecting a massive jump in quality. I expect English vision model adapters to show up first, using the LLaVA architecture; that may put some other Latin-script languages into the readable category depending on the adapted model, but we could also see a leapfrog of scripts becoming readable all at once. LLaVA-Phi-3 already seems to be able to transcribe tiny pieces of Hebrew with relative consistency. It also has horrible hallucinations, so there is very much an unknown limiting factor here at the moment. I was planning some segmentation experiments, but schnell knocked that out of my hands like a bar of soap in a prison shower, so I will be waiting for a distilled captioning SOTA before I re-evaluate this area.
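
For anyone curious, a minimal sketch of that pipe in Python, assuming pytesseract for the OCR pass and any OpenAI-compatible chat endpoint for the LLM cleanup (the model name, prompt, and file name are placeholders of mine, not anything specific from above):

    # Rough OCR with Tesseract, then an LLM pass to correct the transcription.
    from PIL import Image
    import pytesseract
    from openai import OpenAI

    client = OpenAI()  # assumes an API key or a local OpenAI-compatible server

    def ocr_then_correct(image_path: str, model: str = "gpt-4o-mini") -> str:
        # Step 1: rough transcription with classical OCR.
        raw_text = pytesseract.image_to_string(Image.open(image_path))
        # Step 2: have the LLM fix obvious OCR errors without adding content.
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "You correct OCR output. Fix character and spacing "
                            "errors, but do not add, remove, or translate content."},
                {"role": "user", "content": raw_text},
            ],
        )
        return response.choices[0].message.content

    print(ocr_then_correct("scanned_page.png"))

Reversing the configuration, as described above, would mean the vision LLM reads the image directly and the Tesseract output is only kept around to sanity-check the result.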

Exciting times!



Is LLaVA-Phi better than Phi Vision?

edit: I think the parent just doesn't know about Phi Vision; it appears to be the better model



