
I see tesseract mentioned more and more.

I tried it myself probably 10-15 years ago on scanned scientific papers (decent scan quality). The results were disappointing: the manual postprocessing required was not much less work than typing the text in directly. So tesseract became a synonym for "not worth trying" to me.

Maybe things have improved over the years, so I should give it a new try. (No particular use case at the moment, but those tend to appear occasionally.)




It’s good now _if_ you OCR only scanned documents or otherwise have a lot of control over how you prepare the images before they’re OCR’ed. For more general-purpose recognition with weird fonts and bad image quality, EasyOCR gave me much better results.


This project includes Tesseract 4.1.1, which is at least a couple of years old.


Try https://github.com/ocrmypdf/OCRmyPDF - it uses Tesseract behind the scenes and it is absolutely brilliant.


It's way better now. I used it 15 years ago and had to do quite a bit of preprocessing to get not-entirely-terrible results, but now I use it with great success and no preprocessing.


The first time I used it, 3 to 4 years ago, it was good.



