
Under the hood, Tika uses Tesseract for OCR. To be clear, this all works surprisingly well in general, it's pretty easy to run yourself, and it's an order of magnitude cheaper than most services out there.

https://tesseract-ocr.github.io/tessdoc/
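If you want to try the OCR step on its own, here is a minimal sketch that shells out to the Tesseract command-line tool directly rather than going through Tika. It assumes the `tesseract` binary and the relevant language data are installed and on your PATH; the function names `build_tesseract_command` and `ocr_image` are made up for illustration and are not part of Tika or Tesseract.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list for one Tesseract CLI call (hypothetical helper).

    Passing "stdout" as the output base makes tesseract print the
    recognized text instead of writing a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run OCR on one image and return the recognized text (hypothetical helper)."""
    # Assumption: tesseract is installed locally; fail early if it is not.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits non-zero
    )
    return result.stdout
```

Calling `ocr_image("scan.png")` should return whatever text Tesseract recognizes in that image; the quality depends a lot on the scan, so treat this as a starting point rather than a full pipeline.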



