
We do position-based text extraction. However, we also add an 'unpaper' step, which tries to correct misalignments and improve the quality of the scan.
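
To illustrate what position-based extraction can look like, here is a minimal sketch, not the actual implementation discussed here. It assumes the OCR or PDF layer has already produced words with bounding boxes; the `Word` class, the `words_in_region` helper, and the coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A recognised word and its bounding box (hypothetical structure)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def words_in_region(words, left, top, right, bottom):
    """Join the words whose centre point lies inside the given rectangle.

    Words are ordered top to bottom, then left to right. This assumes
    words on the same line share the same top edge, which a deskewing
    step is meant to make more likely.
    """
    hits = [
        wd for wd in words
        if left <= wd.x + wd.w / 2 <= right
        and top <= wd.y + wd.h / 2 <= bottom
    ]
    hits.sort(key=lambda wd: (wd.y, wd.x))
    return " ".join(wd.text for wd in hits)


page = [
    Word("Invoice", 10, 10, 50, 10),
    Word("Total:", 100, 200, 40, 10),
    Word("42.00", 150, 200, 30, 10),
]
total_field = words_in_region(page, 90, 190, 200, 220)
```

Because the lookup depends on fixed coordinates, a skewed or shifted scan moves words out of their expected rectangle, which is why correcting misalignment first matters for this approach.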



What OCR library do you use? What languages does it support?


For scanned images we use https://github.com/tesseract-ocr/tesseract. For text-based PDFs we pull the text directly from the file, so all languages are supported.
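
The split between the two paths could be sketched roughly as below. This is an assumption-laden illustration rather than the actual pipeline: how the embedded text is read from the PDF is not stated here, so it is simply passed in as a string, and the function names `tesseract_command` and `extract_text` are made up. The Tesseract call uses its command-line form `tesseract <image> stdout -l <lang>`.

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the Tesseract CLI call that prints recognised text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def extract_text(embedded_text, image_path=None, lang="eng"):
    """Prefer the text layer of the file; fall back to OCR for scans.

    embedded_text: text pulled directly from the PDF, or "" if there is none
                   (how it is obtained is outside this sketch).
    image_path:    rendered page image to OCR when no text layer exists.
    """
    if embedded_text and embedded_text.strip():
        # Text-based PDF: no OCR involved, so no OCR language model is needed.
        return embedded_text
    if image_path is None:
        return ""
    # Scanned page: requires the tesseract binary and the language data
    # for `lang` to be installed on the machine.
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On the OCR path the supported languages are limited to the Tesseract language data that is installed, while the direct path returns whatever text the file contains.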



