We do position based text extraction. We add however an 'unpaper' function which... | Hacker News


chezmo on July 21, 2016 | on: Show HN: Convert PDF files into structured data

We do position-based text extraction. However, we also add an 'unpaper' function which tries to correct misalignments and improve the quality of the scan.
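
To illustrate what position-based extraction means in general, here is a minimal sketch. It is not the product's actual code: the `Word` type, the `extract_region` function, and the `line_tolerance` parameter are all hypothetical names, and it assumes the words and their page coordinates have already been produced by a PDF parser or an OCR step. It also shows why misaligned scans matter: a skewed page shifts the coordinates, so words can fall outside the box they belong to.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A recognized word and the position of its top-left corner (hypothetical type)."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def extract_region(words, x0, y0, x1, y1, line_tolerance=3.0):
    """Return the text of all words whose corner lies inside the box (x0, y0)-(x1, y1).

    Words whose y positions differ by at most `line_tolerance` from the first
    word of a line are treated as belonging to that line.
    """
    inside = [w for w in words if x0 <= w.x <= x1 and y0 <= w.y <= y1]
    inside.sort(key=lambda w: (w.y, w.x))

    lines = []
    for w in inside:
        if lines and abs(lines[-1][0].y - w.y) <= line_tolerance:
            lines[-1].append(w)   # same line as the previous word
        else:
            lines.append([w])     # start a new line

    # Within each line, read the words from left to right.
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    )


# Made-up sample data for illustration only.
words = [
    Word("Invoice", 50, 100),
    Word("No.", 120, 101),
    Word("12345", 160, 100),
    Word("Total", 50, 300),
    Word("99.00", 160, 301),
]
invoice_number = extract_region(words, 100, 90, 250, 110)
```

The tolerance is there because words on one printed line rarely share exactly the same y coordinate, especially in scans.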

ComodoHacker on July 21, 2016

What OCR library do you use? What languages does it support?

chezmo on July 21, 2016

For scanned images we use https://github.com/tesseract-ocr/tesseract. For text-based PDFs we pull the text directly from the file, and all languages are supported.
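
For anyone curious what calling Tesseract on a scanned image looks like, here is a rough sketch using its command-line tool from Python. The helper names `tesseract_command` and `ocr_image` are made up for illustration, and this says nothing about how the service itself invokes Tesseract. It assumes the `tesseract` binary is installed and on the PATH.

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the command line for OCR on one image (hypothetical helper).

    `tesseract <image> stdout -l <lang>` prints the recognized text to
    standard output instead of writing it to a file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on an image and return the recognized text.

    Assumes the `tesseract` binary and the language data for `lang`
    are installed; raises CalledProcessError if the command fails.
    """
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Which languages work for scanned images depends on the language data installed alongside Tesseract; `tesseract --list-langs` prints the ones available on a given machine.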
