Hacker News
chezmo
on July 21, 2016
on:
Show HN: Convert PDF files into structured data
We do position-based text extraction. However, we also add an 'unpaper' function that tries to correct misalignments and improve the quality of the scan.
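To illustrate what position-based extraction can mean in general, here is a minimal sketch. It is not the product's actual code: `Word`, `Region`, `extract_region`, and the `line_tolerance` value are all hypothetical, and it assumes word bounding boxes are already available in a top-left-origin coordinate system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word with its bounding box (hypothetical structure, top-left origin)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class Region:
    """A rectangular area of the page to pull text from (hypothetical)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, word: Word) -> bool:
        # A word belongs to the region if its center point lies inside it.
        cx = (word.x0 + word.x1) / 2
        cy = (word.y0 + word.y1) / 2
        return self.x0 <= cx <= self.x1 and self.y0 <= cy <= self.y1


def extract_region(words, region, line_tolerance=3.0):
    """Return the text inside `region`, grouped into lines by vertical position."""
    hits = sorted((w for w in words if region.contains(w)),
                  key=lambda w: (w.y0, w.x0))
    lines = []
    for word in hits:
        # Words whose top edge is close to the line's first word share that line.
        if lines and abs(lines[-1][0].y0 - word.y0) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x0))
        for line in lines
    )
```

The tolerance is there because scanned pages are rarely perfectly straight, so words on the same visual line can have slightly different vertical coordinates.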
ComodoHacker
on July 21, 2016
What OCR library do you use? What languages does it support?
chezmo
on July 21, 2016
For scanned images we use https://github.com/tesseract-ocr/tesseract. For text-based PDFs we pull the text directly from the file, and all languages are supported.
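A rough sketch of that split, under stated assumptions: the page's embedded text and a rendered image of the page are obtained elsewhere, and the `tesseract` command-line tool is installed. The function names and the `runner` parameter are hypothetical and only illustrate the idea, not the actual implementation.

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    # `tesseract <image> stdout -l <lang>` prints the recognized text to stdout.
    return ["tesseract", image_path, "stdout", "-l", lang]


def page_text(embedded_text, image_path, lang="eng", runner=subprocess.run):
    """Prefer the PDF's embedded text; fall back to OCR for scanned pages."""
    if embedded_text.strip():
        # Text-based PDF: the text comes straight from the file, no OCR needed.
        return embedded_text
    # Scanned page: no usable text layer, so run Tesseract on the page image.
    result = runner(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The `lang` argument only matters on the OCR path, since Tesseract needs the matching language data installed, whereas embedded text is read as-is.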