Hacker News
Pdftabextract – A set of tools for data mining OCR-processed PDFs (github.com/wzbsocialsciencecenter)
143 points by happy-go-lucky on Feb 26, 2017 | 4 comments



I did a double take; I thought I had just seen this on HN. Turns out PDFLayoutTextStripper was on the front page a few days ago: https://news.ycombinator.com/item?id=13729301


awesome! any guidance on why I might use this rather than Tabula?


Tabula works on text-based PDF documents, not on scanned content, so I assume it's not using OCR?


Anyone using this yet to automatically track SotA results on machine learning tasks?



