Pdftabextract – A set of tools for data mining OCR-processed PDFs

derwiki · on Feb 27, 2017

I did a double take; I thought I had just seen this on HN. It turns out PDFLayoutTextStripper was on the front page a few days ago: https://news.ycombinator.com/item?id=13729301

markovbling · on Feb 27, 2017

awesome! any guidance on why I might use this rather than Tabula?

nycdatasci · on Feb 27, 2017

Tabula works on text-based PDF documents, not on scanned content, so I assume it's not using OCR?

mrdrozdov · on Feb 27, 2017

Anyone using this yet to automatically track SotA results on machine learning tasks?