PDFMiner in Python

mrleinad · on Dec 11, 2011

I worked for about 3 years for a Spanish law website whose main business goal is to provide a centralized access point to legal documents for lawyers, and transforming PDFs to text was a regular task. We used Perl and lots of PDF -> text/HTML tools. I can say it's one of the most horrible tasks to perform, and I'm glad there are more options to work with those documents, but PDFs as a method for content distribution should be shot and buried in the desert. It's annoying, to say the least, when you don't have the necessary plugins/reader installed, and of course they can't be easily converted to other document formats without destroying something in the process. I'm happy I don't have to work on that anymore. A good thing, though, is that I've now mastered regular expressions.

Anyway, if someone gave this PDFMiner a shot, let me know how good/bad it is.

draegtun · on Dec 11, 2011

Yes, the problem isn't getting the text out (of a PDF) but getting it out in some consistent manner!

I've been working on a client's project for the last few weeks, parsing historic tabular PDF reports into some semantic form. After some testing, the client's team decided that using Adobe Acrobat to export the PDF as text was the best option. NB. "best" meaning it was the most consistent export of the options they tried!

I then wrote a Perl program using a custom parser to put meaning to all this lovely textual data :)

PS. Given more time I'd have liked to use an off-the-shelf parser like Regexp::Grammars - https://metacpan.org/module/Regexp::Grammars

PPS. And given more involvement in the PDF extraction process/decision I would have liked to test CAM::PDF - https://metacpan.org/module/CAM::PDF
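
For anyone curious what "putting meaning" to exported text can look like, here is a minimal sketch of the general idea. It is in Python rather than Perl (since the thread is about PDFMiner), and it is not the parser described above: the sample report, the column layout, and the names `SAMPLE`, `ROW` and `parse_report` are all made up for illustration. It assumes columns are separated by runs of two or more spaces, which real exports often violate.

```python
import re

# Hypothetical text as it might come out of a "PDF to text" export.
SAMPLE = """\
Quarterly Report            Page 1
Region        Units     Revenue
North           120    1,450.00
South East       85      990.50

Total           205    2,440.50
"""

# Assumed layout: a label, an integer column and a money column,
# each separated by two or more spaces.
ROW = re.compile(
    r"^(?P<region>\S.*?)\s{2,}(?P<units>\d+)\s{2,}(?P<revenue>[\d,]+\.\d{2})\s*$"
)


def parse_report(text):
    """Return one dict per line that looks like a data row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # skip titles, column headers and blank lines
        rows.append({
            "region": match.group("region"),
            "units": int(match.group("units")),
            "revenue": float(match.group("revenue").replace(",", "")),
        })
    return rows
```

Note that the "Total" line matches the row pattern too, so a real parser would probably need a separate rule for summary lines.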

tren · on Dec 11, 2011

I tried converting a relatively simple PDF document into HTML and the results were average. There were overlapping fonts, missing images etc.

Working with PDFs is an extremely frustrating experience. For years I've dealt with the poorly documented Adobe SDK and many third-party tools while working in the ebook industry. In my opinion, converting PDF to HTML is close to an impossible task that will never yield consistent results and will always require manual intervention.

However, a few years ago we developed an online reader that renders PDF files converted to images with a text overlay, which is far more reliable than a conversion to HTML. You can see an example book for free here: http://amigoreader.com/moonstone/. We hope to open this up for users to share and discuss their PDF documents in the near future.

itmag · on Dec 11, 2011

Dude, I am very interested in this.

When are you launching?

tren · on Dec 12, 2011

Early 2012 we're aiming for. We're launching our WP7 and Android readers first.

mark_l_watson · on Dec 11, 2011

I agree that PDF files are a nuisance to deal with in document repositories. I implemented a SharePoint clone 10 years ago for a customer in India (yes, a company in India hired a programmer in Arizona). They had 80 MBAs who churned out reports, new revisions, etc. The worst was dealing with analysts who embedded lots of image files in PDFs - they were not searchable and the text could not be extracted for summaries, etc.

trentonstrong · on Dec 11, 2011

I had the pleasure of using the out-of-the-box pdf2txt tool just yesterday. Worked pretty well for extracting some governmental data released (i.e. buried) inside of a PDF!
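
For reference, a rough sketch of driving that tool from a script. It assumes PDFMiner is installed and its `pdf2txt.py` script is on the PATH; the helper names `pdf2txt_command` and `extract_text` and the file names are hypothetical. Only the `-o` (output file) and `-t` (output type) options are used.

```python
import subprocess


def pdf2txt_command(pdf_path, txt_path):
    """Build the argument list for PDFMiner's pdf2txt.py script."""
    # -t text asks for plain-text output; -o names the output file.
    return ["pdf2txt.py", "-t", "text", "-o", txt_path, pdf_path]


def extract_text(pdf_path, txt_path):
    """Run pdf2txt.py (assumed to be on PATH) and raise if it fails."""
    subprocess.run(pdf2txt_command(pdf_path, txt_path), check=True)
```

Something like `extract_text("report.pdf", "report.txt")` would then write the extracted text next to the PDF, with the usual caveat that how usable the result is depends heavily on the document.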

doktrin · on Dec 11, 2011

Sick! The lack of adequate functionality in PyPDF has been a pet peeve of mine for a while! I really look forward to trying this out.

gahahaha · on Dec 11, 2011

I am disappointed that it can't extract the text from a password-protected PDF, since I have a few such documents. Not knowing much about the PDF format, I would assume the text would be easy to extract since it is easy to show the text on the screen. What is the best way to print and copy/paste from such documents?

narcissus · on Dec 11, 2011

I can't remember the exact process I went through for my password-protected PDFs, but the majority of them could be handled by converting them to a PS file and then back to PDF. I think I used Ghostscript?
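
A hedged sketch of what such a round trip could look like, assuming the Ghostscript `gs` executable is on the PATH and using its `ps2write` and `pdfwrite` output devices. This is a guess at the general approach, not the exact process described above; the helper names and file names are hypothetical, and whether it works at all depends on how a given file is protected.

```python
import subprocess


def roundtrip_commands(src_pdf, tmp_ps, dst_pdf):
    """Build Ghostscript commands for PDF -> PostScript -> PDF."""
    common = ["gs", "-q", "-dNOPAUSE", "-dBATCH"]
    # Step 1: render the PDF to PostScript.
    to_ps = common + ["-sDEVICE=ps2write", "-sOutputFile=" + tmp_ps, src_pdf]
    # Step 2: turn the PostScript back into a PDF.
    to_pdf = common + ["-sDEVICE=pdfwrite", "-sOutputFile=" + dst_pdf, tmp_ps]
    return [to_ps, to_pdf]


def roundtrip(src_pdf, tmp_ps, dst_pdf):
    """Run both steps in order, stopping at the first failure."""
    for command in roundtrip_commands(src_pdf, tmp_ps, dst_pdf):
        subprocess.run(command, check=True)
```

Expect the resulting PDF to lose things like bookmarks and links, since everything passes through PostScript on the way.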