I worked for about 3 years for a Spanish law website whose main business goal is to provide a centralized access point to legal documents for lawyers, and transforming PDFs to text was a regular task. We used Perl and lots of PDF -> text/HTML tools. I can say it's one of the most horrible tasks to perform, and I'm glad there are more options to work with those documents, but PDFs as a method for content distribution should be shot and buried in the desert. It's annoying, to say the least, when you don't have the necessary plugins/reader installed, and of course they can't be easily converted to other document formats without destroying something in the process. I'm happy I don't have to work on that anymore. A good thing, though, is that I have now mastered regular expressions.
Anyway, if someone gave this PDFMiner a shot, let me know how good/bad it is.
Yes, the problem isn't getting the text out (of a PDF) but getting it out in some consistent manner!
I've been working on a client's project for the last few weeks, parsing historic tabular PDF reports into some semantic form. After some testing, the client's team decided that using Adobe Acrobat to export the PDF as text was the best option. NB: "best" meaning it was the most consistent export of the options they tried!
I've then written a Perl program using a custom parser to put meaning to all this lovely textual data :)
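To illustrate the kind of post-processing described above, here is a minimal Python sketch. It is not the Perl program mentioned in the comment. It assumes the exported text keeps columns separated by at least two spaces and that the first non-empty line is a header row; the function name `parse_table` and the sample report are hypothetical.

```python
import re

# Assumption: two or more whitespace characters separate columns,
# while single spaces stay inside a cell.
COLUMN_GAP = re.compile(r"\s{2,}")


def parse_table(text):
    """Turn column-aligned text into a list of dicts keyed by the header row."""
    rows = [line.strip() for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    header = COLUMN_GAP.split(rows[0])
    records = []
    for line in rows[1:]:
        cells = COLUMN_GAP.split(line)
        if len(cells) != len(header):
            # Skip lines that do not fit the header (page footers, wrapped text).
            continue
        records.append(dict(zip(header, cells)))
    return records


# Hypothetical sample of text exported from a tabular PDF report.
SAMPLE = """
Region        Year   Total Sales
North East    1998   1,204
South         1998   987
Page 1 of 3
"""

if __name__ == "__main__":
    for record in parse_table(SAMPLE):
        print(record)
```

Real exports are rarely this tidy: empty cells, wrapped lines, and inconsistent spacing usually need rules specific to each report, which is probably why a custom parser was needed in the first place.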
I tried converting a relatively simple PDF document into HTML and the results were average. There were overlapping fonts, missing images etc.
Working with PDFs is an extremely frustrating experience. For years I've dealt with the poorly documented Adobe SDK and many third-party tools while working in the ebook industry. In my opinion, converting PDF to HTML is close to impossible: it will never yield consistent results and will always require manual intervention.
However, a few years ago we developed an online reader that renders PDF files converted to images with a text overlay, which is far more reliable than a conversion to HTML. You can see an example book for free here: http://amigoreader.com/moonstone/. We hope to open this up for users to share and discuss their PDF documents in the near future.
I agree that PDF files are a nuisance to deal with in document repositories. I implemented a SharePoint clone 10 years ago for a customer in India (yes, a company in India hired a programmer in Arizona). They had 80 MBAs who churned out reports, new revisions, etc. The worst was dealing with analysts who embedded lots of image files in PDFs - they were not searchable and the text could not be extracted for summaries, etc.
I had the pleasure of using the out-of-the-box pdf2txt tool just yesterday. It worked pretty well for extracting some governmental data released (i.e. buried) inside a PDF!
I am disappointed that it can't extract the text from a password-protected PDF, since I have a few such documents. Not knowing much about the PDF format, I would assume the text would be easy to extract since it is easy to show on the screen. What is the best way to print and copy/paste from such documents?
I can't remember the exact process I went through for my password-protected PDFs, but the majority of them could be handled by converting them to a PS file and then back to PDF. I think I used Ghostscript?