Yes, the problem isn't getting the text out (of a PDF) but getting it out in a consistent manner!
I've been working on a client's project for the last few weeks, parsing historic tabular PDF reports into some semantic form. After some testing, the client's team decided that using Adobe Acrobat to export the PDF as text was the best option. NB. "best" meaning that it was the most consistent export of the options they tried!
I've then written a Perl program using a custom parser to put meaning to all this lovely textual data :)
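To give a rough idea of what "putting meaning" to exported text can involve, here is a minimal Perl sketch. It is not the parser from the client project: the sample data, the column names, and the helper names `split_cells` and `parse_table` are all made up for illustration, and it assumes columns are separated by at least two spaces, which real exports won't always honour.

```perl
use strict;
use warnings;

# Hypothetical sample of text exported from a tabular PDF report;
# the column names and values are invented for illustration.
my $export = <<'END';
Region        Year    Total
North East    1998    1,204
South         1998      987
END

# Split a line into cells, assuming (for this sketch only) that
# columns are separated by two or more whitespace characters.
sub split_cells {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;          # trim leading and trailing whitespace
    return split /\s{2,}/, $line;
}

# Turn the exported text into a list of hashes keyed by the header row.
sub parse_table {
    my ($text) = @_;
    my @lines  = grep { /\S/ } split /\n/, $text;   # drop blank lines
    my @header = split_cells(shift @lines);
    my @rows;
    for my $line (@lines) {
        my @cells = split_cells($line);
        next unless @cells == @header;   # skip lines that do not fit the layout
        my %row;
        @row{@header} = @cells;          # hash slice: header name => cell value
        push @rows, \%row;
    }
    return \@rows;
}

my $rows = parse_table($export);
printf "%s %s %s\n", @{$_}{qw(Region Year Total)} for @$rows;
```

Lines that don't match the header's column count are silently skipped here; in practice you would probably want to log them, since they are exactly where an inconsistent export shows up.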
PS. Given more time I'd like to have used an off-the-shelf parser like Regexp::Grammars - https://metacpan.org/module/Regexp::Grammars
PPS. And given more involvement in the PDF extraction process/decision, I would have liked to test CAM::PDF - https://metacpan.org/module/CAM::PDF