
Thank you for your input. I forgot to mention in the original post that there is a tool called pdf-parser.py [1] which claims to be able to do that, but it produces broken output. I don’t know enough about Python or PDF internals to hack on it. I'm posting it here in the hope that the HN crowd can point me in the right direction.

[1]: https://blog.didierstevens.com/programs/pdf-tools/




I'd like to talk to you about this, but you don't have any contact details in your profile. You can find my email address in my profile.


Thank you very much. I updated my profile with an email address. Nevertheless, I emailed you via the contact details


Top tip: install mutool and run

mutool clean -d your.pdf clean.pdf

Now open clean.pdf with a text editor.


That's really a top tip! Thank you very much. It looks like the streams in the original file are compressed using FlateDecode. Passing it through mutool decompresses all streams and lets the parser do its job.
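For anyone curious what that decompression step amounts to for a single stream: FlateDecode is zlib/deflate, so a rough Python sketch looks like the one below. It is only an illustration, not how mutool does it. The function name inflate_streams and the toy object are made up for this example, and it assumes /Length is a plain integer in the stream dictionary (not an indirect reference) and that no other filters or predictors are applied.

```python
import re
import zlib

# Matches a stream dictionary followed by the "stream" keyword and its line ending.
# Assumption: the dictionary has no nested << >> pairs (true for the toy object below).
STREAM_START = re.compile(rb"<<(.*?)>>\s*stream\r?\n", re.DOTALL)
LENGTH = re.compile(rb"/Length\s+(\d+)")


def inflate_streams(pdf_bytes: bytes) -> list[bytes]:
    """Return the decompressed payload of each FlateDecode stream found."""
    results = []
    for match in STREAM_START.finditer(pdf_bytes):
        dictionary = match.group(1)
        length = LENGTH.search(dictionary)
        if b"/FlateDecode" not in dictionary or length is None:
            continue  # not a deflate-compressed stream, or no usable /Length
        start = match.end()
        payload = pdf_bytes[start:start + int(length.group(1))]
        results.append(zlib.decompress(payload))
    return results


# Toy stream object built by hand, just to exercise the function.
content = b"BT /F1 12 Tf (Hello) Tj ET"
compressed = zlib.compress(content)
toy = (
    b"4 0 obj\n<< /Length " + str(len(compressed)).encode()
    + b" /Filter /FlateDecode >>\nstream\n"
    + compressed
    + b"\nendstream\nendobj\n"
)

print(inflate_streams(toy))
```

On a real file, running mutool is still the safer route, since it also deals with cross-reference tables, object streams and chained filters.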

Thank you!


Great! Glad it worked. Happy to help you unpick things a bit further. When you look inside the PDF file you’ll see that it’s actually a “tree” of “things”. Each one starts with “1234 0 obj” (or something like that) and ends with “endobj”. And they reference each other, with references that look like “1234 0 R”, to build the structure. So for example, the document is made of a list of pages. So that’s one object. And each page is another object. And then each page is made of a bunch more stuff and so on. Somewhere in there, no doubt, you’ll find an object that’s your model.
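If it helps to see that structure without scrolling through the file by hand, here is a minimal Python sketch that lists which objects each object points to. It assumes the file has already been decompressed (for example with mutool clean -d), so the objects are visible as plain text. The function name object_references and the toy document are hypothetical, and the regexes are a rough approximation rather than a real PDF parser.

```python
import re

# "N G obj ... endobj" delimits one indirect object; "N G R" is a reference to one.
OBJECT = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def object_references(pdf_bytes: bytes) -> dict[int, list[int]]:
    """Map each object number to the sorted object numbers it references."""
    graph = {}
    for match in OBJECT.finditer(pdf_bytes):
        number = int(match.group(1))
        body = match.group(3)
        graph[number] = sorted({int(ref.group(1)) for ref in REFERENCE.finditer(body)})
    return graph


# Toy two-page document skeleton (no xref table or trailer, so not a valid PDF).
toy = b"""1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R 4 0 R] /Count 2 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R >>
endobj
4 0 obj
<< /Type /Page /Parent 2 0 R >>
endobj
"""

for number, targets in object_references(toy).items():
    print(number, "->", targets)
```

Here object 1 is the catalog pointing at the page list (object 2), which in turn points at the two pages. Following references outward from the pages on a real file is one way to track down the object that holds your model.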



