I don't, sorry. From what you wrote, you definitely seem more knowledgeable than...

thangngoc89 · on Jan 2, 2021

Thank you for your input. I forgot to mention in the original post that there is a tool called pdf-parser.py [1] which claims to be able to do that, but it produces broken output. I don’t know enough about Python or PDF internals to hack on it. I'm posting it here hoping that the HN crowd can point me in the right direction.

[1]: https://blog.didierstevens.com/programs/pdf-tools/

solresol · on Jan 2, 2021

I'd like to talk to you about this, but you don't have any contact details in your profile. You can find my email address in my profile.

thangngoc89 · on Jan 2, 2021

Thank you very much. I've updated my profile with an email address. In any case, I've emailed you using the contact details in your profile.

aidos · on Jan 2, 2021

Top tip: install mutool and run

mutool clean -d your.pdf clean.pdf

Now open clean.pdf with a text editor.

thangngoc89 · on Jan 2, 2021

That's really a top tip! Thank you very much. It looks like the streams in the original file are compressed using FlateDecode. Passing the file through mutool decompresses all the streams and lets the parser do its job.

Thank you!
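
For anyone curious what "decompressing a stream" means here: FlateDecode is zlib/deflate compression, so a single stream body can in principle be inflated with Python's standard zlib module. The sketch below is only an illustration under that assumption. The helper name inflate_stream is hypothetical, it expects just the raw bytes between the "stream" and "endstream" keywords, and it ignores predictors and any other filters a real file may chain on top.

```python
import zlib


def inflate_stream(raw: bytes) -> bytes:
    """Inflate the body of a PDF stream whose /Filter is /FlateDecode.

    `raw` must be only the bytes between `stream` and `endstream`.
    Predictors (/DecodeParms) and chained filters are NOT handled here.
    """
    return zlib.decompress(raw)


# Round-trip demo on made-up page content, standing in for a real stream body.
content = b"BT /F1 12 Tf (Hello) Tj ET"
compressed = zlib.compress(content)
restored = inflate_stream(compressed)
```

In practice mutool clean -d does this for every stream in the file at once, which is why it is the easier route.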

aidos · on Jan 2, 2021

Great! Glad it worked. Happy to help you unpick things a bit further. When you look inside the PDF file you’ll see that it’s actually a “tree” of “things”. Each one starts with “1234 0 obj” (or something like that) and ends with “endobj”, and they reference each other with markers like “1234 0 R” to build the structure. So, for example, the document is made of a list of pages, and that list is one object. Each page is another object, each page is in turn made of a bunch more stuff, and so on. Somewhere in there, no doubt, you’ll find an object that’s your model.
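
To make the “tree of things” idea concrete, here is a rough Python sketch that scans an uncompressed PDF (such as the output of mutool clean -d) for object headers and the references inside each object. It is a naive regex scan, not a real parser: the names index_objects, OBJ_RE and REF_RE are made up for this example, the sample bytes are hand-written rather than taken from a real file, and binary stream data or object streams in a real file could confuse it.

```python
import re

# "N G obj ... endobj" delimits one indirect object; "N G R" is a reference to one.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def index_objects(data: bytes) -> dict:
    """Map each object number to the object numbers it references."""
    graph = {}
    for match in OBJ_RE.finditer(data):
        number = int(match.group(1))
        body = match.group(3)
        graph[number] = [int(ref.group(1)) for ref in REF_RE.finditer(body)]
    return graph


# Hand-written miniature document: catalog -> page list -> single page.
sample = b"""1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R >>
endobj
"""

graph = index_objects(sample)
print(graph)  # {1: [2], 2: [3], 3: [2]}
```

On a real file you would read the bytes with open("clean.pdf", "rb").read() and follow the references from the catalog down until you reach the object you are after.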