
One of my good friends did a lot of research on PDFs as part of his graduate research. Older versions of Adobe Writer (maybe even the current one too?) would always append and never overwrite. So if you edited pages, it would add those edits to the bottom of the file. As long as you did everything in the Writer workflow and didn't Save As a new file, you could see a history of old edits. You can even find stuff that's blacked out in some government documents.
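A rough way to check for this yourself, as a minimal sketch: each incremental save normally appends a new section that ends with its own `%%EOF` marker, so counting those markers hints at how many revisions a file holds. The `count_revisions` helper and the synthetic sample bytes below are made up for illustration and are not from any tool mentioned here. The count is only a heuristic, since linearized PDFs can carry two markers without any edits, and the marker bytes could in principle appear inside stream data.

```python
import re

# Each incremental save normally ends with its own %%EOF marker.
EOF_MARKER = re.compile(rb"%%EOF")


def count_revisions(pdf_bytes: bytes) -> int:
    """Return the number of %%EOF markers found in the raw bytes (a heuristic)."""
    return len(EOF_MARKER.findall(pdf_bytes))


# Synthetic stand-in for a file that was saved once and then edited once.
# It is NOT a valid PDF; it only mimics the append-only layout.
sample = (
    b"%PDF-1.4\n1 0 obj\n(original text)\nendobj\n"
    b"xref\ntrailer\n<< /Size 2 >>\nstartxref\n0\n%%EOF\n"
    b"1 0 obj\n(edited text)\nendobj\n"
    b"xref\ntrailer\n<< /Size 2 /Prev 0 >>\nstartxref\n0\n%%EOF\n"
)

print(count_revisions(sample))  # 2
```

For a real file you would pass in the result of `open(path, "rb").read()`; a count above one suggests, but does not prove, that earlier revisions are still in the file.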



I cannot recommend qpdf [1] enough if you want to play around with PDFs.

Aside from being an excellent PDF manipulation library, it also has a mode that outputs a version of the PDF that is much easier to edit in a text editor, and it then lets you build a new PDF from the edited file.
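A minimal sketch of driving that mode from Python, assuming the `qpdf` command-line tool is installed and on your PATH. The mode is called QDF, and `--qdf` and `--object-streams=disable` are real qpdf options; the helper names `qdf_command` and `write_qdf` and the file names are invented for this example.

```python
import shutil
import subprocess


def qdf_command(src: str, dst: str) -> list[str]:
    """Build the qpdf invocation that writes an editable QDF copy of src to dst."""
    # --qdf writes the normalized, text-editor-friendly form;
    # --object-streams=disable keeps every object visible as plain text.
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]


def write_qdf(src: str, dst: str) -> bool:
    """Run qpdf if it is available; return False when the tool is not installed."""
    if shutil.which("qpdf") is None:
        return False
    # check=True raises CalledProcessError if qpdf exits with a nonzero status.
    subprocess.run(qdf_command(src, dst), check=True)
    return True


print(qdf_command("in.pdf", "editable.pdf"))
```

After hand-editing the output, qpdf's companion `fix-qdf` program is meant to repair the offsets and lengths so the result is a loadable PDF again; check the qpdf manual for the exact workflow.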

Shout out to Jay, who has been steadily working on it for many, many years. He is the most kind, understanding and hard-working free software developer I've had the pleasure to cross paths with. Thanks for all your hard work, Jay!

[1] https://github.com/qpdf/qpdf


I don't see an ability to view PDF version history in this tool. Am I missing something?


From the docs:

QPDF is not a PDF content creation library, a PDF viewer, or a program capable of converting PDF into other formats.


But doesn't being able to see all streams mean that I can see edited/erased content from prior states of the document?


Not necessarily. Nothing guarantees that old content is preserved in inaccessible streams: the entire PDF file can be rewritten from scratch, discarding all old content.


This is by design and not surprising at all if you read even a tiny bit about PDF. It's in fact the default save method in nearly every PDF capable software. Rewriting the PDF is in fact the less common method. I'm surprised a researcher of PDF would be surprised by that.

However, if you are using a tool like a redaction tool then the software should forbid you from writing in append mode. This was a common error in old PDF apps and perhaps contemporary ones that are new.

Edit for politeness:

My surprise is aimed at the researcher, not you :)


The person you're replying to didn't do the research themselves. They said as much.

They were just sharing something they were surprised by/interested in. I was surprised to read that's how editing PDFs works too.


Yeah sorry if I'm unclear or misinterpreting. My comment is about the researcher not the person I replied to.

I do agree that it's surprising behaviour to regular users of PDF that it usually maintains a history of sorts. Apps should make this clearer.


The researcher probably wasn't surprised; it looks like the person you replied to was surprised. Perhaps I was surprised that you were surprised? :)


I'm surprised by all these appended surprises revealing a history of surprise :D


Surprisingly, the researcher might also have been surprised the first time they found out.


“You” is ambiguous in English.

https://www.merriam-webster.com/dictionary/you:

  1. the one or ones being addressed 
  2. ONE sense 2a (which is “being one in particular”)
So, pro tip: in chat-like discussions with strangers, such as on Hacker News, one should prefer saying “one” when using sense 2, even if it sounds a bit archaic (at least to me. Is it?)

Also, when reading a “you” that could be interpreted both ways, do not assume it is used in sense 1.


"One" is very archaic, I always fear it will be confusing for non-native speakers, and sound stuck-up to native speakers, and tend to avoid it.


Where did they say the researcher friend was surprised? Where did they say they were surprised?


Wow that's kind of interesting and the least bit surprising.

Wasn't there a search engine built into finding redacted PDF content? I think it made the headlines here a while back.


Searching for "pdf search" isn't finding anything significant.

"PDF drive" (https://news.ycombinator.com/item?id=25240373, 0 comments) just appears to be an ebook crawler over in the less-than-#FFFFFF-department if you get what I mean.

I also found a thread talking about searching PDFs for specific queries (https://news.ycombinator.com/item?id=10154527) which appears to have generated some interesting results back when the thread was posted, in 2015.

Not seeing anything recent though. But on the subject of a search engine specifically for finding redacted content, I couldn't help but imagine the discussion...

"Hi, I would like to find a •••••••."

"You specifically want a •••••••?"

"Yes, literally."

[Person 2 walks away scratching their head wondering what person 1 would do with a 'hunter2']


You might be remembering `Google PDF Search: “not for public release”` from 2015 [1] and 2019 [2].

[1] https://news.ycombinator.com/item?id=10154527

[2] https://news.ycombinator.com/item?id=20420209


Might have been the one on reversing pixelation?


Isn’t programming fun?

There’s always a scary world lurking underneath it seems.


Sounds more like an intentional backdoor.


Not really. Saving changes by appending them to the end of the file used to be fairly common (I assume for performance reasons on big docs, back when computers were much more constrained). Microsoft Word did the same thing back in the day.


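It's not only common, it's still the way PDFs are usually saved. Open a PDF in a text editor (the file structure is largely plain text, although stream contents are often compressed binary) and you can see each round of edits appended as a new section that ends with its own "trailer".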
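As a sketch of what that append-only layout allows: if a file really was saved incrementally, cutting it off right after an earlier `%%EOF` marker should leave the document as it stood at that save. The `revision_snapshots` helper and the toy bytes below are invented for illustration; the toy input is not a valid PDF, and this will not recover anything from a file that was fully rewritten on save.

```python
def revision_snapshots(pdf_bytes: bytes) -> list[bytes]:
    """Return one truncated copy of the file per %%EOF marker, oldest first."""
    marker = b"%%EOF"
    snapshots = []
    pos = pdf_bytes.find(marker)
    while pos != -1:
        end = pos + len(marker)
        # Everything up to and including this marker is one earlier state.
        snapshots.append(pdf_bytes[:end] + b"\n")
        pos = pdf_bytes.find(marker, end)
    return snapshots


# Toy stand-in for an incrementally saved file (not a valid PDF).
doc = b"%PDF-1.4\n(original text)\n%%EOF\n(edited text)\n%%EOF\n"

snaps = revision_snapshots(doc)
print(len(snaps))                    # 2
print(b"edited text" in snaps[0])    # False
print(b"original text" in snaps[0])  # True
```

Writing `snaps[0]` to a new file and opening it in a viewer is the manual version of "seeing the history", which is also why redaction tools should not save in append mode.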


This sounds like a feature that could be exploited in creative ways in either a product or some fun side project. I don't know what it is exactly, but there's something about never editing edit logs (possibly with that not being obvious to the user as a factor) plus some graphical UI representation or UX flow (besides undo).


I remember reading that the early versions of MS Word did the same thing, for performance reasons.


Blockchain FTW!



