Can anyone recommend a method to deduplicate pdfs? The hash is often different b...

pixelmonkey · 2024-07-02T15:18:10 1719933490

You might want to strip metadata before doing a comparison, using exiftool. Even though exiftool was originally written for EXIF metadata on JPGs, these days it supports a lot of metadata standards, including PDF. This command will do it, assuming you set filename=`basename your.pdf .pdf`:

    exiftool -all= -o "${filename}.stripped.pdf" "${filename}.pdf"

That won't help you with small differences in the contents, but might help with small differences in metadata. Running `md5sum` on the stripped PDF should give more reliable dedupe results.
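To make the strip-then-hash idea concrete, here is a rough Python sketch. It assumes exiftool is on your PATH and reuses the same `-all= -o` invocation as above. The function names (`strip_metadata`, `file_digest`, `find_duplicates`) are made up for illustration. MD5 is used only to mirror `md5sum`; any hash would do.

```python
import hashlib
import subprocess
from collections import defaultdict
from pathlib import Path


def strip_metadata(src: Path, dst: Path) -> None:
    """Write a metadata-stripped copy of src to dst (assumes exiftool is installed)."""
    # Same flags as the shell command above; exiftool refuses to overwrite an existing dst.
    subprocess.run(["exiftool", "-all=", "-o", str(dst), str(src)], check=True)


def file_digest(path: Path) -> str:
    """MD5 of the file contents, read in 1 MiB chunks so large PDFs are fine."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths):
    """Group paths by digest and return only the groups with more than one file."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_digest(Path(p))].append(Path(p))
    return [group for group in groups.values() if len(group) > 1]
```

You would call `strip_metadata` on each PDF first, then pass the stripped copies to `find_duplicates`. Files that differ in anything other than the stripped metadata will still land in separate groups.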

I was recently working on a similar problem for JPG, RAW, and MP4 files (photo/video backup) so it is fresh in my mind.

bob1029 · 2024-07-02T15:56:43 1719935803

I would consider rasterizing the PDFs and then hashing the resulting bitmaps.
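A minimal sketch of what that could look like in Python, assuming `pdftoppm` from poppler-utils is installed (my choice of rasterizer, nothing above specifies one). The function names and the 72 DPI default are illustrative. Note that two PDFs only get the same digest if they render to byte-identical bitmaps with the same renderer and resolution, so tiny visual differences still produce different hashes.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path


def hash_page_files(page_files) -> str:
    """SHA-256 over the page bitmaps, taken in page order."""
    h = hashlib.sha256()
    # Sort by (name length, name) so page-2 comes before page-10 even without zero padding.
    for page in sorted(page_files, key=lambda p: (len(p.name), p.name)):
        h.update(page.read_bytes())
    return h.hexdigest()


def raster_digest(pdf: Path, dpi: int = 72) -> str:
    """Rasterize every page to grayscale PGM with pdftoppm, then hash the bitmaps."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # Assumes pdftoppm is on PATH; it writes page-<n>.pgm files next to the prefix.
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-gray", str(pdf), str(prefix)],
            check=True,
        )
        return hash_page_files(Path(tmp).glob("page-*"))
```

PDFs whose `raster_digest` values match can then be treated as duplicate candidates, regardless of differences in metadata or internal object layout.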

strangus · 2024-07-02T14:15:40 1719929740