Hacker News new | past | comments | ask | show | jobs | submit login

Can anyone recommend a method to deduplicate pdfs? The hash is often different but the content and meta data is 99.99% the same.



You might want strip metadata before doing a comparison, using exiftool. Even though exiftool was originally written for EXIF metadata on JPGs, these days, it supports a lot of metadata standards, including PDF. This command will do it assuming you set filename=`basename your.pdf .pdf`:

    exiftool -all= -o ${filename}.stripped.pdf ${filename}.pdf
That won't help you with small differences in the contents, but might help with small differences in metadata. Running `md5sum` on the stripped PDF should give more reliable dedupe results.

I was recently working on a similar problem for JPG, RAW, and MP4 files (photo/video backup) so it is fresh in my mind.


I would consider rasterizing the PDFs and then hashing the resulting bitmaps.


cp?




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: