
Interesting. I can confirm my macOS (12.6.3) Preview app (11.0) is having problems with that PDF -- for instance, pages 2-12 are blank in Preview, whereas pages 3, 5 and 7 do show in the in-browser HTML/JS viewer. (Pages 4, 6, and 8 are actually blank, I think!)

If I view the PDF in the Chrome browser, its PDF rendering engine has no apparent problems; it seems to show the same pages as the in-browser HTML/JS viewer.

I've been investigating PDF generation from digitized historical materials myself lately, so I happen to know that the Internet Archive PDFs use some sophisticated compression techniques to make PDFs substantially smaller than they would be if they simply embedded relatively high-res raster JPGs. They use both JPEG2K raster images and JBIG2 bitmaps, with a bitmask that lets them apply higher-quality compression to the text and lower-quality (more compressed, smaller bytesize) compression to the page background.

This is known as the "Mixed Raster Content (MRC)" approach [1] -- several commercial/proprietary packages also claim to implement it, I think mostly in the "document management" space. I think the Internet Archive's open source Python tool is the only open source implementation [2]. (They have been using their home-built open source tool for 2-3 years; before that they were using a toolchain involving a proprietary tool for the MRC-style compression. I don't know whether some PDF downloads on the live site may still be cached from the previous toolchain.)
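To make the mask idea concrete, here's a toy sketch in pure Python. It is not how the Internet Archive tool works internally; the function names, the fixed threshold, and the block-averaging "lossy" step are all my own assumptions, standing in for JBIG2 on the mask and JPEG2K on the other layers. It only shows the layering: a 1-bit mask picks, per pixel, between a foreground (ink) layer and a heavily degraded background layer.

```python
def split_layers(page, threshold=128):
    """Split a grayscale page (rows of 0-255 ints) into mask, foreground, background.

    Hypothetical helper: real MRC encoders use much smarter segmentation
    than a fixed threshold.
    """
    # 1-bit mask: True where the pixel looks like ink (text).
    mask = [[px < threshold for px in row] for row in page]
    dark = [px for row in page for px in row if px < threshold]
    light = [px for row in page for px in row if px >= threshold]
    ink = sum(dark) // len(dark) if dark else 0
    paper = sum(light) // len(light) if light else 255
    # Foreground holds only the ink colour; the mask carries the sharp shapes.
    foreground = [[ink for _ in row] for row in page]
    # Background has the text painted out so it compresses well.
    background = [
        [paper if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(page, mask)
    ]
    return mask, foreground, background


def degrade(layer, factor=2):
    """Crude stand-in for lossy compression: average each factor x factor block."""
    height, width = len(layer), len(layer[0])
    out = [row[:] for row in layer]
    for y0 in range(0, height, factor):
        for x0 in range(0, width, factor):
            ys = range(y0, min(y0 + factor, height))
            xs = range(x0, min(x0 + factor, width))
            block = [layer[y][x] for y in ys for x in xs]
            mean = sum(block) // len(block)
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def compose(mask, foreground, background):
    """Rebuild the page: foreground where the mask is set, background elsewhere."""
    return [
        [f if m else b for m, f, b in zip(mask_row, fg_row, bg_row)]
        for mask_row, fg_row, bg_row in zip(mask, foreground, background)
    ]


page = [
    [250, 20, 250, 250],
    [250, 20, 240, 250],
]
mask, fg, bg = split_layers(page)
rebuilt = compose(mask, fg, degrade(bg))
```

In the rebuilt page the text column keeps its exact ink value even though the background was averaged down, which is the whole point: the viewer has to composite three images per page correctly, and that is one more place a renderer could go wrong.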

It's a pretty neat technique. Here's a video where Merlijn Wajer from the Internet Archive talks about their project [3]. Here's one proprietary software vendor's explanation of the MRC technique [4].

While everything used in this approach ought to be in-spec for PDF rendering, I wonder if some PDF renderers (such as macOS Preview) are having trouble with either some of the image formats (JPEG2000 or JBIG2, although both are spec'd by the PDF standard) or the overall technique.
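One cheap way to check which image formats a given PDF actually uses is to look for the stream filter names from the PDF spec: `JPXDecode` (JPEG2000), `JBIG2Decode`, and `DCTDecode` (plain JPEG). Below is a rough sketch using only the standard library. The function name is mine, and it just scans the raw bytes rather than parsing the file, so treat the counts as a hint: it assumes the image dictionaries are readable in the file body and will see nothing in an encrypted PDF.

```python
import re


def count_image_filters(pdf_bytes):
    """Count occurrences of common image filter names in raw PDF bytes.

    Hypothetical helper: a byte scan, not a real PDF parser.
    """
    names = (b"JPXDecode", b"JBIG2Decode", b"DCTDecode", b"CCITTFaxDecode")
    return {
        name.decode("ascii"): len(re.findall(rb"/" + name + rb"\b", pdf_bytes))
        for name in names
    }


# Tiny hand-written fragment standing in for a real file's contents.
sample = (
    b"<< /Type /XObject /Subtype /Image /Filter /JPXDecode >>\n"
    b"<< /Type /XObject /Subtype /Image /Filter [/JBIG2Decode] >>\n"
)
counts = count_image_filters(sample)
```

For a real file you would pass in `open(path, "rb").read()`. If the failing PDF shows `JPXDecode` and `JBIG2Decode` where a PDF that renders fine shows only `DCTDecode`, that at least narrows down where to look.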

The alternative is pretty enormous PDFs for digitized materials at full-resolution though.

Since I've been investigating PDF generation for digitized historical content, though, I am curious what is going on here, and how to avoid it.

[1] https://en.wikipedia.org/wiki/Mixed_raster_content

[2] https://archive-pdf-tools.readthedocs.io/en/latest/

[3] https://youtu.be/DqA1YPfDlhg?t=972

[4] https://www.gdpicture.com/blog/advanced-mrc-compression/



