Dumb question: Why can't we just burn PNGs[0] or lossless JPEGs and just use OCR...

da_chicken · on Dec 13, 2018

Because OCR is expensive (to write as software and to process for the end user) and very error prone, especially if your text is anything other than a 12 point black font on a white background with no formatting (italics, underlines, etc.). If my document's information is valuable, I'm not going to be willing to rely on the quality of my recipient's OCR software to get a digitally readable copy of my work. I mean, at the very least, what if they're blind?

The general hatred for PDFs in the tech community is almost completely rooted in Adobe's initial decision to make PDF editing and creation cost $500. You have access to a document that you want to make changes to, but you can't because it's a PDF and you don't have access to the document source because the owner/publisher didn't provide that. It's a PDF because PDFs make documents that look the same everywhere, even when printed, which is and will remain critical to the purpose of publishing documents. Well, images don't solve this problem, either, because you still can't edit text in an image, and now you lose the ability to be sure about how they'll print (margins, scaling, etc.).

Furthermore, images, even compressed, are significantly larger than a well-made PDF. For example, I've got a 6,700 page document of special ed student progress reports that include detailed, full-color charts and graphs of student progress with respect to goals. It's 60 MB, or roughly 8.5 KiB per page.
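For a sense of scale, here is a back-of-the-envelope sketch of that per-page figure next to an uncompressed raster page. The paper size (US Letter), resolution (300 DPI), and color depth (24-bit RGB) are assumptions for illustration and are not from the comment. "60 MB" is read as decimal megabytes, so the quotient lands near, not exactly on, the per-page number quoted above.

```python
# Back-of-the-envelope: PDF bytes per page vs. one uncompressed raster page.
# Assumptions (not from the comment): US Letter paper, 300 DPI, 24-bit RGB.

PDF_SIZE_BYTES = 60 * 1000 * 1000  # "60 MB", read here as decimal megabytes
PAGE_COUNT = 6700

pdf_bytes_per_page = PDF_SIZE_BYTES / PAGE_COUNT

DPI = 300
width_px = round(8.5 * DPI)   # 8.5 inches wide
height_px = 11 * DPI          # 11 inches tall
raster_bytes_per_page = width_px * height_px * 3  # 3 bytes per RGB pixel

# How many times larger the raw raster page is than the average PDF page.
ratio = raster_bytes_per_page / pdf_bytes_per_page

print(f"PDF:    {pdf_bytes_per_page / 1024:.1f} KiB per page")
print(f"Raster: {raster_bytes_per_page / 1024**2:.1f} MiB per page, uncompressed")
print(f"Raw raster is about {ratio:.0f}x larger before any image compression")
```

Image compression would close much of that gap, but it starts from thousands of times more data per page under these assumptions.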

Then again, I imagine it won't be long before someone mentions LaTeX as a viable alternative, even though the one thing LaTeX isn't is portable. But LaTeX is primarily popular in the tech community because it lets programmers pretend to write code while they're actually writing documentation. Nowhere else will you find people telling you to use a set of programs that require a build environment when someone asks about the best home office application to use. (Yes, I know that LaTeX is a typesetting language. My cynicism is that some tech people tell others to use LaTeX when they're asked what word processor someone should use.)

Edit: Clarified second paragraph.

intertextuality · on Dec 13, 2018

> Then again, I imagine it won't be long before someone mentions LaTeX as a viable alternative, even though the one thing LaTeX isn't is portable. But LaTeX is primarily popular in the tech community because it lets programmers pretend to write code

Rude remarks notwithstanding, LaTeX and its ilk let you make PDFs, which are indeed portable. Setting up LaTeX is the same as setting up any other program, some of which are not portable either. ShareLatex.com [0] also exists for the purpose of using LaTeX anywhere.

People recommend LaTeX because it's in another league when it comes to typesetting and rendering more niche notation. It's also not user hostile when it comes to binary files. LaTeX source files will always be readable decades later; <binary app here> makes no such guarantees.

Whether it's a viable alternative depends on whether the user wants to make a minimal learning investment or not. If they don't, google sheets > export to pdf always exists.

[0]: https://www.sharelatex.com/

grkvlt · on Dec 14, 2018

dvi files are about as user hostile as it gets for a rendered document format, though

3pt14159 · on Dec 13, 2018

No, the hatred for PDFs is that they're filled with bloat and horribly insecure.

As for OCR, we're able to handle underlines and italics for most fonts, though I take your point on colour. If it's especially bad they fail. Ideally it wouldn't be PNGs; it would be some stripped-down thing. Maybe even HTML with embedded CSS / images via data tags would fit the bill, but now we're bringing in XML-esque parsers and those are garbage too. I'm just so frustrated with dealing with PDFs. They serve a billion different purposes and they're good at none of them.

wbl · on Dec 13, 2018

Accessibility, plus print is at ridiculous DPI compared to screen. To achieve compression you want to use the fact that there is a font being repeated across the page. OCR just isn't good enough.

3pt14159 · on Dec 13, 2018

Are you telling me that our compression algorithms can't compress a page of "e"s tighter than a page of random Chinese characters?

Accessibility is a fair point, but for print-to-file applications we're surely at the point where OCR can at least get the text to a readable format, no?
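The compression question above can be checked quickly at the text level. This is a minimal sketch using Python's standard `zlib` (DEFLATE), comparing a page of "e"s with a page of randomly chosen CJK ideographs. The page length of 3,000 characters and the fixed random seed are assumptions picked for illustration. Note that this compresses encoded text, not rendered pixels, so it says nothing directly about rasterized pages at print DPI.

```python
import random
import zlib

CHARS_PER_PAGE = 3000  # assumed page length, for illustration only

# A page consisting of nothing but the letter "e".
page_of_e = "e" * CHARS_PER_PAGE

# A page of characters drawn uniformly from the CJK Unified Ideographs
# block (U+4E00..U+9FFF). Seeded so repeated runs give the same page.
rng = random.Random(0)
page_of_cjk = "".join(
    chr(rng.randint(0x4E00, 0x9FFF)) for _ in range(CHARS_PER_PAGE)
)


def deflated_size(text: str) -> int:
    """Return the size in bytes of the UTF-8 text after zlib compression."""
    return len(zlib.compress(text.encode("utf-8"), 9))


e_size = deflated_size(page_of_e)
cjk_size = deflated_size(page_of_cjk)

print(f"page of 'e':        {e_size} bytes compressed")
print(f"page of random CJK: {cjk_size} bytes compressed")
```

The repeated page collapses to a handful of bytes while the random page stays in the kilobytes, so a general-purpose compressor does exploit repetition. Whether that carries over to scanned or rasterized pages depends on the image codec and resolution.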