Horrifying PDF Experiments (github.com/osnr)
493 points by thesephist on Jan 2, 2021 | 180 comments



"It might be possible to embed a C compiler into a PDF by compiling it to JS with Emscripten, for example, but then your C compiler has to take input through a plain-text form field and spit its output back through a form field."


You know, when I read "Horrifying PDF," I thought it would be an exaggeration.


Even just basic text is... interesting in PDFs. A few years back, I created a PDF which displayed its own MD5 hash by making every single letter a separate chain of sub-documents and using MD5 collisions to change which sub-document each pointed to without changing the hash. Pretty much every PDF reader managed to interpret this cleanly as ordinary, copy-and-pasteable text because it wasn't any worse than they could expect to encounter in an ordinary PDF, and they all had robust heuristics for dealing with these kinds of shenanigans. (The exception I found was PDF.js, possibly due to the fact it was rendering the whole thing to HTML.) The only real issue was that every PDF reader had a slightly different idea of what characters I could safely use in the names of those nested documents.


Well yeah. Why is that horrifying? Give any Turing complete system whatsoever basic IO capabilities and you can make it compile C.


Yes, but why is a document format Turing-complete in the first place?


Because it evolved to also be a client-side form fill-in & validation etc. format. It’s quite similar to JavaScript's use for HTML forms.


It’s often hard to make it not be. Heck, the fonts themselves are probably Turing complete.


You might enjoy this then! https://www.gwern.net/Turing-complete


Would be more impressive if it could still compile input to the form field after I print it out.


Now that's an intriguing concept. "A file format for declaratively specifying a physical data-communication artefact, abstractly-defined by the interactions it supports."

• Just showing the user text? Compiles to plaintext.

• Get the user to give some input? Compiles to a styled form, as PostScript.

• Add radio buttons? Compiles to a physical form but with a 3D-printed notched slider glued to it.

• Require validation for freeform-text form fields? Compiles to a 3D-print + VLSI + pick-and-place specification for a tablet embedded-device that displays the form and does the validation.

Now imagine a "printer" that takes such abstract documents as input, and can print any of these... :)


If it can't, must be a printer bug. Can't even print a PDF!


Maybe if you bury the page deep within a forest, so the compiler could hook into the distreebuted CPU cluster in order to facilitate more effective computation.


E-ink to the rescue.


I kinda have a noob question. Doesn't a compiler just translate high-level code to low-level code?

it doesn't actually execute code, right? Then what's the power of having a compiler in a PDF? you can output the executable, but can you run it?

also, is the "input" and "output" of this compiler just code and executables?


I don't think the example had any practical use, really. I understood it more as an illustration of how weird Chrome's scripting support is: On the one hand, it lets you put programs as complex as a working C compiler in there - but on the other hand, interaction with the outside world is limited to putting stuff into text fields...

> also, is the "input" and "output" of this compiler just code and executables?

Mostly yes. I'm not sure how much of a typical build chain he was trying to convert to JS here, but the compiler itself typically takes a bunch of files with C code and outputs a number of "object files", which are really chunks of machine code. In an actual build process, you'd then use a linker to glue those object files together in the right way and make an executable.

I guess, what you could do if you wanted was to include the whole build chain (including linker) into the PDF, encode the executable as Base64 and fill some form field of the PDF with it. Then your workflow would be as follows:

1) Write some C code

2) Copy the C code into form field #1 of the PDF.

3) Hit a "compile" button or something. Form field #2 fills with what looks like an enormous amount of random gibberish (really the Base64-encoded binary)

4) Copy all the text from form field #2, use a program of your choice to decode the Base64, and save the decoded binary on your hard drive (see the sketch after this list).

5) Run the binary on your hard drive and watch your C code execute. Hooray!
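
For step 4, anything that can decode Base64 works; here's a minimal Node.js sketch (out.b64 and a.out are made-up file names, just for illustration):

  // decode-output.js - turn the Base64 text copied out of form field #2
  // back into the binary the in-PDF toolchain produced (hypothetical file names).
  const fs = require('fs');

  const b64 = fs.readFileSync('out.b64', 'utf8').replace(/\s+/g, '');
  const binary = Buffer.from(b64, 'base64');
  fs.writeFileSync('a.out', binary);
  fs.chmodSync('a.out', 0o755); // mark it executable (POSIX only)
  console.log('wrote ' + binary.length + ' bytes to a.out');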


ah this was perfect and really cleared it up for me! thanks!


You compile C to JS. Then, since PDF readers are able to execute JS, you can basically execute C.


If you have an eval function or some kind of API to start that execution.


eval() is part of the JS language, so you obviously do. But regardless, you could make your own interpreter if necessary. You could compile to x86 and then run it in your own VM if you felt like it.
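
A sketch of the eval route inside the PDF's own JS context (Acrobat/Chrome-style PDF JavaScript, where this.getField is available; the field name is made up):

  // Run whatever JavaScript the in-PDF compiler wrote into a form field.
  var generated = this.getField("jsOutput").value; // "jsOutput" is a hypothetical field name
  eval(generated); // executes inside the same sandboxed PDF JS context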


Or run your own JavaScript interpreter; of course, nesting interpreters that way is going to be horribly slow.


I mean, despite all the weirdness, it's all still run by Chrome's V8 in the end, so it might work...


> it doesn't actually execute code, right? Then what's the power of having a compiler in a PDF? you can output the executable, but can you run it?

Depends what you mean by "run", really. You can write a full-on x86 emulator and execute a compiled binary there. But given that it's an emulator running in a nested series of sandboxes, it won't be terribly useful -- for example, it still won't have I/O capabilities.


This is truly the 9th circle of Hell...


Every time I see the Adobe logo somewhere I just cringe a little bit. From the time you had to have Acrobat Reader installed because most PDFs created with Acrobat (the writer) weren't really compatible with other readers, to the time everything interactive on the web was in Flash (even our governmental websites - for example the Social Insurance Institution - dropped Flash only a few days ago). My SO recently bought Adobe Lightroom and, lo and behold, you cannot install it on a case-sensitive filesystem (in 2020), and the help page says: "well, just install it on a case-insensitive filesystem". I'm quite surprised they allow file names longer than eight characters, dot, and three for the file type...


The day Macromedia was bought and butchered by Adobe was a sad one for me.


How might things have been different had Macromedia remained independent?


How things might have been different is a tough question to answer, because it's hypotheticals all the way down. There is a different version of history where Macromedia's two biggest products, Flash and Dreamweaver, took a different route, and neither died an ignominious death. Flash could have become an open web standard, driven by a programming language that isn't JavaScript, which we're all now forced to use due to browser support. Instead of using CSS for layout, we could be using something else. The cross-platform smartphone app ecosystem would look a whole lot different if iOS and Android both had built-in Flash interpreters.

Does this all sound like a fantasy? It should, because it is. Absent the history of it actually happening and being able to point at that, the question is akin to comparing two sports teams across history, e.g. the 2014 Golden State Warriors to the 2002 Mavericks, and trying to talk through which team would win.

Could an independent Macromedia have been better stewards of Flash than Adobe, leading to a world today where Flash wasn't deprecated? Absolutely. Would it have? We'll never know. Flash had a number of issues that led to its death, and it's not clear if an independent Macromedia, with a different internal developer and business culture from Adobe, could have fixed all of them, resulting in a different future, or if they even needed to be fixed for that future to happen.

Looking at Adobe's poor stewardship of PDFs, however, it's hard to see positives to Adobe-owned Macromedia and Flash.


    Flash could have become an open web standard
    driven by a programming language that isn't 
    javascript, which we're all now forced to use 
    due to browser support.
Agreed on the impossibility of discussing what-ifs.

Obviously, they could have done anything. =)

Ultimately, though, I guess what I'm asking is whether there were any hints about how Macromedia would have done things differently, had they remained independent, particularly in the direction of making Flash an open web standard.


If I'm not mistaken, Steam can't be installed on a case sensitive FS on a Mac.


Our neighbors at the fine journal of POC||GTFO are distinguished in PDF manipulation and polyglots. https://www.alchemistowl.org/pocorgtfo/


Awww! It's like phrack and 2600 had a pdf baby! How ugly!


I believe esoteric is a good word for the series as well.


Cromulent.


I had an employee once submit an algorithm document, written in pure Postscript.

The charts were actually executable Postscript, running the algorithm.

One of the coolest things I ever saw.


I think this is the Alan Kay future of computing. Right now we're in this weird hybrid state where we still work with digital documents primarily using the physical paper interface.

Imagine digital academic "papers" in STEM fields that natively ran the simulations the paper was describing. Jupyter sort of delivers that, but it still feels like early days for interactive digital-first documents (or as Steve Jobs has been credited for saying, "bicycles of the mind").


Why compute twice? Waste of resources. Some simulations also have serious hardware requirements that might not even be possible to satisfy locally.


While a good point, at the moment the balance is shifted much more towards dead media than towards wasted resources. At best, the document doesn't get as much engagement as it could. At worst you get non-reproducible research papers, where you're really lucky if you can find the code in open access and compile it, let alone get the same results.

And sure, some simulations are very heavy, but they are more the exception. It's also possible to have the best of both worlds and make both the simulation and a static snapshot available.


> At worst you get non-reproducible research papers

Ability to rerun programs is great, but we should be careful to remember that it's a different thing than reproducibility.


Often it’s the first step to reproducibility though. An enormous amount of scientific effort goes into figuring out how a researcher did something they published.


Basically adding another whole project on top of the other project this way.

Imagine trying to figure out some 2001 JS paper thing, for example. But applied to every generation of technical development.

There are always standards, of course, but we’ve seen those go sideways enough times to make one cringe at the thought of ‘dynamic papers’ via some new medium.

The kind of thing that sounds amazing on the surface then you remember the sort of crazy IT depts that thousands of universities run and forget the whole thing.


> Often it’s the first step to reproducibility though.

It shouldn't be! Reproduction needs to involve the interaction of human brain meats with a human level description of the solution. This is how we make sure that people aren't talking about something different than what was actually done, and how we make sure our conclusions are robust against the things we've failed to specify.

Imagine saying the same thing for physics: I start replication by running a time machine and using the same apparatus as the original experiment under the same conditions. Impracticality aside, this would be potentially useful to suss out fraud and certain kinds of errors, but what successful replication tells us is manifestly less powerful than successful replication on a new apparatus in a new location at a new time, with new values for everything we've failed to control.


Today was the first time I encountered a paper with a Docker image. Fantastic to be able to try it out with no effort.

I suppose this only works in a few fields though.


Not everything is resource constrained, though. Imagine being able to easily make interactive content that illustrates what you're trying to convey and allows the user to "play with it."

For things that are heavily resource constrained, it still could be a boon to have interactive access to the data that comes out of it.


Even if it's not practical to re-run all of the computation, in many cases it would be nice to have the output data stored in the document in a form where you can interact with it rather than just having static pixels.


It’s possible to also include the results, so no dilemma there. (I think current notebook formats already do this.)


Even for non-academic reporting: imagine if, instead of 'dead' news articles about some tax reform, or climate change, or whatever, you had an interactive model you could play with (and, for example, plug in your own numbers if you disagree with some of the inputs).


Sorry to horrify anyone, but we actually do this at work (mechanical engineering company) - JavaScript-calculated component dimensions, shown as form fields based on user input (e.g. pressure or load rating), overlaid on technical drawings.

The reason it's done in PDF is that a lot of our technical documentation is spat out in PDF format (generated from CAD - SolidWorks).

There are other options like Traceparts or setting up a variable input SolidWorks model to generate loads of static outputs, if you have the time and money.
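
For anyone curious, one of those calculation scripts is just Acrobat-style JavaScript along these lines (the field names, units and formula here are made-up placeholders, not actual sizing rules):

  // Calculation script attached to a hypothetical "wallThickness" field.
  // Re-runs whenever the inputs change and writes the result into the field.
  var p = Number(this.getField("designPressure").value); // e.g. bar
  var d = Number(this.getField("boreDiameter").value);   // e.g. mm
  var allowableStress = 235;                              // placeholder material constant

  // Thin-wall approximation, purely illustrative
  event.value = (p * d) / (2 * allowableStress);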


Tons of articles in NYT, WaPo, FiveThirtyEight, and ProPublica have these. ProPublica also open-sources all their data and code on GitHub.


Good news: the software you describe has existed since 1985.


I think now we have a lot of things like this-- we have Jupyter, Matlab, etc, to create engineer-centric general purpose interactive documents. We have labor-heavy ways to make end-user focused ones in the browser. We have spreadsheets.

But-- wouldn't it be cool if there was a way ordinary people could create interactive content to interact with data in a rich, intuitive way?


Why can’t ordinary people use Jupyter? Or put another way, what’s missing from Jupyter that would get ordinary people to use it?


> Why can’t ordinary people use Jupyter?

Because it's not installed, and they don't want to and shouldn't have to learn something new when there's something not new already at hand which suffices.

If you ever find yourself saying something like, "people can just do X" and wondering why they don't, turn it around and ask yourself, "why can't I just do Y?" In this case, that would be, "Why can't I just make my notebooks work in the viewers that everyone already has agreed upon using (i.e. the WHATWG/W3C hypertext system, i.e. the web browser) instead of asking them to futz around with installing and learning Jupyter?". When you start making excuses for why not, that's the moment you should be able to understand another person's reasons for not using Jupyter.


My feelings about this aspect of Jupyter are two-fold:

1. On the creation side, it requires someone be comfortable with Python (or another Jupyter language) to some degree. Right now, programming is still considered a career skill rather than something "ordinary people" should be expected to know. Perhaps layering a graphical programming interface on top of this, which UE4 seems to have had some success with via their Blueprint system, would get "ordinary people" over the mental hurdle of being intimidated by code-as-text. Just look at the mental gymnastics people will engage in with Excel while thinking it's not programming.

I see this as more of a social problem than a technical one, at any rate.

2. Once you build an interactive Jupyter document (especially if you use interactive widgets), it's not necessarily that easy to share in its original state without requiring the reader to also have a Jupyter environment set up or access to a server running Jupyter. I would like to be able to share the document in a way that can be accessed offline by someone without them needing to set up the whole environment. Maybe an "Adobe Reader"-like application for Jupyter notebooks that "ordinary people" can just install with a click?


re #1: I think it's a technical problem too. I'm technically competent and enjoy programming, but I'd still like it if sometimes I could ask questions and get answers with less or no code. BI platforms are a pain in the ass for many reasons, but they often make it very easy to ask simple questions and organize the data in simple ways. A document that could do similar things without all the scaffolding would be cool.

#2-- Or just use the browser. It's capable enough, even if large datasets are somewhat problematic. The hard thing is the UI and identifying what the correct subset of functionality to surface is.


How can I send a Jupyter page as a standalone, offline document?


Matlab doesn't even have proper text (Unicode) support… (And Octave even less so.)


OK, I'll byte (pun intended): which software are you referring to here?


Sounds like a spreadsheet.


This already exists, with a focus on machine learning: https://distill.pub/


I would already settle for non-obsolete animation support:

- GIF is obsolete (~100x heavier than MP4 in my use-case, so out of the question)

- MP4 has poor support in PDF readers

(- Besides, PDF is not appropriate for electronic documents.)

- EPUB doesn't seem to support MP4 at all

(- EPUB does support PNG, not sure about APNG, will have to try it out...)

- MHTML=EML support has been dropped from browsers, which is completely baffling to me. There are alternatives like SingleFile, but they feel like dirty hacks: https://addons.mozilla.org/en-US/firefox/addon/single-file/

- What future is there for AV1 support?


MHTML is a neat format, it's unfortunate it never got much steam. I think it could have been more popular if web browsers had defaulted to it when saving pages, rather than this weird html + _files/ directory (which on Windows is mysteriously linked so that when you delete one, you delete the other - no idea how they do that!).

What I've read of EPUB is also pretty disappointing. Seeing as it's a compiled format, once again, instead of going the zipfile + bunch of html inside + specific layout, we could have had a subset of html in .mhtml.gz with, like, metadata in a <script type="application/json" id="x-epub-metadata">. And then, guess what, web browsers could have been able to read it natively…
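
Reading that kind of embedded metadata back out would then be trivial for the browser (a sketch; the element id matches the hypothetical one above, and the title/author fields are assumed):

  // Parse metadata embedded directly in the page - no sidecar files needed.
  const meta = JSON.parse(
    document.getElementById('x-epub-metadata').textContent
  );
  console.log(meta.title, meta.author);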


> (which on Windows is mysteriously linked so that when you delete one, you delete the other - no idea how they do that!)

Probably just some special code in Windows Explorer watching for the combo of .htm(l) file plus similarly-named folder – via the command line I can delete just the HTML file or the folder separately without problems.


I’ve been surprised by what you can accomplish with data URIs. Embedding an MP4 can work great, but your text editor will likely hate it.
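
For instance something like this, assuming the MP4 is already Base64-encoded in a variable (the variable name is made up):

  // Point a <video> element at a self-contained data: URI instead of a file.
  const video = document.createElement('video');
  video.controls = true;
  video.src = 'data:video/mp4;base64,' + mp4Base64; // mp4Base64 holds the encoded bytes
  document.body.appendChild(video);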


I've been doing a lot of research that applies here. The answer comes down to a few things:

1. using vector graphics wherever possible and then encoding it as SVG

2. if bitmap graphics are absolutely required and they can be procedurally generated, then do that

3. if large photographic data, video, or any other kind of data is required that can't be handled using the above steps, then separate that data set as you normally would using the file system directories, place the data set subtree into a ZIP archive, write your code so it references items by file paths relative to the ZIP, and then put your page into the root of the ZIP file, too, e.g. as index.html—your readers and reviewers follow along by using their system's native ZIP support to explore the contents of the ZIP file so they can locate index.html and then double click it, and index.html opens up with an "open dataset" button which you use to then feed in its own parent ZIP archive

The last part might sound complicated, but it's not much different from asking someone to use MS Office or VS Code or an IDE to open a file/project. (It's just that instead of requiring them to already have that IDE installed, you're giving them the IDE they need at the same time that they're getting the document/dataset they're actually interested in.)

These approaches are robust enough that they're very unlikely to be broken by future browser changes. It's not that the tech is lacking right now, it's that human habits are lagging behind and we haven't yet established this as a cultural norm/protocol/expectation.
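
A sketch of what that "open dataset" button inside index.html might look like, assuming a ZIP library such as JSZip is bundled with the page (the element id, the archive path, and renderCharts are all hypothetical):

  // "open-dataset" is an <input type="file"> the reader points at the parent ZIP archive.
  document.getElementById('open-dataset').addEventListener('change', async (ev) => {
    const archive = ev.target.files[0];              // the ZIP the reader selected
    const zip = await JSZip.loadAsync(archive);      // JSZip assumed to be bundled with the page
    const csvText = await zip.file('data/measurements.csv').async('string');
    renderCharts(csvText);                           // hypothetical plotting function
  });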


There are also other situations where the data is neither procedurally generated nor large enough† to warrant this kind of treatment: photographs, video, (non-MIDI) sound…

†IMHO, as long as your document doesn't cross 10 MB, you shouldn't have to separate the data…


I don't understand your comment. It sounds like an argument against a process for manually creating these kinds of files, which is not at all what my comment was about. It was about accessibility, real-world engineering, and describing a file format/packaging convention.

The packaging convention I described is similar to the container formats used and created by MS Office apps. The difference is that DOCX, XLSX, etc rely on XML instead of HTML that can be used without requiring a separate proprietary app. People create and exchange those files every day (even for things as trivial as a single-page flyer) without knowing or caring about whether it should "warrant this kind of treatment". Worrying about a purported edge case for <10 MB(?) of data sounds like an imaginary concern.


My bad, I had indeed misunderstood what you were saying.


> Embedding an MP4 can work great, but your text editor will likely hate it.

Well, LibreOffice Writer deals with (multiple, 100 KB < size < 10 MB) MP4s just fine. It's when the ODT is converted to PDF that most(?) PDF readers seem to be unable to read those MP4s properly.


> I’ve been surprised by what you can accomplish with data URIs.

Yeah, if I'm not mistaken, this is what SingleFile uses?


Someone a year or two ahead of me in college put together something that calculated and printed a detailed Hilbert curve in PS - not that impressive now, but it took a hellaciously long time to run on the first LaserWriter with PS support.


IIRC that was worth doing because the first LaserWriters shipped with generally beefier configs than the general-purpose personal computers which sent jobs to them.


3D objects in PDFs are cool. My thesis used those in a few places. The PDF would print normally, but you could rotate it when open in Adobe Reader.

Getting this to work with Latex was... interesting. I spent a lot of time typesetting as a grad student.


Do you know of any automated way of extracting 3D objects from PDFs? I'm a dentist by profession and I've worked with various 3D and CAD/CAM systems. I have an intra-oral scanner that captures a 3D colour model of the inside of your mouth. The sad thing is the entire system is a walled garden. It uses its own 3D format (.dxd) and only offers STL as an export format, which doesn't contain any colour information. I worked around this by first exporting to a 3D PDF file. Then I use Sumatra PDF [1] to MANUALLY extract the 3D model in U3D format. U3D is a very obsolete format that almost no 3D authoring program can read. So I have to use (yet another) proprietary piece of software [2] to convert it to a common 3D format like PLY or 3DS, or even to WebGL [3].

[1]: https://www.sumatrapdfreader.org/free-pdf-reader.html [2]: http://www.finalmesh.com/ [3]: https://khoadabest.surge.sh


Besides the other ideas in this subthread, the first thing that springs to mind for me is scanning a bunch of random objects, converting the models to as many 3D formats as you reasonably can, and dumping everything on GitHub along with reference photos of the objects.

I'm personally idly curious, but have no experience with reverse engineering or 3D or file formats... so the emphasis on my end is idle curiosity :). But it's possible that many such people poking around may still generate interesting leads.

Depending on how effectively intraoral scanners can scan things other than teeth, offering to scan random objects people send/bring in, on a best-effort/no-warranty basis, may also generate practical interest.

(Also, wow, looks like these things are in the $25k range?)


I think this is a pretty nice idea. I will let you know once I've set this up. And just FYI, these expensive machines are actually around $50k. The $25k range is for scanners that have no color and require you to coat the tooth with a layer of powder to prevent reflections from interfering with the scanner.


Wow, nice :) I can imagine color being incredibly helpful... and not needing powder certainly makes the process more user friendly and less intrusive.

Also, a very small extra thought, scanning extremely simple objects like cubes and flat planes may make the analysis process slightly easier because the data in the file will be easier to pars--wait. Okay I have more ideas.

Can you convert/import arbitrary 3D data into the proprietary dxd format? If there is any way to do this, there is nothing else that will move the analysis process as far forward as quickly, and offer the best chances of producing the most complete result. This is because a) the data files will have 99% less complexity due to being synthetic and not containing noise associated with real-world data, b) they'll be full of reference points from known 3d models, and c) entirely controllable input data gives the highest chance of figuring out all values/fields in the model files.

If this is possible, chances are most imports would be user requests based on the analysis process ("does changing this value alter this byte?"). Initial ideas I can think of would be the 3D Teapot, a single pixel :D, and simple cubes, triangles and planes.

Lastly, coordinating a backup installation of the scanning software onto a dedicated machine, or moving the main install onto such a machine, that enterprising reverse-engineers could connect to remotely (ideally at any time of day, and obviously after privately negotiating credentials) and install debugging tools (read: IDA/Ghidra/etc) onto, would likely be extremely helpful, and should provide the best "how we reversed this" narrative with regards to licensing. This would simplify the import request situation too.

If importing is not possible, IDA et al may end up being necessary to understand certain complex details or possibly even get started. Solving the "generate interest" problem would naturally be more complex in this scenario though. :/

I think I've really exhausted my knowledge in this area now :), although I do remain interested in knowing how things go.


Hey, thank you for replying to this old thread. I got some time to scan some fake models, to eliminate any legal issues with publishing real patient data on the Internet.

I published all data in this Github repo: https://github.com/thangngoc89/dxd-file-format

I also tried to scan something simple like a sphere or a pencil, without any success. The software only recognizes tooth-like structures.

Luckily, it can export to STL files (100% triangles) that can be imported into other dental CAD software, so I hope it will help with the progress.


Wow, the colorization the software provides is seriously impressive.

It's regrettable but understandable that the software only recognizes/accepts teeth considering the postprocessing it clearly does.

And CC0ing the model data is pretty much the textbook approach to analytical freedom :)

(And just to confirm, STL/etc->DXD isn't possible?)


Yes, it is possible to go from STL to DXD. But last time I tried that feature, the software crashed. I will try to do it again when I'm back at the office.

Thank you for reminding me about this.

Quick update: opening the DXD file with a hex editor, there is an XML file defining the metadata of the current file and a 1024-bit RSA public key. I've been scouring around to find the private key with no success.


Just saw this, sorry for the delay.

Hmmmm. Ideally that key is only being used for attestation/authentication, not encryption. In this case, you definitely don't want to locate the private key, because that key's confidentiality is what verifies the integrity of the scans made by your device.

Also, said private key might be specific to your copy of the software to create a chain of custody to your machine for medical purposes, or even more likely for licensing reasons.

In any case, if it's being used for encryption, that would amount to an unfortunate DRM situation that might be a bit of a hornet's nest to fiddle with, because of the high likelihood the key is being used for license enforcement, etc. (tracking scans made by copies of software deemed illegitimate, etc.).

It's very cool you can go from STL to DXD though. Now I'm curious, was the STL file that crashed the software originally generated from a DXD file created by the software? It originally being a DXD should be irrelevant, but chances are the pipeline inside the software chokes on things that aren't models of teeth. This does admittedly make the reverse engineering process trickier...


I don't, sorry. From what you wrote, you definitely seem more knowledgeable than me in this area.


Thank you for your input. I forgot to mention in the original post that there is a tool called pdf-parser.py [1] which claims to be able to do that, but it produces broken output. I don't know enough about Python or PDF internals to hack on it. Posting it here and hoping that the HN crowd can point me in the correct direction.

[1]: https://blog.didierstevens.com/programs/pdf-tools/


I'd like to talk to you about this, but you don't have any contact details in your profile. You can find my email address in my profile.


Thank you very much. I updated my profile with an email address. Nevertheless, I emailed you via your contact details.


Top tip: install mutool and run

  mutool clean -d your.pdf clean.pdf

Now open clean.pdf with a text editor.


That's really a top tip! Thank you very much. It looks like the original file is compressed using FlateDecode. Passing it through mutool decompresses all streams and lets the parser do its job.

Thank you!


Great! Glad it worked. Happy to help you unpick things a bit further. When you look inside the PDF file you’ll see that it’s actually a “tree” of “things”. Each one starts with something like “1234 0 obj”. And they reference each other to build the structure. So for example, the document is made of a list of pages. So that’s one object. And each page is another object. And then each page is made of a bunch more stuff, and so on. Somewhere in there, no doubt, you’ll find an object that’s your model.
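
If eyeballing it gets tedious, a rough Node.js scan along these lines can pull the model's stream bytes out of the clean.pdf produced by mutool; it assumes the streams really are uncompressed now and that the 3D stream's dictionary contains a /U3D token, which is worth checking by hand first:

  // dump-u3d.js - rough scan of a decompressed PDF (output of `mutool clean -d`).
  const fs = require('fs');
  const pdf = fs.readFileSync('clean.pdf');

  const at = pdf.indexOf('/U3D');
  if (at === -1) throw new Error('no /U3D token found - the dictionary may use another name');

  // The stream data starts after the "stream" keyword and its end-of-line marker.
  let start = pdf.indexOf('stream', at) + 'stream'.length;
  if (pdf[start] === 0x0d) start++; // skip CR
  if (pdf[start] === 0x0a) start++; // skip LF
  const end = pdf.indexOf('endstream', start);

  fs.writeFileSync('model.u3d', pdf.slice(start, end));
  console.log('wrote ' + (end - start) + ' bytes to model.u3d');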


Any chance you could get me some dxd files? I'd love to take a stab at reverse-engineering this and writing a direct converter to something standard. Feel free to email them to me (email in bio).


Absolutely yes! I don't have any files that don't contain sensitive patient information on my laptop at the moment, but I will create new files when I'm back at work on Monday. I will email you when I have the files.


I'd look at the 3D JS API via the JS bridge that I helped write: https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdf...


Interestingly enough, one of my early jobs was pretty much the opposite: writing a U3D encoder from spec and then using a commercial C++ PDF manipulation library to inject the results into PDF files.

I am sure there are libraries in Python or JS nowadays; it’s just a question of parsing the tree to find the U3D node and dumping it out, very simple.


HN is really the only place where you can ask a question and receive answers from someone who has actually worked on the problem before. And you're correct that all I need to do is find the U3D node and dump it out. See my response in the parent thread about using pdf-parser.py.


Getting it to work in LaTeX is easy if you use Asymptote: https://asymptote.sourceforge.io/

Back in 2011 I used it to make a whole lot of figures for a multivariable calculus course; they're still in use.


I am considering animations for my thesis. When printing, a designated frame should be used, but inside a reader, the animation should work. I am writing my thesis in LaTeX too. Any pointers?


You'd probably be better off investing the time in the preparation of a couple of beautiful Jupyter notebooks. That's where people expect interactivity and code to happen, not in PDFs. In my scientific community, virtually nobody uses Adobe Reader (people on Mac use Preview.app, people on Linux use poppler/xpdf/evince, browsers have their own internal readers).

Edit/Addendum: Crafting an interactive website (i.e. without the dependency on Jupyter) might be more future-proof.


Idyll (https://idyll-lang.org/) is a very promising tool in this field.


Thanks for the link! I think this is tightly connected to Donald Knuth's literate programming paradigm, i.e. there are also platforms which generate beautiful reports out of the embedded comments in your traditional source code. However, you won't get the interactivity for free.

I personally prefer Jupyter because it separates the programming language (Python, R, C++, etc.) from the representation (for instance the web) and still allows interactivity to a certain degree (given a backend running the source code).


This is an interesting project, thank you!


He didn't say anything about interactivity though. But even a lower bar than this, just animation, is not currently cleared by the available document formats.

(And a website doesn't fit the requirements as it's not contained in a single file, so its archival is a lot more complicated.)


Unfortunately my school expects a PDF thesis. But you make a good point about popular alternate readers not supporting animations - maybe this is a wasted effort. Thank you! Probably better to link to notebooks or videos of the animations on vimeo/youtube.


Yes, PDF is the standard. Some people manage to generate HTML out of their PDFLaTeX code. This could be a nice starting point for an enriched reading experience in the web browser. However, with standard print-first LaTeX, I could never generate HTML for any nontrivial documents (especially large documents with many hundreds of pages).

Putting your animations in traditional video formats (MP4, Ogg) or on Vimeo/YouTube is probably the best way to make them accessible to most people. Many scientific labs have their own YouTube channels.


Many 3D authoring programs like Blender and Meshmixer can output U3D or PRC that you can embed into 3D PDF files. There just aren't many tools that can read the format, though. And beware that only Adobe Reader can show the 3D object.


Thank you! Yes, I wasn't thinking about the read-time support.


Note that PDF 3D models have a static image (a bitmap) which readers that don't support 3D (most of them) will show instead. Actually Adobe Reader shows the static image too, until you click on it to activate the 3D rendered version.


I had a similar issue recently (for a much smaller project though). The sad reality is that it looks like we currently don't have an actual, working, properly supported standard for electronic documents, one which would include something as (relatively) basic as animation support: https://news.ycombinator.com/item?id=25612066


PDF attachments are very useful for lossless steganography. Image-based techniques get lost in recompression (e.g. Save To Camera Roll on an iPhone, or sending via Facebook message). PDF attachments don't get lost in that way.

Want to include the CSV raw data with your report? Just add it as a PDF attachment.

Want to hide a game with your homework? Add it as a PDF attachment. Chrome and Preview on Mac don't show that it exists, but Firefox can be used to extract the file.

It's not going to shock anyone to have a 5 MB file as a PDF, but there's a lot you can hide in there (MP3s, games, HTML files including more JavaScript, whatever else).

On the surface, everyone thinks it's just another PDF. But the real data is hiding in plain sight.


> It's not going to shock anyone to have a 5 MB file as a PDF

If I see a PDF containing a page or so of text and it turns out to be several MB, I would become a little suspicious. But you're right that most people are not aware of the general size of things.


With PDF, it would not be hard to obscure the fact that there is only a minimum of legitimate content, by using many page breaks to give the illusion of a long document and filling those pages with space-hogs such as large headers, tables, and algorithmically generated graphics. The sparsity of genuine content would not be too surprising given that PDF was originally intended for printing rather than reading.


Just add a few pictures that double the weight, and nobody is going to notice.


Using a standard feature to embed data isn't steganography.

It's probably underutilized overall, but there's nothing hidden about it when most viewers show the data.


Fair enough that it isn't steganography, though it can be used for similar applications.

Do "Most viewers" show it? Google Chrome doesn't, nor does Preview on a Mac. There is no easy way to add attachments to a PDF, except Adobe Acrobat Pro or iTextPDF. Firefox and Adobe Reader can read the attachments, but it's "hidden" to some degree, inside slide-out side menus. Certainly enough to avoid a casual glance.


Okular alerts you to the presence of attachments, IIRC. I wouldn't say "most" though.


Related: PostScript is a great stack-based language to learn to program in. A good initial exercise is to write a factorial function:

https://www-cdf.fnal.gov/offline/PostScript/BLUEBOOK.PDF

http://paulbourke.net/dataformats/postscript/


This is demoscene-worthy.

> and it gets replaced with a basic filled and bordered rectangle.

...also known as a pixel ;-)


Not too horrifying: when I open the Breakout PDF in Preview, it just displays a white page.

PDFs are a great format if you just ignore the dumb parts. :)


They’re not, though. Just try extracting the document structure to e.g. power an accessibility system like a screen reader, and you rapidly find out that the text is an unstructured bag of characters and positions with no semantic information at all. No paragraphs, no marked headings, not even word boundaries. You have to attempt to infer from proximity and relative sizing.


I'm not convinced there's anything you can do about that without losing what makes PDF such a useful format. One of the great things about a PDF is that you can drop a few pieces of paper into a scanner and end up with a PDF in seconds. That wouldn't be possible if you had to care about the underlying markup, as you do when e.g. writing html.

Adobe does have tools for creating PDFs that are accessibility-friendly, but it can take hours of work. As much as it sucks for certain audiences, it just doesn't make sense to do that in the general case.


Whatever source document you just scanned was almost certainly authored in a structured format.


To be fair, Preview’s handling of PDFs is somewhat horrifying itself.


PDF/A is the true PDF! Strips all the bloat away.


And fortunately we nowadays have validators that could be used to reject files with non-PDF/A features: https://verapdf.org/

Hypothetically a compliant reader is supposed to ignore non-PDF/A features encountered in files that declare themselves as PDF/A, so I've sometimes wondered if a cheap form of "sanitizing" PDFs would be to simply force their PDF/A flags on.


> https://verapdf.org/

That is one shitty site. Trying to shove Google Analytics down my throat, no contact information, no privacy page. Probably illegal under GDPR.

> so I've sometimes wondered if a cheap form of "sanitizing" PDFs would be to simply force their PDF/A flags on.

That's not really how PDF standards work. You'll have to "rewrite" the problematic parts; the standards are just for checking against the pre-defined ruleset.

In professional media production we do this "rewrite" all the time (PDF/X-standard). Though sometimes PDF files are just so "broken" that it's impossible to fix them.


> That is one shitty site.

Yes, I don't think it gets much attention - I should probably have pointed at the github org which is reasonably active. https://github.com/verapdf

> That's not really how PDF-standards work.

Well, it is how the standard works (don't make me dig out the relevant bit of what's publicly available from the standard) - the issue is whether common PDF readers actually do what they're "supposed to" or whether they just try and interpret as much as they can.


I worked on two aspects of this in my most recent position. I was responsible for implementing the JavaScript APIs and the embedding of arbitrary compressed file attachments in a web-based PDF editing SDK, according to the lengthy PDF spec. It was an interesting technical challenge and an eye-opening experience in terms of what I learned PDFs were capable of, and my immediate concern was some of the stuff this git repo talks about.


I have written an application in PostScript once.

As a newbie developer I decided to use PostScript to generate badges for all our employees. There was a list of employee names in a text file, there was a PostScript file with the program and a Perl script to join them together.

The PostScript program would take the names, generate 8 badges per A4 page, scale the name of the employee so that it fits the space perfectly, generate procedural background, etc.


This PDF triggers stuttering and then a resource-overuse tabkill for me on iOS, which is kind of impressive for a blank page.


This is exactly what I hate about Adobe. They're always cramming way too much functionality into their plugins, making them too heavy and riddled with security issues.

This is like Flash Player all over again. No way am I going to enable the proper PDF reader for web content viewing. There's a good reason browsers refuse to support all this.


This is Flash Player.

The underlying JavaScript-like language is ActionScript, which was originally developed by Macromedia to provide animation for Flash.

It's quite useful for creating PDF SmartForms that adjust their contents based on the user's responses. They were only viable in the official Adobe Reader until very recently, when Chrome decided to add support.

As far as security, blame Chrome for not incorporating an opt-in before allowing a particular PDF to run ActionScript in the browser.


ActionScript is not the Flash player though.

It is just a scripting language. So did they actually use Flash Player tech?


This is one of the clearest examples of feature creep I've ever seen. PDF is, as the name clearly implies, a protocol for portable documents. Yet it has grown over the years to be a de facto form protocol, with capabilities to do way more than a portable document should.


And it's not even a protocol for electronic documents, but for ones replicating paper documents!


Question to OP (thesephist): did you also get to this by checking Omar's profile from the TabOS link yesterday?


Some people do all the insane things :)


We had a collection of these internally in the early 2000s using notes, even Mandelbrot sets using embedded PS-based fonts. A lot of this comes from dynamic form requirements. The JS engine was the latest Mozilla engine at the time it came out: SpiderMonkey.


I've seen companies that use a fair amount of the PDF specification before. One of the most impressive was 3D models and scripted UI elements baked into the document. It kind of made the document look like JSCAD, with an actual 3D model you could manipulate.


I didn't expect this to be as 'Horrifying' as it was. Has anyone written a script yet to identify whether or not a given PDF contains executable script?


QubesOS has a "TrustedPDF converter" [0] that sanitizes a PDF to the extreme level for ultimate security - it converts the entire PDF to RGB pixmaps in an isolated virtual machine. The author has a blog post at [1]. Obviously you lose the ability to use the menu, search, copy or paste, but it's as 0day-proof as you can get for a horrifying file format.

[0] https://github.com/QubesOS/qubes-app-linux-pdf-converter

[1] https://blog.invisiblethings.org/2013/02/21/converting-untru...



pdfinfo in poppler does this


The remaining JS API in the Chrome viewer is to support enterprise users with JS form validation.


Flash is dead, long live the pdf!


I just learned that the Chrome PDF engine was from Foxit. Does anyone have more detail about this?


The article links to a Google+ post for some reason and I can't find any other info. Isn't Foxit proprietary? I haven't used Chrome or Chromium in a while, but last time I did, they seemed to use the same PDF viewer. How could they include it in Chromium?


Chrome open sourced Foxit as Pdfium: https://www.foxitsoftware.com/blog/?p=641


Thanks. To be precise, it's only the engine part of Foxit, the generator and renderer. Foxit itself is still pretty much proprietary software.

From the last 7 weeks of commit history, I see only people associated with Chromium. So maybe Foxit is not involved in developing PDFium anymore, or maybe they're not developing in the open and only sync in once in a while.

https://pdfium.googlesource.com/pdfium/+log


Can someone please explain to me the power of embedding a C compiler into a PDF?

Doesn't a compiler just output executables? Would we be able to run these executables? Where would these executables get stored?


Breaking out of the sandbox easily. Check the OS, do syscalls to read and write to the filesystem, install a reverse shell and CC.


This doesn't make sense. A C compiler that has been compiled to javascript is still just a regular javascript program. It's not given special access to anything.


The JS-based C compiler has no access outside the browser sandbox, but it is capable of generating actual executables which could potentially break out of the walled garden. That is why Microsoft ActiveX was deprecated as a security hazard and why their original proprietary browser was known as Internet Exploder.


Why go through the trouble of generating a blob of executable code at runtime rather than just including it in the JS source? The security guarantees are the same.


I now know exactly how to show off on college application résumés... think MIT uses Chrome? Finally, something to make up for my GPA.


A long time ago I wrote my résumé in PostScript. The text was in an abstract representation, to which an internal typesetting system applied paragraph and page filling and converted it into drawing commands; it could also output plain text for emailing, HTML, a script to mail itself, etc. I thought for sure Adobe would give me a job, but I don't think anyone ever saw it, because who would ever look inside such a thing? It became thoroughly irrelevant when everything became PDF. I'm not sure I even sent it to Adobe.

So if you do such a thing, realize you might only be doing it for your own enjoyment.


The PDF thickens...


One of my good friends did a lot of work on PDFs as part of his graduate research. Older versions of Adobe Writer (maybe even the current one too?) would always append and never overwrite. So if you edited pages, it would add those edits to the bottom of the file. As long as you did everything in the Writer workflow and didn't Save As a new file, you could see a history of old edits. You can even find stuff that's blacked out in some government documents.
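
You can spot this incremental-update behaviour yourself: each save appends a new body, cross-reference section and trailer ending in another %%EOF marker, so counting those markers (and truncating the file after an earlier one) will often recover an older revision. A rough Node.js sketch, with a made-up file name; whether the truncated file still opens depends on the reader:

  // revisions.js - count %%EOF markers and write out the oldest revision.
  const fs = require('fs');
  const pdf = fs.readFileSync('edited.pdf');

  const offsets = [];
  let at = -1;
  while ((at = pdf.indexOf('%%EOF', at + 1)) !== -1) offsets.push(at);
  console.log('found ' + offsets.length + ' revision marker(s)');

  if (offsets.length > 1) {
    // Keep everything up to and including the first %%EOF.
    fs.writeFileSync('revision-1.pdf', pdf.slice(0, offsets[0] + '%%EOF'.length));
  }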


I cannot recommend qpdf [1] enough if you want to play around with PDFs.

Aside from being an excellent PDF manipulation library, it also has a mode where it outputs a version of the PDF that is much easier to manipulate with a text editor, and then lets you build a new PDF from that.

Shout out to Jay, who has been steadily working on it for many, many years. He is the most kind, understanding and hard-working free software developer I've had the pleasure to cross paths with. Thanks for all your hard work, Jay!

[1] https://github.com/qpdf/qpdf


I don't see an ability to view PDF version history in this tool. Am I missing something?


From the docs:

QPDF is not a PDF content creation library, a PDF viewer, or a program capable of converting PDF into other formats.


But doesn't being able to see all streams mean that I can see edited/erased content from prior states of the document?


Not necessarily; there's nothing that says old content is preserved in inaccessible streams... the entire PDF file can be rewritten, discarding all old content.


This is by design and not surprising at all if you read even a tiny bit about PDF. It's in fact the default save method in nearly every piece of PDF-capable software. Rewriting the whole PDF is the less common method. I'm surprised a researcher of PDF would be surprised by that.

However, if you are using something like a redaction tool, then the software should forbid you from writing in append mode. This was a common error in old PDF apps, and perhaps in newer ones too.

Edit for politeness:

My surprise is aimed at the researcher, not you :)


The person you're replying to didn't do the research themselves. They said as much.

They were just sharing something they were surprised by/interested in. I was surprised to read that's how editing PDFs works too.


Yeah sorry if I'm unclear or misinterpreting. My comment is about the researcher not the person I replied to.

I do agree that it's surprising behaviour to regular users of PDF that it usually maintains a history of sorts. Apps should make this clearer.


The researcher probably wasn't surprised; it looks like the person you replied to was surprised. Perhaps I was surprised that you were surprised? :)


I'm surprised by all these appended suprises revealing a history of surprise :D


Surprisingly, the researcher might also have been surprised the first time they found out.


“You” is ambiguous in English.

https://www.merriam-webster.com/dictionary/you:

  1. the one or ones being addressed 
  2. ONE sense 2a (which is “being one in particular”)
So, pro tip: in chat-like discussions with strangers, such as on Hacker News, one should prefer saying “one” when using sense 2, even if it sounds a bit archaic (at least to me. Is it?)

Also, when reading a “you” that could be interpreted both ways, do not assume it is used in sense 1.


"One" is very archaic, I always fear it will be confusing for non-native speakers, and sound stuck-up to native speakers, and tend to avoid it.


Where did they say the researcher friend was surprised? Where did they say they were surprised?


Wow, that's kind of interesting and not the least bit surprising.

Wasn't there a search engine built for finding redacted PDF content? I think it made the headlines here a while back.


Searching for "pdf search" isn't finding anything significant.

"PDF drive" (https://news.ycombinator.com/item?id=25240373, 0 comments) just appears to be an ebook crawler over in the less-than-#FFFFFF-department if you get what I mean.

I also found a thread talking about searching PDFs for specific queries (https://news.ycombinator.com/item?id=10154527) which appears to have generated some interesting results back when the thread was posted, in 2015.

Not seeing anything recent though. But on the subject of a search engine specifically for finding redacted content, I couldn't help but imagine the discussion...

"Hi, I would like to find a •••••••."

"You specifically want a •••••••?"

"Yes, literally."

[Person 2 walks away scratching their head wondering what person 1 would do with a 'hunter2']


You might be remembering `Google PDF Search: “not for public release”` from 2015 [1] and 2019 [2].

[1] https://news.ycombinator.com/item?id=10154527

[2] https://news.ycombinator.com/item?id=20420209


might have been the one on reversing pixelation?


Isn’t programming fun?

There’s always a scary world lurking underneath it seems.


sounds more like an intentional backdoor


Not really; this type of saving changes at the end used to be fairly common (I assume for performance reasons on big docs, back when computers were much more constrained). Microsoft Word did the same thing back in the day.


It's not only common, it's still the way PDFs are usually saved. Open a PDF in a text editor (PDFs are text, not binary files) and you can see any edits appended as "trailers".


This sounds like a feature that could be exploited in creative ways, in either a product or some fun side project. I don't know what it is exactly, but there's something about never editing edit logs (possibly with this not being obvious to the user as a factor) plus some graphical UI representation or UX flow (besides undo).


I remember reading that the early versions of MS Word did the same thing, for performance reasons.


Blockchain FTW!


I was told once by someone in infosec that the PDF spec included a DOS emulator for some abstract thing.

That doesn’t appear to be exactly true, but it isn’t anywhere in the realm of the impossible, which is a serious issue for PDF.

I was hoping Foxit had dropped a lot of the BS spec parts, but it seems they don’t want to "lose out" to Acrobat on the features checklist. At least I know it’s easy to turn JS off via GPO with Foxit; Acrobat too, I assume?



