"It might be possible to embed a C compiler into a PDF by compiling it to JS with Emscripten, for example, but then your C compiler has to take input through a plain-text form field and spit its output back through a form field."
Even just basic text is... interesting in PDFs. A few years back, I created a PDF which displayed its own MD5 hash by making every single letter a separate chain of sub-documents and using MD5 collisions to change which sub-document each pointed to without changing the hash. Pretty much every PDF reader managed to interpret this cleanly as ordinary, copy-and-pasteable text because it wasn't any worse than they could expect to encounter in an ordinary PDF, and they all had robust heuristics for dealing with these kinds of shenanigans. (The exception I found was PDF.js, possibly due to the fact it was rendering the whole thing to HTML.) The only real issue was that every PDF reader had a slightly different idea of what characters I could safely use in the names of those nested documents.
Now that's an intriguing concept. "A file format for declaratively specifying a physical data-communication artefact, abstractly-defined by the interactions it supports."
• Just showing the user text? Compiles to plaintext.
• Get the user to give some input? Compiles to a styled form, as PostScript.
• Add radio buttons? Compiles to a physical form but with a 3D-printed notched slider glued to it.
• Require validation for freeform-text form fields? Compiles to a 3D-print + VLSI + pick-and-place specification for a tablet embedded-device that displays the form and does the validation.
Now imagine a "printer" that takes such abstract documents as input, and can print any of these... :)
Maybe if you bury the page deep within a forest, so the compiler could hook into the distreebuted CPU cluster in order to facilitate more effective computation.
I don't think the example had any practical use, really. I understood it more as an illustration of how weird Chrome's scripting support is: On the one hand, it lets you put programs as complex as a working C compiler in there - but on the other hand, interaction with the outside world is limited to putting stuff into text fields...
> also, is the "input" and "output" of this compiler just code and executables?
Mostly yes. I'm not sure how much of a typical build chain he was trying to convert to JS here, but the compiler itself typically takes a bunch of files with C code and outputs a number of "object files", which are really chunks of machine code. In an actual build process, you'd then use a linker to glue those object files together in the right way and make an executable.
I guess what you could do, if you wanted, was to include the whole build chain (including the linker) in the PDF, encode the executable as Base64, and fill some form field of the PDF with it. Then your workflow would be as follows:
1) Write some C code
2) Copy the C code into form field #1 of the PDF.
3) Hit a "compile" button or something. Form field #2 fills with what looks like an enormous amount of random gibberish (really the Base64-encoded binary)
4) Copy all the text from form field #2, use a program of your choice to decode the Base64, and save the decoded binary on your hard drive.
5) Run the binary on your hard drive and watch your C code execute. Hooray!
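The decode-and-save step of that workflow is a one-liner in most languages; here's a minimal Python sketch (the function name is my own, not anything standard):

```python
import base64

def decode_form_field(b64_text: str, out_path: str) -> bytes:
    """Decode the Base64 blob copied out of the PDF's output form
    field and write the raw binary to disk."""
    raw = base64.b64decode(b64_text)
    with open(out_path, "wb") as f:
        f.write(raw)
    return raw
```

At that point the "compiler in a PDF" has produced a real executable on your machine, with the form fields acting as a very awkward I/O channel.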
eval() is part of the JS language, so you obviously do. But regardless, you could write your own interpreter if necessary. You could compile to x86 and then run it in your own VM if you felt like it.
> it doesn't actually execute code, right? Then what's the power of having a compiler in a PDF? you can output the executable, but can you run it?
Depends what you mean by "run", really. You can write a full-on x86 emulator and execute a compiled binary there. But given that it's an emulator running in a nested series of sandboxes, it won't be terribly useful -- for example, it still won't have I/O capabilities.
Every time I see the Adobe logo somewhere I cringe a little bit: from the time you had to have Acrobat Reader installed because most PDFs created with Acrobat (the writer) weren't really compatible with other readers, to the time everything interactive on the web was in Flash (even our governmental websites; the Social Insurance Institution, for example, only dropped Flash a few days ago).
My SO recently bought Adobe Lightroom and, lo and behold, you cannot install it on a case-sensitive filesystem (in 2020), and the help page says: "well, just install it on a case-insensitive filesystem". I'm quite surprised that they allow file names longer than eight characters, a dot, and three for the file type...
It might have been different, but it's a tough question to answer because it's hypotheticals all the way down. There is a different version of history where Macromedia's two biggest products, Flash and Dreamweaver, took a different route, and neither died an ignominious death. Flash could have become an open web standard, driven by a programming language that isn't JavaScript, which we're all now forced to use due to browser support. Instead of using CSS for layout, we could be using something else. The cross-platform smartphone app ecosystem would look a whole lot different if iOS and Android both had built-in Flash interpreters.
Does this all sound like a fantasy? It should, because it is. Absent the history of it actually happening and being able to point at that, the question is akin to comparing two sports teams across history, eg the 2014 Golden State Warriors to the 2002 Mavericks and trying to talk through which team would win.
Could an independent Macromedia have been better stewards of Flash than Adobe, leading to a world today where Flash wasn't deprecated? Absolutely. Would it have? We'll never know. Flash had a number of issues that led to its death, and it's not clear if an independent Macromedia, with a different internal developer and business culture from Adobe, could have fixed all of them, resulting in a different future, or if they even needed to be fixed for that future to happen.
Looking at Adobe's poor stewardship of PDFs, however, it's hard to see positives to Adobe-owned Macromedia and Flash.
> Flash could have become an open web standard, driven by a programming language that isn't JavaScript, which we're all now forced to use due to browser support.
Agreed on the impossibility of discussing what-ifs.
Obviously, they could have done anything. =)
Ultimately, though, I guess what I'm asking is whether there were any hints about how Macromedia would have done things differently had they remained independent, particularly in the direction of making Flash an open web standard.
I think this is the Alan Kay future of computing. Right now we're in this weird hybrid state where we still work with digital documents primarily using the physical paper interface.
Imagine digital academic "papers" in STEM fields that natively ran the simulations the paper was describing. Jupyter sort of delivers that, but it still feels like early days for interactive digital-first documents (or, as Steve Jobs has been credited with saying, "bicycles of the mind").
While a good point, at the moment the balance is much more shifted towards dead media rather than wasted resources. At best, the document doesn't get as much engagement as it could. At worst you get non-reproducible research papers, when you're really lucky if you can find the code in open access and compile it, let alone get the same results.
And sure, some simulations are very heavy, but they are more of exceptions. Also possible to have the best of both worlds, and have both a simulation, and a static snapshot available.
Often it’s the first step to reproducibility, though. An enormous amount of scientific effort is figuring out how a researcher did something they published.
Basically adding another whole project on top of the other project this way.
Imagine trying to figure out some 2001 JS paper thing, for example. But applied to every generation of technical development.
There are always standards, of course, but we’ve seen those go sideways enough times to make one cringe at the thought of ‘dynamic papers’ via some new medium.
The kind of thing that sounds amazing on the surface then you remember the sort of crazy IT depts that thousands of universities run and forget the whole thing.
> Often it’s the first step to reproducibility though.
It shouldn't be! Reproduction needs to involve the interaction of human brain meats with a human level description of the solution. This is how we make sure that people aren't talking about something different than what was actually done, and how we make sure our conclusions are robust against the things we've failed to specify.
Imagine saying the same thing for physics: I start replication by running a time machine and using the same apparatus as the original experiment under the same conditions. Impracticality aside, this would be potentially useful to suss out fraud and certain kinds of errors, but what successful replication tells us is manifestly less powerful than successful replication on a new apparatus in a new location at a new time, with new values for everything we've failed to control.
Not everything is resource constrained, though. Imagine being able to easily make interactive content that illustrates what you're trying to convey and allows the user to "play with it."
For things that are heavily resource constrained, it still could be a boon to have interactive access to the data that comes out of it.
Even if it's not practical to re-run all of the computation, in many cases it would be nice to have the output data stored in the document in a form where you can interact with it rather than just having static pixels.
Even for non-academic reporting: imagine if instead of 'dead' news articles about some tax reform, or climate change, or whatever, you had an interactive model you could play with (and, for example, plug in your own numbers if you disagree with some of the inputs).
Sorry to horrify anyone but we actually do this at work (mechanical engineering company) - JavaScript calculated component dimensions as form fields based on user input (e.g. pressure or load rating) overlaid on technical drawings.
Reason it's done in pdf is a lot of our technical is spat out in PDF format (generated from CAD - SolidWorks).
There are other options like Traceparts or setting up a variable input SolidWorks model to generate loads of static outputs, if you have the time and money.
I think now we have a lot of things like this-- we have Jupyter, Matlab, etc, to create engineer-centric general purpose interactive documents. We have labor-heavy ways to make end-user focused ones in the browser. We have spreadsheets.
But-- wouldn't it be cool if there was a way ordinary people could create interactive content to interact with data in a rich, intuitive way?
Because it's not installed, and they don't want to and shouldn't have to learn something new when there's something not new already at hand which suffices.
If you ever find yourself saying something like "people can just do X" and wondering why they don't, turn it around and ask yourself, "why can't I just do Y?" In this case, that would be: "Why can't I just make my notebooks work in the viewers everyone has already agreed on using (i.e. the WHATWG/W3C hypertext system, i.e. the web browser), instead of asking them to futz around with installing and learning Jupyter?" The moment you start making excuses for why not is the moment you should be able to understand another person's reasons for not using Jupyter.
My feelings about this aspect of Jupyter are two-fold:
1. On the creation side, it requires someone be comfortable with Python (or another Jupyter language) to some degree. Right now, programming is still considered a career skill rather than something "ordinary people" should be expected to know. Perhaps layering a graphical programming interface on top of this, which UE4 seems to have had some success with through its Blueprint system, would get "ordinary people" over the mental hurdle of being intimidated by code-as-text. Just look at the mental gymnastics people will engage in with Excel while thinking it's not programming.
I see this as more of a social problem than a technical one, at any rate.
2. Once you build an interactive Jupyter document (especially if you use interactive widgets), it's not necessarily that easy to share in its original state without requiring the reader also have a Jupyter environment set up or access a server running Jupyter. I would like to be able to share the document in a way that can be accessed offline by someone without them needing to set up the whole environment. Maybe an "Adobe Reader"-like application for Jupyter notebooks that "ordinary people" can just install with a click?
re #1: I think it's a technical problem too. I'm technically competent and enjoy programming, but I'd still like it if sometimes I could ask questions and get answers with less or no code. BI platforms are a pain in the ass for many reasons, but they often make it very easy to ask simple questions and organize the data in simple ways. A document that could do similar things without all the scaffolding would be cool.
#2-- Or just use the browser. It's capable enough, even if large datasets are somewhat problematic. The hard thing is the UI and identifying what the correct subset of functionality to surface is.
MHTML is a neat format, it's unfortunate it never got much steam. I think it could have been more popular if web browsers had defaulted to it when saving pages, rather than this weird html + _files/ directory (which on Windows is mysteriously linked so that when you delete one, you delete the other - no idea how they do that!).
What I've read of EPUB is also pretty disappointing. Seeing as it's a compiled format, once again: instead of going the route of a zipfile with a bunch of HTML inside in a specific layout, we could have had a subset of HTML in .mhtml.gz with, like, metadata in a <script type="application/json" id="x-epub-metadata">. And then, guess what, web browsers could have been able to read it natively…
> (which on Windows is mysteriously linked so that when you delete one, you delete the other - no idea how they do that!)
Probably just some special code in Windows Explorer watching for the combo of .htm(l) file plus similarly-named folder – via the command line I can delete just the HTML file or the folder separately without problems.
I've been doing a lot of research that applies here. The answer comes down to a few things:
1. using vector graphics wherever possible and then encoding it as SVG
2. if bitmap graphics are absolutely required and they can be procedurally generated, then do that
3. if large photographic data, video, or any other kind of data is required that can't be handled using the above steps, then separate that data set as you normally would using file system directories, place the data set subtree into a ZIP archive, write your code so it references items by file paths relative to the ZIP, and then put your page into the root of the ZIP file too, e.g. as index.html. Your readers and reviewers follow along by using their system's native ZIP support to explore the contents of the ZIP file, locate index.html, and double-click it; index.html then opens up with an "open dataset" button which you use to feed in its own parent ZIP archive.
The last part might sound complicated, but it's not much different from asking someone to use MS Office or VS Code or an IDE to open a file/project. (It's just that instead of requiring them to already have that IDE installed, you're giving them the IDE they need at the same time that they're getting the document/dataset they're actually interested in.)
These approaches are robust enough that they're very unlikely to be broken by future browser changes. It's not that the tech is lacking right now, it's that human habits are lagging behind and we haven't yet established this as a cultural norm/protocol/expectation.
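As a rough illustration of the packaging convention in step 3 (the `data/` prefix and the in-memory construction are my own assumptions), building such an archive is a few lines of Python:

```python
import io
import zipfile

def build_dataset_archive(index_html: str, dataset: dict[str, bytes]) -> bytes:
    """Package an index.html at the archive root alongside a dataset
    subtree, per the convention described above. Returns the finished
    ZIP file as bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # the viewer/page goes at the root so readers can find it
        zf.writestr("index.html", index_html)
        # the dataset keeps its directory structure under data/
        for rel_path, blob in dataset.items():
            zf.writestr(f"data/{rel_path}", blob)
    return buf.getvalue()
```

The page inside then references its data by those same relative paths, which is what makes the archive self-contained.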
There are also other situations where the data is neither procedurally generated nor large enough† to warrant this kind of treatment: photographs, video, (non-MIDI) sound…
†IMHO, as long as your document doesn't cross 10 MB, you shouldn't have to separate out the data…
I don't understand your comment. It sounds like an argument against a process for manually creating these kinds of files, which is not at all what my comment was about. It was about accessibility, real-world engineering, and describing a file format/packaging convention.
The packaging convention I described is similar to the container formats used by MS Office apps. The difference is that DOCX, XLSX, etc. rely on XML, whereas HTML can be used without requiring a separate proprietary app. People create and exchange those files every day (even for things as trivial as a single-page flyer) without knowing or caring whether it "warrants this kind of treatment". Worrying about a purported edge case for <10 MB(?) of data sounds like an imaginary concern.
> Embedding a MP4 can work great, but your text editor will likely hate it.
Well, LibreOffice Writer deals with (multiple, 100 KB < size < 10 MB) MP4s just fine. It's when the ODT is converted to PDF that most(?) PDF readers seem to be unable to read those MP4s properly.
Someone a year or two ahead of me in college put together something that calculated and printed a detailed Hilbert curve in PS - not that impressive now, but it took a hellaciously long time to run on the first LaserWriter with PS support.
IIRC that was worth doing because the first LaserWriters shipped with generally beefier configs than the general-purpose personal computers which sent jobs to them.
Do you know of any automated way of extracting 3D objects from PDFs? I'm a dentist by profession and have worked with various 3D and CAD/CAM systems. I have an intra-oral scanner that captures a 3D colour model of the inside of your mouth. The sad thing is that the entire system is a walled garden: it uses its own 3D format (.dxd) and only offers STL as an export format, which doesn't contain any colour information.
I worked around this by first exporting to a 3D PDF file. Then I use Sumatra PDF [1] to MANUALLY extract the 3D model in u3d format.
U3D is a very obsolete format that almost no 3D authoring program can read. So I have to use (yet another piece of) proprietary software [2] to convert it to a common 3D format like PLY or 3DS, or even to WebGL [3].
Besides the other ideas in this subthread, the first thing that springs to mind for me is scanning a bunch of random objects, converting the models to as many 3D formats as you reasonably can, and dumping everything on GitHub along with reference photos of the objects.
I'm personally idly curious, but have no experience with reverse engineering or 3D or file formats... so the emphasis on my end is idle curiosity :). But it's possible that many such people poking around may still generate interesting leads.
Depending on how effectively intraoral scanners can scan things other than teeth, offering to scan random objects people send/bring in, on a best-effort/no-warranty basis, may also generate practical interest.
(Also, wow, looks like these things are in the $25k range?)
I think this is a pretty nice idea. I will let you know once I've set this up. And just FYI, these expensive machines are actually in the $50k range; the $25k range is for scanners that have no color and require you to coat the tooth with a layer of powder to prevent reflections from interfering with the scanner.
Wow, nice :) I can imagine color being incredibly helpful... and not needing powder certainly makes the process more user friendly and less intrusive.
Also, a very small extra thought, scanning extremely simple objects like cubes and flat planes may make the analysis process slightly easier because the data in the file will be easier to pars--wait. Okay I have more ideas.
Can you convert/import arbitrary 3D data into the proprietary dxd format? If there is any way to do this, there is nothing else that will move the analysis process as far forward as quickly, and offer the best chances of producing the most complete result. This is because a) the data files will have 99% less complexity due to being synthetic and not containing noise associated with real-world data, b) they'll be full of reference points from known 3d models, and c) entirely controllable input data gives the highest chance of figuring out all values/fields in the model files.
If this is possible, chances are most imports would be user requests based on the analysis process ("does changing this value alter this byte?"). Initial ideas I can think of would be the 3D Teapot, a single pixel :D, and simple cubes, triangles and planes.
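For generating that kind of controlled synthetic input, binary STL is simple enough to emit by hand. A minimal sketch (layout per the standard binary STL convention: an 80-byte header, a uint32 triangle count, then 50 bytes per triangle):

```python
import struct

def write_binary_stl(triangles, path):
    """Write a minimal binary STL file. `triangles` is a list of
    3-tuples of (x, y, z) vertices; normals are left at zero, which
    most importers recompute anyway."""
    with open(path, "wb") as f:
        f.write(b"synthetic test geometry".ljust(80, b"\0"))  # header
        f.write(struct.pack("<I", len(triangles)))            # count
        for v0, v1, v2 in triangles:
            f.write(struct.pack("<3f", 0.0, 0.0, 0.0))        # normal
            for v in (v0, v1, v2):
                f.write(struct.pack("<3f", *v))               # vertex
            f.write(struct.pack("<H", 0))                     # attribute bytes

# a single flat triangle -- about the simplest valid input imaginable
tri = [((0, 0, 0), (1, 0, 0), (0, 1, 0))]
```

Feeding files like this through the STL-to-DXD import (if it works) would give you DXD files with known, trivially simple geometry to diff against each other.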
Lastly, coordinating a backup installation of the scanning software onto a dedicated machine, or moving the main install onto such a machine, that enterprising reverse-engineers could connect to remotely (ideally at any time of day, and obviously after privately negotiating credentials) and install debugging tools (read: IDA/Ghidra/etc) onto, would likely be extremely helpful, and should provide the best "how we reversed this" narrative with regards to licensing. This would simplify the import request situation too.
If importing is not possible, IDA et al may end up being necessary to understand certain complex details or possibly even get started. Solving the "generate interest" problem would naturally be more complex in this scenario though. :/
I think I've really exhausted my knowledge in this area now :), although I do remain interested in knowing how things go.
Hey, thank you for replying to this old thread. I got some time to scan some fake models, to eliminate any legal issues with publishing real patient data on the Internet.
Yes, it is possible to go from STL to DXD. But the last time I tried that feature, the software crashed.
I will try to do it again when I’m back at the office.
Thank you for reminding me about this.
Quick update: opening the DXD file with a hex editor, there is an XML file defining the metadata of the current file, and a public 1024-bit RSA key. I’ve been scouring around to find the private key with no success.
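For pulling metadata like that out of an opaque blob without eyeballing a hex editor, a crude heuristic (my own, nothing DXD-specific) is to pair each XML declaration with the closing tag of the root element that follows it:

```python
import re

def extract_xml_chunks(blob: bytes) -> list[bytes]:
    """Find spans that look like standalone XML documents inside an
    opaque binary. Heuristic only: it assumes each '<?xml' declaration
    is immediately followed by the root element, and stops at the
    first matching close of that root."""
    chunks = []
    for m in re.finditer(rb"<\?xml[^>]*\?>\s*<(\w+)", blob):
        end = blob.find(b"</" + m.group(1) + b">", m.start())
        if end != -1:
            end += len(m.group(1)) + 3  # include '</name>'
            chunks.append(blob[m.start():end])
    return chunks
```

Nested elements sharing the root's tag name would fool it, but for a metadata header like yours it should isolate the XML cleanly for further inspection.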
Hmmmm. Ideally that key is only being used for attestation/authentication, not encryption. In this case, you definitely don't want to locate the private key, because that key's confidentiality is what verifies the integrity of the scans made by your device.
Also, said private key might be specific to your copy of the software to create a chain of custody to your machine for medical purposes, or even more likely for licensing reasons.
In any case, if it's being used for encryption, that would amount to an unfortunate DRM situation that might be a bit of a hornet's nest to fiddle with, because of the high likelihood that the key is being used for license enforcement, etc. (tracking scans made by copies of the software deemed illegitimate, and so on).
It's very cool you can go from STL to DXD though. Now I'm curious, was the STL file that crashed the software originally generated from a DXD file created by the software? It originally being a DXD should be irrelevant, but chances are the pipeline inside the software chokes on things that aren't models of teeth. This does admittedly make the reverse engineering process trickier...
Thank you for your input. I forgot to mention in the original post that there is a tool called pdf-parser.py [1] which claims to be able to do that, but it produces broken output. I don’t know enough about Python or PDF internals to hack on it, so I’m posting it here and hoping the HN crowd can point me in the right direction.
That's really a top tip! Thank you very much.
It looks like the original file is compressed using FlateDecode. Passing it through mutool decompresses all the streams and lets the parser do its job.
Great! Glad it worked. Happy to help you unpick things a bit further. When you look inside the PDF file you’ll see that it’s actually a “tree” of “things”. Each one starts with something like “1234 0 obj”, and they reference each other to build the structure. So, for example, the document is made of a list of pages; that’s one object. Each page is another object. And then each page is made of a bunch more stuff, and so on. Somewhere in there, no doubt, you’ll find an object that’s your model.
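FlateDecode is plain zlib/deflate, so the decompression step specifically can be done with the standard library. A quick-and-dirty sketch, not a real PDF parser (it naively scans for stream/endstream pairs and skips anything that isn't Flate-compressed):

```python
import re
import zlib

def inflate_streams(pdf_bytes: bytes) -> list[bytes]:
    """Find stream...endstream spans in raw PDF bytes and try to
    zlib-inflate each one. Streams using other filters (or images)
    simply fail to inflate and are skipped."""
    out = []
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        chunk = m.group(1)
        # the spec puts an EOL before 'endstream'; try both raw and stripped
        for candidate in (chunk, chunk.rstrip(b"\r\n")):
            try:
                out.append(zlib.decompress(candidate))
                break
            except zlib.error:
                continue
    return out
```

For a file like yours, the u3d payload would be one of the inflated blobs (or, if it isn't Flate-compressed at all, it sits verbatim between its stream/endstream keywords).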
Any chance you could get me some dxd files? I'd love to take a stab at reverse-engineering this and writing a direct converter to something standard. Feel free to email them to me (email in bio).
Absolutely yes! I don't have any files that don't contain sensitive patient information on my laptop at the moment, but I will create new files when I'm back at work on Monday. I will email you when I have the files.
Interestingly enough, one of my early jobs was pretty much the opposite: writing a U3D encoder from spec and then using a commercial C++ PDF manipulation library to inject the models into PDF files.
I am sure there are libraries in Python or JS nowadays; it's just a question of parsing the tree to find the u3d node and dumping it out. Very simple.
HN is really the only place where you can ask a question and receive answers from someone who has actually worked on the problem before. And you're correct that all I need to do is find the u3d node and dump it out. See my response in the parent thread about using pdf-parser.py.
I am considering animations for my thesis. When printing, a designated frame should be used, but inside a reader, the animation should work. I am writing my thesis in LaTeX too. Any pointers?
You'd probably be better off investing the time in preparing a couple of beautiful Jupyter notebooks. That's where people expect interactivity and code to happen, not in PDFs. In my scientific community, virtually nobody uses Adobe Reader (people on Mac use Preview.app, people on Linux use poppler/xpdf/evince, and browsers have their own internal readers).
Edit/Addendum: Crafting an interactive website (i.e. without the dependency on Jupyter) might be more future-proof.
Thanks for the link! I think this is tightly connected to Donald Knuth's Literate programming paradigm, i.e. there are also platforms which generate beautiful reports out of the embedded comments in your traditional source codes. However, you won't get the interactivity for free.
I personally prefer Jupyter because it separates the programming language (Python, R, C++, etc.) from the representation (for instance, the web) and still allows interactivity to a certain degree (given a backend running the source code).
He didn't say anything about interactivity, though. But even a lower bar than that, just animation, is not currently cleared by the available document formats.
(And a website doesn't fit the requirements as it's not contained in a single file, so its archival is a lot more complicated.)
Unfortunately my school expects a PDF thesis. But you make a good point about popular alternate readers not supporting animations - maybe this is a wasted effort. Thank you! Probably better to link to notebooks or videos of the animations on vimeo/youtube.
Yes, PDF is the standard. Some people manage to generate HTML from their PDFLaTeX code; this could be a nice starting point for an enriched reading experience in the web browser. However, with standard print-first LaTeX, I could never generate HTML for any nontrivial documents (especially large documents many hundreds of pages long).
Putting your animations in traditional video formats (mp4, ogg) or on vimeo/youtube is probably the best way to make them accessible for most people. Many scientific labs have their own youtube channels.
Many 3D authoring programs like Blender and Meshmixer can output U3D or RPC that you can embed into 3D PDF files. There just aren't many tools that can read the format. But beware that only Adobe Reader can show the 3D object.
Note that PDF 3D models have a static image (a bitmap) which readers that don't support 3D (most of them) will show instead. Actually Adobe Reader shows the static image too, until you click on it to activate the 3D rendered version.
I had a similar issue recently (for a much smaller project, though). The sad reality is that it looks like we currently don't have an actual, working, properly supported standard for electronic documents that would include something as (relatively) basic as animation support:
https://news.ycombinator.com/item?id=25612066
PDF attachments are very useful for lossless steganography. Image-based techniques get lost in recompression (e.g. Save To Camera Roll on an iPhone, or sending via Facebook message). PDF attachments don't get lost in that way.
Want to include the CSV raw data with your report? Just add it as a PDF attachment.
Want to hide a game with your homework? Add it as a PDF attachment. Chrome and Preview on Mac don't show that it exists, but Firefox can be used to extract the file.
It's not going to shock anyone to have a 5 MB file as a PDF, but there's a lot you can hide in there (MP3s, games, HTML files including more JavaScript, whatever else).
On the surface, everyone thinks it's just another PDF. But the real data is hiding in plain sight.
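Real attachments live in the PDF's /EmbeddedFiles name tree, which takes a proper library to write. As a stdlib-only illustration of the same "hiding in plain sight" idea, there's also the classic trick of appending a ZIP archive after %%EOF: PDF readers parse from the front and ignore the tail, while ZIP tools parse from the end and find the archive. (This is a polyglot, not a true attachment, so a rewriting tool would strip it.)

```python
import io
import zipfile

def append_zip_payload(pdf_bytes: bytes, files: dict[str, bytes]) -> bytes:
    """Return PDF bytes with a ZIP archive holding `files` appended
    after the document. The result still opens as a PDF, and any
    ZIP tool can list and extract the hidden files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, blob in files.items():
            zf.writestr(name, blob)
    return pdf_bytes + b"\n" + buf.getvalue()
```

Python's zipfile (like most ZIP implementations) tolerates arbitrary prefix data before the archive, which is exactly why this works.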
> It's not going to shock anyone to have a 5 MB file as a PDF
If I see a PDF containing a page or so of text and it turns out to be several MB, I would become a little suspicious. But you're right that most people are not aware of the general size of things.
With PDF, it would not be hard to disguise how little legitimate content there is, using many page breaks to give the illusion of a long document and filling those pages with space-hogs such as large headers, tables, and algorithmically generated graphics. The sparsity of genuine content would not be too surprising, given that PDF was originally intended for printing rather than reading.
Fair enough that it isn't steganography, though it can be used for similar applications.
Do "Most viewers" show it? Google Chrome doesn't, nor does Preview on a Mac. There is no easy way to add attachments to a PDF, except Adobe Acrobat Pro or iTextPDF. Firefox and Adobe Reader can read the attachments, but it's "hidden" to some degree, inside slide-out side menus. Certainly enough to avoid a casual glance.
They’re not, though. Just try extracting the document structure to e.g. power an accessibility system like a screen reader, and you rapidly find out that the text is an unstructured bag of characters and positions with no semantic information at all. No paragraphs, no marked headings, not even word boundaries. You have to attempt to infer from proximity and relative sizing.
I'm not convinced there's anything you can do about that without losing what makes PDF such a useful format. One of the great things about a PDF is that you can drop a few pieces of paper into a scanner and end up with a PDF in seconds. That wouldn't be possible if you had to care about the underlying markup, as you do when e.g. writing html.
Adobe does have tools for creating PDFs that are accessibility-friendly, but it can take hours of work. As much as it sucks for certain audiences, it just doesn't make sense to do that in the general case.
And fortunately we nowadays have validators that could be used to reject files with non-PDF/A features: https://verapdf.org/
Hypothetically a compliant reader is supposed to ignore non-PDF/A features encountered in files that declare themselves as PDF/A, so I've sometimes wondered if a cheap form of "sanitizing" PDFs would be to simply force their PDF/A flags on.
That is one shitty site. Trying to shove Google Analytics down my throat, no contact information, no privacy page. Probably illegal under GDPR.
> so I've sometimes wondered if a cheap form of "sanitizing" PDFs would be to simply force their PDF/A flags on.
That's not really how PDF-standards work. You'll have to "rewrite" the problematic parts, the standards are just for checking against the pre-defined ruleset.
In professional media production we do this "rewrite" all the time (PDF/X-standard). Though sometimes PDF files are just so "broken" that it's impossible to fix them.
Yes, I don't think it gets much attention - I should probably have pointed at the GitHub org, which is reasonably active. https://github.com/verapdf
> That's not really how PDF-standards work.
Well, it is how the standard works (don't make me dig out the relevant bit of what's publicly available from the standard) - the issue is whether common PDF readers actually do what they're "supposed to" or whether they just try and interpret as much as they can.
I worked on two aspects of this in my most recent position. I was responsible for implementing the JavaScript APIs and the embedding of arbitrary compressed file attachments in a web-based PDF editing SDK, according to the lengthy PDF spec. It was an interesting technical challenge and an eye-opening experience in terms of what I learned PDFs were capable of, and my immediate concern was some of the stuff this git repo talks about.
As a newbie developer I decided to use PostScript to generate badges for all our employees. There was a list of employee names in a text file, there was a PostScript file with the program and a Perl script to join them together.
The PostScript program would take the names, generate 8 badges per A4 page, scale the name of the employee so that it fits the space perfectly, generate procedural background, etc.
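The fit-to-width part of that badge generator can be sketched in a few lines. This is Python rather than PostScript, and the badge width, per-character advance, and size cap are invented numbers, not real font metrics from the original program.

```python
# Rough sketch of "scale the name so it fits the badge" plus the
# 8-badges-per-A4-page layout. All constants are hypothetical.

BADGE_W_PT = 260     # assumed badge width in points
CHAR_ADVANCE = 0.55  # assumed average glyph width per point of font size
MAX_SIZE = 36        # cap so short names don't become enormous

def font_size_for(name, margin=20):
    """Largest font size (capped) whose estimated text width fits the badge."""
    usable = BADGE_W_PT - 2 * margin
    estimated = usable / (CHAR_ADVANCE * len(name))
    return min(MAX_SIZE, estimated)

def badge_origins(page_w=595, page_h=842, cols=2, rows=4):
    """Lower-left corners of 8 badges on an A4 page (2 columns x 4 rows)."""
    w, h = page_w / cols, page_h / rows
    return [(c * w, r * h) for r in range(rows) for c in range(cols)]

print(round(font_size_for("Al"), 1))   # short name hits the cap: 36.0
print(len(badge_origins()))            # → 8
```

In the real PostScript version you'd use `stringwidth` to measure the name at a reference size and scale from that, rather than assuming an average advance.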
This is exactly what I hate about Adobe. They're always cramming way too much functionality into their plugins making them too heavy and riddled with security issues.
This is like Flash Player all over again. No way am I going to enable the full PDF reader for viewing web content. There's a good reason browsers refuse to support all of this.
The underlying JavaScript-like language is ActionScript, which was originally developed by Macromedia to provide animation scripting for Flash.
It's quite useful for creating PDF SmartForms that adjust their contents based on the user's responses. Until very recently they were only viable in the official Adobe Reader, before Chrome decided to add support.
As far as security, blame Chrome for not incorporating an opt-in before allowing a particular PDF to run ActionScript in the browser.
This is one of the clearest examples of feature creep I've ever seen. PDF is, as the name clearly implies, a format for portable documents. Yet over the years it has grown into a de facto forms platform, with capabilities far beyond what a portable document should have.
We had a collection of these internally in the early 2000s using Notes, even Mandelbrot sets using embedded PostScript-based fonts. A lot of this comes from dynamic form requirements. The JS engine was the latest Mozilla engine at the time: SpiderMonkey.
I've seen companies that use a fair amount of the PDF specification before. One of the most impressive was 3D models and scripted UI elements baked into the document. It kind of made the document look like JSCAD, with an actual 3D model you could manipulate.
I didn't expect this to be as 'Horrifying' as it was. Has anyone written a script yet to identify whether or not a given PDF contains executable script?
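A naive version of such a detector just scans the raw bytes for the dictionary keys that trigger script execution or automatic actions. Real malicious PDFs can hide these inside compressed object streams, so this is only quick triage, not a reliable scanner; the sample bytes below are a toy fragment, not a valid PDF.

```python
# Quick-and-dirty triage: look for PDF action/script keys in raw bytes.
# /JavaScript and /JS mark script, /AA and /OpenAction mark automatic
# actions, /Launch can start external programs.

SUSPICIOUS = [b"/JavaScript", b"/JS", b"/AA", b"/OpenAction", b"/Launch"]

def suspicious_keys(pdf_bytes):
    """Return the names of any suspicious keys found in the raw bytes."""
    return [k.decode() for k in SUSPICIOUS if k in pdf_bytes]

sample = (b"%PDF-1.7\n1 0 obj\n"
          b"<< /OpenAction << /S /JavaScript /JS (app.alert(1)) >> >>\n"
          b"endobj\n%%EOF")
print(suspicious_keys(sample))  # → ['/JavaScript', '/JS', '/OpenAction']
```

Tools like peepdf and pdfid take the same idea much further, decompressing streams before scanning.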
QubesOS has a "TrustedPDF converter" [0] that sanitizes a PDF to the extreme level for ultimate security - it converts the entire PDF to RGB pixmaps in an isolated virtual machine. The author has a blog post at [1]. Obviously you lose the ability to use the menu, search, copy or paste, but it's as 0day-proof as you can get for a horrifying file format.
The article links to a Google+ post for some reason and I can't find any other info. Isn't Foxit proprietary? I haven't used Chrome or Chromium in a while, but last time I did, they seemed to use the same PDF viewer. How could they include it in Chromium?
Thanks. To be precise, it's only the engine part of Foxit: the generator and renderer. Foxit itself is still pretty much proprietary software.
From the last 7 weeks of commit history, I see only people associated with Chromium. So maybe Foxit is no longer involved in developing PDFium, or maybe they don't develop in the open and only sync in once in a while.
This doesn't make sense. A C compiler that has been compiled to javascript is still just a regular javascript program. It's not given special access to anything.
The JS based C compiler has no access outside the browser sandbox, but it is capable of generating actual executables which could potentially break out of the walled garden. That is why Microsoft ActiveX was deprecated as a security hazard and why their original proprietary browser was known as Internet Exploder.
Why go through the trouble of generating a blob of executable code at runtime rather than just including it in the JS source? The security guarantees are the same.
A long time ago I wrote my résumé in PostScript. The text was in an abstract representation, to which an internal typesetting system applied paragraph and page filling and converted it into drawing commands; it could also output plain text for emailing, HTML, a script to mail itself, etc. I thought for sure Adobe would give me a job, but I don't think anyone ever saw it, because who would ever look inside such a thing? It became thoroughly irrelevant when everything became PDF. I'm not sure I even sent it to Adobe.
So if you do such a thing, realize you might only be doing it for your own enjoyment.
One of my good friends did a lot of research on PDFs as part of his graduate research. Older versions of Adobe Writer (maybe even the current one too?) would always append and never overwrite. So if you edited pages, it would add those edits to the bottom of the file. As long as you did everything in the Writer workflow and didn't Save As a new file, you could see a history of old edits. You can even find stuff that's blacked out in some government documents.
I cannot recommend qpdf [1] enough if you want to play around with PDFs.
Aside from being an excellent pdf manipulation library it also has a mode where it outputs a version of the pdf that is much easier to manipulate with a text editor and then lets you build a new pdf from that.
Shout out to Jay, who has been steadily working on it for many, many years. He is the most kind, understanding and hard-working free software developer I've had the pleasure to cross paths with. Thanks for all your hard work, Jay!
Not necessarily; there's nothing that says old content is preserved in inaccessible streams. The entire PDF file can be rewritten, discarding all old content.
This is by design and not surprising at all if you read even a tiny bit about PDF. Appending is the default save method in nearly every PDF-capable piece of software; rewriting the whole file is the less common method. I'm surprised a researcher of PDF would be surprised by that.
However, if you are using a tool like a redaction tool then the software should forbid you from writing in append mode. This was a common error in old PDF apps and perhaps contemporary ones that are new.
Edit for politeness:
My surprise is aimed at the researcher, not you :)
1. the one or ones being addressed
2. ONE sense 2a (which is “being one in particular”)
So, pro tip: in chat-like discussions with strangers, such as on Hacker News, one should prefer saying “one” when using sense 2, even if it sounds a bit archaic (at least to me. Is it?)
Also, when reading a “you” that could be interpreted both ways, do not assume it is used in sense 1.
I also found a thread talking about searching PDFs for specific queries (https://news.ycombinator.com/item?id=10154527) which appears to have generated some interesting results back when the thread was posted, in 2015.
Not seeing anything recent though. But on the subject of a search engine specifically for finding redacted content, I couldn't help but imagine the discussion...
"Hi, I would like to find a •••••••."
"You specifically want a •••••••?"
"Yes, literally."
[Person 2 walks away scratching their head wondering what person 1 would do with a 'hunter2']
Not really; this style of saving changes at the end used to be fairly common (I assume for performance reasons on big docs, back when computers were much more constrained). Microsoft Word did the same thing back in the day.
It's not only common, it's still the way PDFs are usually saved. Open a PDF in a text editor (a PDF's structure is mostly plain text, even if its streams aren't) and you can see any edits appended as extra cross-reference sections and trailers.
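The append-only behaviour is easy to spot mechanically: each incremental save adds a new body, cross-reference section, trailer, and `%%EOF` marker, so counting `%%EOF` markers gives a rough edit count. The bytes below are a toy fragment for illustration, not a valid PDF.

```python
# Heuristic from the discussion above: every incremental save appends
# another trailer and %%EOF, so extra %%EOF markers reveal past edits.

def incremental_updates(pdf_bytes):
    """Rough count of save generations beyond the original document."""
    return max(0, pdf_bytes.count(b"%%EOF") - 1)

original = b"%PDF-1.4\n(body)\ntrailer\n<<>>\nstartxref\n123\n%%EOF\n"
edited = original + b"(redaction overlay)\ntrailer\n<<>>\nstartxref\n456\n%%EOF\n"

print(incremental_updates(original))  # → 0
print(incremental_updates(edited))    # → 1
```

A count above zero doesn't prove anything sensitive survives, but it's exactly the signal that made the "blacked out but still present" government documents findable.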
This sounds like a feature that could be exploited in creative ways in either a product or some fun side project. I don't know exactly what, but there's something about never editing edit logs (possibly not being obvious to the user as a factor) plus some graphical UI representation or UX flow (besides undo).
I was told once by someone in infosec that the PDF spec included a DOS emulator for some abstract thing.
That doesn’t appear to be exactly true, but it isn’t outside the realm of possibility, which is itself a serious issue for PDF.
I was hoping Foxit had dropped a lot of the BS parts of the spec, but it seems they don’t want to “lose out” to Acrobat in the feature checklist. At least I know it’s easy to turn JS off by GPO with Foxit; Acrobat too, I assume?