A Technical and Cultural Assessment of the Mueller Report PDF (pdfa.org)
143 points by mpweiher on April 20, 2019 | 26 comments



I get why they did this.

Document formats have gotten so complicated that you have no idea whether the redaction software you use actually does its job. Going analog and back gives you a guarantee that you can't get otherwise.

For there to be guaranteed no leaks, the redaction software has to be bug-free. Leaks can take many forms, from the amount of free space between allocated regions to highly precise layout placement information that lets you reconstruct censored words by trial and error if you have a copy of the software that was used. So you can't really come up with a watertight formal definition of leak-freedom, which makes it impossible to prove that your software removes all leaks, at least for rich-text documents. The only way I see is to go full ASCII or something.
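As a rough illustration of the layout-leak point: if the width of a redaction box is known, candidates can be filtered by rendered width. This sketch assumes a monospaced font with a made-up glyph width; a real attack on a proportional font would need the per-glyph width tables from the embedded font, but the principle is the same:

```python
# Hypothetical sketch: narrow redacted-name candidates by box width.
# CHAR_WIDTH is an assumed monospace glyph width; real PDFs use
# proportional fonts, so per-glyph metrics would be needed in practice.
CHAR_WIDTH = 7.2  # assumed glyph width in points

def candidates_by_width(box_width, names, tolerance=0.5):
    """Return names whose rendered width fits the redaction box."""
    return [n for n in names
            if abs(len(n) * CHAR_WIDTH - box_width) <= tolerance * CHAR_WIDTH]

suspects = ["Donald Trump Jr.", "Jared Kushner", "Roger Stone"]
print(candidates_by_width(13 * CHAR_WIDTH, suspects))  # → ['Jared Kushner']
```

With a candidate list as small as "people named in the report," even crude width metrics can narrow a redaction to one or two names.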


Yeah, they keep saying "they could have easily used native digital redaction," but that clearly isn't as easy as they think, because there have been numerous instances of people screwing it up.

Maybe those people used the wrong software or buggy software, but how do you know whether software is buggy? It's much easier to print and scan than to dive into the PDF file at a really low level.

They could, however, have just converted it to PNG and back to PDF digitally to keep the quality good. No need to physically print and scan it.


To be honest, the analog redaction techniques are pretty bad.

For example there's a list of names, alphabetical. Two names are redacted.

Michael Cohen, Richard Gates, [REDACTED], Roger Stone, and [REDACTED] (newline) [REDACTED]. The final redaction could fit approximately three letters.

The fact that these are analog redactions makes it really easy to tell that the two other people are Kushner and Donald Jr.

Meanwhile if they had opened the report in Word and done [REDACTED] they wouldn't have these super-basic issues.

There _is_ an advantage in that you can have more confidence that the original document is merely being redacted and not completely changed, but it's not really beyond motivated people to actually change the source doc if they wanted to.


I think the redaction just lets them say it was redacted, not that it actually hides who was on the list. Anybody paying attention to television news for the past two years would know who was on that list of names.


I wonder how different it is to carry a USB stick from SCIF to SCIF compared to just moving paper.

I doubt that the top-secret counterintelligence information in the report can be redacted in normal office space. The software installed in a SCIF may be highly limited and out of date.


SCIFs generally have some sort of TS/SCI network connectivity, so the appropriate solution would be to just use it. But every agency that has SCIFs wants its own network, because it would be disastrous if TLA #1 could see TLA #2's cafeteria menus. Congress, the White House, all seventeen IC agencies, and every customer agency has at least one, and people on one don't necessarily have the access they need to others (or the cafeteria menus would be visible, and we can't have that). Given sufficiently pathological connectivity, it can be easier to just have someone courier a DVD.

There's absolutely no legitimate reason for a computer in a SCIF to have outdated software. Data diodes exist, and there is no technical obstacle to setting up a mirror of whatever package repository you like. (Political obstacles may be non-trivial, because compliance is far more important than security, and too many members of upper management believe that classified networks are somehow magically secure.)


Virtually every government office I've been in over the past five years that deals with any classified information has disallowed connecting flash drives to government machines, and it's pretty strictly enforced, with reminder posters all over.


This document is pretty insightful for transfer procedures: https://www.gsa.gov/directives-library/procedures-for-the-us...


The gold standard for redaction is to replace the redacted text with nonsense of similar length. Otherwise, you retain precise metrics for the redacted text, and with guesses or context you can fill it in with high probability. In an extreme case, the redaction might be the name of, say, a member of Congress, so the candidates can be narrowed down to a tiny number. There's considerable work on this, notably the "Declassification Engine" [0]. I believe it would be possible to apply even better language models (such as GPT-2) to improve the results further.
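The "nonsense of similar length" idea can be sketched in a few lines. Everything here (the span format, the ±3-character jitter) is an illustrative assumption, not how any real redaction tool works:

```python
import random
import string

def redact(text, spans, rng=None):
    """Replace each (start, end) span with uppercase nonsense whose
    length is jittered by up to 3 characters, so the redaction leaks
    neither the content nor its exact length."""
    rng = rng or random.Random()
    out, last = [], 0
    for start, end in sorted(spans):
        out.append(text[last:start])
        fake_len = max(1, (end - start) + rng.randint(-3, 3))
        out.append("".join(rng.choice(string.ascii_uppercase)
                           for _ in range(fake_len)))
        last = end
    out.append(text[last:])
    return "".join(out)

print(redact("met with Jared Kushner on Tuesday", [(9, 22)]))
```

Note the trade-off raised further down the thread: jittering the length destroys the exact metrics of the original, which also makes a later unredacted release harder to authenticate against the redacted one.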

I'm interested in whether this document was redacted in such a military-secure way, or whether black bars were simply placed over the text. I've reached out to a couple of news organizations offering my consulting help, but didn't get a nibble.

[0]: https://www.newyorker.com/tech/annals-of-technology/the-decl...


> With guesses or context you can fill those in with high probability. In an extreme case, the redaction might be a name of let's say a member of Congress

Something like this actually happened in the report. A name fell at the end of a line, causing part of it to break onto the next line. The few black characters on the second line led readers to theorise that the name ends with "Jr."


Maybe Mueller figured out who killed JR?


Could a recurrent neural network predict what the redacted text was, similar to how models trained on newspapers and Wikipedia can complete sentences from their openings? Probably not as useful as the work above, but it would be entertaining, like Google's DeepDream was.


What a beautifully geeky article.

I was wondering whether they'd also want to print-then-scan the document out of fear that redactions could be undone somehow. I remember stories about how older revisions of documents were retained in the metadata (or other not-immediately-visible data) of documents, at the very least in Microsoft Office files.


Yes, that's basically it. (Source: I've previously worked for the government.)


One of the major shortcomings of this simple redaction method, not mentioned in the article, is actually a security one.

This is less of an issue with the Mueller report, but it was quite noticeable with the Snowden leaks. If your redactions cover short words or groups of words, you can sometimes make pretty good guesses about what was redacted. There are all kinds of methods, statistical and otherwise, you can employ to aid with this.

A proper redaction tool could avoid this by varying the length of the redactions. PDFs even have layout hints now, so it’s not that complicated technically.

The copy-and-scan security measure can be replicated by a flattening pass that does the same thing, without losing any of the accessibility benefits.


The problem with that approach is that varying the length of the redacted text would effectively alter the document and forever muddy the waters as to what the original text could be. This would likely result in endless speculation that what might be released at a later date is not, in fact, the original text. It also makes it much more difficult to assess whether or not fighting to get one or more parts of redacted text released is worthwhile. (i.e. what appears to be a name or phrase in one area might not be deemed important vs a page and a half in another or vice versa)

So let's say Congress takes this to court and gets an order requiring the A.G. to disclose the contents of one or more sections/types of redactions. There would likely be little confidence that any future unredacted text they receive was the original text rather than yet another modified version of it (i.e. maybe one or more words would be added/removed etc.) By disclosing the exact length of the text, as imperfect a solution as that is, it makes it likely that any future alterations can be more easily detected.


I remember someone posted about this on Reddit (I can't find it right now), but if you look at the redactions, there is a part that is a list of names. The ones who have been convicted are not redacted, but the last one is, and the redaction looks like XXXXXXXXXXXX XX, which makes it pretty obvious that the name is Donald Trump Jr. =)


> In releasing the redacted PDF of the report to the public, Barr avoids suspicion that the document had been edited (changed) in addition to straightforward redactions. PDF serves the need to unambiguously assure the press and the public that they are seeing Mueller's actual report.

This is such a shame. The PDF wasn't even signed.

Is an HTML file with a hash provided directly by Mueller/whoever is doing the redacting too much to ask?

> There's really no model for redaction of HTML-based web content.

Is replacing censored text with a fixed number of some character not good enough?
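Publishing a digest alongside the released file is trivial; the filename below is made up. Note that a bare hash only proves later copies match the officially released bytes, not that the release is a faithful redaction of the original; that still needs a signature from the author:

```python
import hashlib

def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# print(sha256_of("mueller-report.pdf"))  # hypothetical filename
```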


With either a PDF file or an HTML file, there's no way to prove that the document is a redacted version of the original, rather than an edited-and-redacted version of the original. I also don't see what using PDF adds here.


Anyone with the original document can eyeball individual pages and confirm that they (or black bars of the right size) appear in the redacted version. Doing that with unpaginated HTML requires you to use and understand some kind of diff tool, and if there are any significant differences (like a whole paragraph that disappeared without a trace), they can be plausibly blamed on 'technical difficulties.'
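For what it's worth, the diff tool a reader would need isn't exotic; Python's stdlib difflib does it (the names here are illustrative). The social problem remains, though: a flagged deletion can still be blamed on "technical difficulties."

```python
import difflib

original = ["Michael Cohen", "Jared Kushner", "Roger Stone"]
redacted = ["Michael Cohen", "[REDACTED]", "Roger Stone"]

# unified_diff makes a silently dropped or altered line immediately visible
for line in difflib.unified_diff(original, redacted,
                                 fromfile="original", tofile="redacted",
                                 lineterm=""):
    print(line)
```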


I think it's because many people think you can't edit PDFs, but they know you can edit text documents.


Even if you really couldn't edit PDFs, there's nothing preventing you from recreating the document and exporting that as a PDF. After all, it's all text.


A proper standard for redacted documents would be a format where the author provides a list of ((text, position), signature) entries and the redactor is free to remove elements of this list when redistributing it. (Edit: to show the deletions, the author would also have to provide a signed background shape with the text entries already cut out.)


Printers add small invisible yellow dots to pages to make them traceable. Is this a viable way to find out where these pages were printed?


Most of them are probably lost due to the heavy compression applied... but extracting yellow dots from scanned documents is possible in theory. E.g. it was most likely the reason why Reality Winner got arrested so quickly after she gave NSA documents to the press: https://en.wikipedia.org/wiki/Reality_Winner
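Picking candidate tracking dots out of an RGB scan is conceptually simple. The thresholds below are guesses for illustration; real machine-identification dots are extremely faint yellow and usually need blue-channel isolation and contrast enhancement to see at all:

```python
def yellow_dot_coords(pixels, width):
    """Given a flat row-major list of (r, g, b) pixels, return (x, y)
    coordinates of yellow-ish pixels: strong red and green, weak blue.
    Thresholds are illustrative, not tuned for real tracking dots."""
    return [(i % width, i // width)
            for i, (r, g, b) in enumerate(pixels)
            if r > 200 and g > 200 and b < 120]

# 2x2 test image with one yellow pixel at (1, 0)
img = [(255, 255, 255), (250, 240, 60),
       (255, 255, 255), (0, 0, 0)]
print(yellow_dot_coords(img, width=2))  # → [(1, 0)]
```

As the parent says, lossy compression and binarization in the released scan would likely have destroyed any dot pattern anyway.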


It seems AI could be used to reconstruct portions of the redacted sections of the document. There are already plausible human reconstructions proposed in this thread...



