So if I understand correctly, this attacks HTTP compression instead of TLS compression - which is what makes it different from the previous CRIME attack.
This means there should be several important practical differences:
1) Things like cookies and other authorization headers are safe. They are not compressed according to the standard and cannot be sniffed in this attack.
2) Components of the URI are also safe.
However, it seems like anyone who uses form-encoded CSRF tokens might be in for a bad time. In fact, any guessable data in a request body is not safe (phone numbers, credit card numbers, SSNs, email addresses, etc). Also, I'm not sure what the implications are for SPDY/HTTP 2.0, which IIRC compresses headers by default as well.
It's the same attack. CRIME works against TLS compression, SPDY header compression (yes, it broke HTTP/2.0 even before it was called HTTP/2.0), and HTTP gzip responses. BREACH was described on slide 39 of our presentation. We didn't test it because it's an application-dependent attack, and we couldn't find a nice target. We're happy that the BREACH authors have found a very good target, and proved that the attack works.
CRIME is one of the great crypto bug classes to be found in the last 10 years or so; it's a side-channel leak based not on timing, error handling, or power consumption, but on pure traffic analysis. Seriously excellent work.
Importantly, and I think this is a point Juliano and Thai undersold, the big issue with CRIME wasn't simply that it impacted TLS, but that it implicated an entire cryptographic implementation technique (one that, I'd add, Applied Cryptography recommends). The immediate feeling I got when I learned about CRIME was "this is going to take out a lot of systems"; it's a serious threat anywhere you have chosen plaintext and compression.
It's not surprising that there are other scenarios in TLS where CRIME works, but the big thing to learn from today's talk is that you should go out and look for other places that compress before encrypting.
It's possible, but it would require the compressor to be aware of locations of 'secret' data in its input.
More details:
The common compression algorithm used by TLS and gzip is DEFLATE: an LZ77 transformation combined with Huffman coding. LZ77 turns a stream of text into a sequence of literal instructions (output "abcd") and copy instructions (output 5 bytes from 10 bytes back). Huffman coding lets you encode an alphabet with varying token probabilities using bit strings of different lengths -- think Morse code, where a common letter like E is one dot, but an uncommon letter like Z is dash-dash-dot-dot.
CRIME detects the different ciphertext sizes that result when encoding a secret as literal data versus as a copy instruction.
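You can see that size difference with a few lines of Python -- zlib implements DEFLATE. The page layout, the "token=" secret, and the reflected "search" field here are all made up for illustration:

```python
import zlib

SECRET = "token=d8f3a91b"  # stand-in secret embedded in every response

def page(reflected: str) -> bytes:
    # hypothetical endpoint that echoes a query parameter into the HTML
    return ("<html><p>" + SECRET + "</p><p>search: " + reflected + "</p></html>").encode()

def size(guess: str) -> int:
    # the attacker only ever sees this: the length of the compressed page
    return len(zlib.compress(page(guess)))

right = size("token=d8f3a91b")   # matches the secret: coded as a short copy instruction
wrong = size("token=qwerzxcv")   # suffix has no match: coded as ~8 extra literals
assert right < wrong
```

Both plaintexts are exactly the same length; only the compressed (and hence encrypted) lengths differ, which is the whole side channel.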
There are a few tweaks that could be effective. Assuming the secret data is identified, it can be forced to be coded as part of a literal -- not replaced by a copy instruction, and not used as reference data for any future copy instructions.
This could still leak some data, since a secret with more common letters might be Huffman coded to a shorter bitstring and detected. Luckily, DEFLATE has a way to indicate a "raw" block, which is just the literal data: 8 bytes of secret data would always take 8 + 5 (block header length) bytes to encode, and wouldn't be referenced by any future copy instructions.
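For what it's worth, you can see the content-independence of stored blocks with zlib: compression level 0 forces DEFLATE "stored" blocks, so the output length depends only on the input length, never on its content:

```python
import zlib

# Level 0 emits a DEFLATE "stored" block: no copy instructions, no Huffman
# coding. The output is just the 2-byte zlib header, a 5-byte stored-block
# header, the raw data, and a 4-byte Adler-32 checksum -- same length for
# any 64-byte input.
a = zlib.compress(b"A" * 64, 0)          # highly repetitive input
b = zlib.compress(bytes(range(64)), 0)   # input with no repetition at all
assert len(a) == len(b)
```

A regular compressor would shrink the first input dramatically and the second not at all; the stored block leaks nothing either way.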
How should a web app indicate to potential upstream compressors that some portion of its output is secret? A special HTML tag? A new HTTP Secret-Ranges header?
Sure, you can manually isolate secret information from attacker information. That wasn't really what I was looking for.
I was referring to a general compression scheme that could be applied pre-encryption and not leak info when combined with partial plaintext oracle attacks. I don't think such a thing is possible, but it would be awesome if some really smart researchers could prove me wrong.
Most compressors are adaptive: they adjust their state depending on previous inputs. A static compressor wouldn't leak anything-- for HTML, it might have preset dictionaries for common tags and attributes (instead of <div>, output a single token), and have a static Huffman encoding tuned to "average" HTML pages. The compression wouldn't be nearly as good as DEFLATE since it has to compress each chunk in isolation, but it would still beat plaintext.
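A toy sketch of such a static scheme -- the dictionary and the reserved token bytes are invented for illustration, and escaping of those reserved bytes in the input is ignored for brevity:

```python
# Fixed dictionary of common HTML substrings, each replaced by a single
# reserved byte. Nothing about the mapping adapts to the input, so the
# compressed size of one chunk can never depend on another chunk's content.
TABLE = {b"<div>": b"\x01", b"</div>": b"\x02", b"class=": b"\x03", b"href=": b"\x04"}

def compress_static(data: bytes) -> bytes:
    for phrase, token in TABLE.items():
        data = data.replace(phrase, token)
    return data
```

It still shrinks attacker-supplied HTML that happens to contain dictionary phrases, but that tells the attacker nothing about the secret, because no cross-references between attacker data and secret data ever occur.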
My thought exactly: Is it now very dangerous to combine compression with encryption? Is there a general mitigation strategy or are the two mutually exclusive?
If we had fixed-ratio compression, the attack would be impossible.
You could attempt to simulate that: after compression, add random padding so that it looks like a fixed 2:1 compression ratio.
This is wasteful if the data compressed better than 2:1. And if the data doesn't compress as well as 2:1, I'm not sure what you would do. (Maybe do no compression. But now you've leaked a tiny bit of info...)
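A sketch of what that might look like -- the framing is hand-waved (a real implementation would need to carry the compressed length so the receiver can strip the padding), and the leading tag byte is an invented convention:

```python
import os
import zlib

def compress_fixed_ratio(data: bytes) -> bytes:
    # Pretend everything compresses at exactly 2:1.
    target = (len(data) + 1) // 2
    body = zlib.compress(data)
    if len(body) > target:
        # Didn't reach 2:1: fall back to no compression.
        # (This branch itself leaks one bit about the data.)
        return b"\x00" + data
    # Pad with random bytes up to the fixed target size -- wasteful
    # whenever the data compressed much better than 2:1.
    return b"\x01" + body + os.urandom(target - len(body))
```

For any input that reaches 2:1, the output length is a function of the input length alone, which is exactly what kills the size oracle.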
Seems to me that if a compression scheme simply introduced a random size adjustment with each compression, it could defeat the subtle size measurements necessary for this attack to work. Might not even have to be very large random adjustments.
Suppose the target web server has an endpoint /foo?probeMe=bar
such that the HTTPS response will include 'bar' in the HTML. (Quite an assumption, sure.)
Suppose the target web server compresses its responses.
Suppose the attacker can make requests to the target web server, on behalf of the target user (e.g. when the target user is on an attacker-controlled webpage, and the attacker can make AJAX requests to the target web server).
If the HTTP response already contains 'bar' and doesn't contain 'cbs', then the response to /foo?probeMe=bar will be shorter than the response to /foo?probeMe=cbs, since compression will deduplicate 'bar'.
Using this, the attacker is able to mount an oracle attack. That is, if they know something of the form *@gmail.com , and they want to know the whole email address, they can make 26 probes, with probeMe set to:
a@gmail.com, b@gmail.com, ..., z@gmail.com
and whichever probe produces the shortest response is the one that appears in the response.
Suppose the shortest is the probe for probeMe=y@gmail.com . They try another letter:
ay@gmail.com, by@gmail.com, ..., zy@gmail.com . Again, one probe will have a shorter response than the rest.
They continue, until they find larry@gmail.com .
Now they know larry@gmail.com appears in the response. Success!
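That whole probe loop can be simulated. Here's a sketch that uses a toy greedy LZ77 token count instead of real DEFLATE output bytes, so the one-token saving for a correct guess is deterministic rather than subject to Huffman bit-packing; the page contents and the probeMe reflection are made up:

```python
import string

def lz77_tokens(data: bytes, min_match: int = 3) -> int:
    """Greedy LZ77 parse: count output tokens (literals + copy instructions)."""
    i, tokens = 0, 0
    while i < len(data):
        best = 0
        for j in range(i):  # longest match starting at i into the seen prefix
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            best = max(best, k)
        i += best if best >= min_match else 1
        tokens += 1
    return tokens

def response_cost(probe: str) -> int:
    # hypothetical page: embeds the secret and reflects the probe
    page = b"<html>contact: larry@gmail.com</html><p>probeMe=" + probe.encode() + b"</p>"
    return lz77_tokens(page)

known = "@gmail.com"
while True:
    baseline = response_cost("~" + known)  # '~' never matches anything in the page
    sizes = {c: response_cost(c + known) for c in string.ascii_lowercase}
    best = min(sizes, key=sizes.get)
    if sizes[best] >= baseline:            # no guess beat the junk probe: done
        break
    known = best + known                   # correct char merges into one copy instruction
print(known)   # recovers "larry@gmail.com"
```

Each correct character saves exactly one literal (it gets absorbed into a longer copy instruction), which is the same signal BREACH reads out of ciphertext lengths.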
It seems that having the server add <!-- a small random string of random length --> to the HTML before compression would be useful protection against this attack. The length of the response would always be different, and that difference couldn't be guessed by the attacker.
Like adding jitter to try and disrupt timing attacks, this increases the cost to the attacker but does not actually prevent the attack. If you send a@gmail.com a couple of times, you will get an average length. No matter how much noise, send it enough times and you can get statistical confidence in the average length.
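A quick simulation of that averaging -- the page, secret, and comment-padding scheme are all made up, and the fixed seed is just for reproducibility. Individual samples overlap heavily, but with a few hundred samples per guess the means separate cleanly:

```python
import random
import statistics
import zlib

random.seed(1)  # deterministic demo
PAGE = b"<html><p>token=d8f3a91b</p><p>search: %s</p></html>"

def noisy_len(guess: bytes) -> int:
    # hypothetical server: appends a random-length random comment, then gzips
    pad = bytes(random.randrange(33, 127) for _ in range(random.randrange(0, 32)))
    return len(zlib.compress(PAGE % guess + b"<!--" + pad + b"-->"))

def mean_len(guess: bytes, n: int = 500) -> float:
    return statistics.mean(noisy_len(guess) for _ in range(n))

# One sample tells you little; the averages still rank correctly.
right = mean_len(b"token=d8f3a91b")
wrong = mean_len(b"token=qwerzxcv")
assert right < wrong
```

The padding only divides the per-guess signal by the noise's standard deviation; it never removes it, so the attacker just pays more requests.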
There's also the issue that adding random incompressible junk to your packets sort of defeats the point of compressing them in the first place.
I don't know if the browsers would tolerate this, but it should be possible to pad the compressed payload, rather than the source document.
A DEFLATE stream [0] is made of blocks whose first bit determines whether it is the last one.
One could add a number of random bytes after the last block such that the payload always has the same length, or is a multiple of a reasonably large number of bytes.
This assumes that the decoder will ignore whatever follows the last block.
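With zlib's streaming decoder that assumption does hold: anything after the final block lands in unused_data. A sketch of padding the compressed payload out to a fixed boundary (the 256-byte boundary is an arbitrary choice):

```python
import os
import zlib

def pad_deflate(data: bytes, boundary: int = 256) -> bytes:
    # Raw DEFLATE stream (no zlib wrapper), padded with random bytes
    # *after* the final block up to a multiple of `boundary`.
    co = zlib.compressobj(wbits=-15)
    body = co.compress(data) + co.flush()
    return body + os.urandom(-len(body) % boundary)

def unpad_inflate(payload: bytes) -> bytes:
    d = zlib.decompressobj(wbits=-15)
    out = d.decompress(payload)   # stops at the end of the final block
    # d.unused_data now holds the padding; we simply ignore it
    return out
```

Since the padding sits outside the DEFLATE stream, it stays incompressible by construction and the source document is untouched.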
Another, cleaner option: use a Trailer HTTP header with chunked transfer encoding [1].
Indeed. You could do what CRIME does to TLS (I just discovered these attacks).
You could also add a random amount of random padding. It would slow down the attack linearly if the random amount is taken from a uniform distribution.
I wonder if it would be possible to make it slower by taking another distribution.
This isn't going to be useful to malware writers. People who cannot install malware on a machine may be able to use this to extract sensitive information.
The hackiest fix for this is to randomize the output of your content, or employ random padding. If the compression payload size changes every time (and if you can no longer assume the structure of the payload) you can't effectively determine if a guess was right or wrong.
Random padding increases the number of requests necessary by a couple orders of magnitude. It would be easy enough to filter out the noise if you tried each guess enough times.
It would certainly increase the number of requests necessary to a quantity that (for the vast majority of cases) would be well above any sane usage, and could easily be rate limited.
Credit where it's due, StavrosK and I were discussing this earlier, and averaging out the noise was his idea.
What about deterministic padding? As a (badly thought through) example: Hash the plaintext, then use the first couple of bits of that result as your padding size.
This should counteract averaging of requests. On the other hand, an attacker might work around this by adding a unique token in addition to their guess...
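A sketch of that deterministic-padding idea -- SHA-256 and the 64-byte cap are arbitrary choices, not a vetted scheme:

```python
import hashlib
import zlib

def compress_det_padded(data: bytes, max_pad: int = 64) -> bytes:
    # Pad length is a deterministic function of the plaintext, so resending
    # the identical request yields the identical length: nothing to average.
    pad_len = hashlib.sha256(data).digest()[0] % max_pad
    return zlib.compress(data) + b"\x00" * pad_len
```

The caveat from the thread applies, though: an attacker who can inject a unique token into each request changes the hash every time, reintroducing fresh "randomness" they can average over.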
Yeah, the important part is to randomize the output itself, in addition to padding. Even without the padding it makes for a significant challenge to determine just where your known plaintext is to compare from. But with enough attempts, if deflate eventually succeeds in shortening a message on a successful match, you could eventually find the matching message. Like you say, probably too long to be effective (depending on the number of random permutations and the size of the secret)
That sounds like a very quick and easy fix. You would need to include a random string of random length in the response before compression. If it weren't random, the 'padding' itself would just get compressed.
On an HTML page this could just be an automatically inserted comment after the /body tag. Although there may be an attack against it if it is in a predictable location in the compressed output. Maybe it should be inserted somewhere random in the HTML?
The way you implement it will be subject to a myriad of application-specific statistical analysis. To my mind, the best way to look at it is like a jigsaw puzzle.
If you just took out or added a couple pieces, the whole of the jigsaw remains the same, and it can be trivial to determine what the image is. But shuffle all the pieces into random positions each time and the picture is impossible to decipher.
Consider an HTML text that was 500KB, and included a social security number. You add some random padding in a couple places, but not much. There's still 499KB that the attacker probably already knows; he can probably filter out this familiar ciphertext until he's down to the ~1KB of ciphertext he's not positive about, and start drilling down from there. Don't put it past a clever hacker to find more techniques to discover the secret.
Randomize the whole thing as much as you can, and pray. (Or disable gzip compression for sensitive content)
Did anyone catch how they can insert things in your plaintext so it passes through the compression algorithm? It sounds like it can only guess things that are in the POST request to the server, and only if they can write to it but cannot read it.
I was confused about that too. They don't make the request, they force you to make the request (by putting a bunch of them as pixels in an HTML email, or by getting you to trigger some JS). I'm still not clear on how they see the traffic -- I guess the attack relies on having a packet sniffer that can see the encrypted packets go by?
Jim "Doctor Doom" Brosowski, an extraordinary hacker and renowned flip-cup player, sits in a Starbucks sipping his grande caramel macchiato. His laptop is open to a Youtube clip of cats dancing to "Stayin' Alive", while in the background a curious text box displays an application waiting for input.
Suddenly there's text flashing in the window. A Starbucks customer has just attempted to log into Hotmail! In a microsecond, the hacker's application sends packets that inject javascript code into their browser window, instructing the browser to open an iframe.
First the javascript will query a page with the user's actual e-mail address. The hacker can't see the response over the network because it's encrypted, but he can see the size of the encrypted message.
Then the previously-injected javascript will poll the iframe repeatedly with a specific URL and part of an e-mail address. As responses are returned from a web page, they will either be bigger or smaller, depending on if the compression algorithm has seen that requested text before. So by comparing the size of the original response with the size of these guessed responses, Jim can guess what the "correct" value from the original request was, piece by piece.
Since this attack can be done "in 30 seconds", it should be practical enough to perform in an open area such as a wifi cafe. For more lengthy attacks you'd need a more dedicated place to listen to traffic and inject HTML.
Another method besides injecting HTML into their browser would be to send them some spam to click on with a malicious website. But you still need to be able to listen to their responses, so it is not an attack everyone will be able to use. (But then again, that's the only time SSL/TLS really matters: against man-in-the-middle attacks)
Thanks for the explanation. Please excuse my ignorance, but I want to understand something: the 'html injection' phase would be done by some sort of response spoofing? Also, given that I can write arbitrary javascript to their window, can't I just do something like form.onsubmit(function(){ //send credentials to my private server }); ?
Injection typically involves sniffing their connection and determining the right bits of the connection to be able to spoof your own fake response. It can be done in-between an existing connection or at the start of one.
Browsers have developed all kinds of protections to prevent different kinds of attack. In general, if your cookies have been set with the HttpOnly flag, malicious javascript can not submit someone's cookies to any site other than the original domain (as far as the browser can tell). You can't even view the cookies in javascript to be able to submit them somehow else.
But HTML forms on plaintext pages can be spoofed or injected to submit them to a fake server using SSL or to the real server without SSL, making it easy to view the form data and the server's response. If the site doesn't set the "secure" token on the cookie, existing session credentials can be viewed as well without needing to observe or force the user to log in. This is how the sslstrip program works, and why many people look to new technologies like HSTS to better protect users against these attacks.
The SSL attack in the OP could be used against a site that employed HSTS to capture data from the body of a response if sslstrip failed. If you just want cookie data, the original CRIME attack would be better as it technically works on the headers AND the body, but CRIME depends on TLS compression being enabled, while this attack uses the compression of the body (that every website uses).
>Also, given that I can write arbitrary javascript to their window, can't I just do something like form.onsubmit(function(){ //send credentials to my private server }); ?
You didn't address this question of his, it seems. I would agree that in the scenario you're explaining, it would make far more sense for an attacker to simply modify the non-HTTPS landing page in such a way that the user hands his credentials in plaintext straight to the attacker.
From what I can tell, for BREACH, you don't actually need to inject Javascript into any page belonging to the domain you want to extract secrets from; you simply need to force the user to load a hidden iframe that you control. That iframe will then make repeated GET requests to HTTPS endpoints (which does not violate the same origin policy, so this iframe can be hosted on any arbitrary domain).
Hotmail was a joke :) But almost all browsers try HTTP before they try HTTPS, which is where the attack commonly comes in (or you intercept their primary request, or sslstrip, etc). Exceptions are if they're Chrome and have a whitelisted set of URLs, or support HSTS, or explicitly specified by the user or a bookmark.
The point is, the attack is there and it works, even if it doesn't work in 100% of cases.
I think it's actually more limited than that ... they have to be able to modulate the response body through a request, which sounds really f'in difficult to do with a properly designed web application.
I'm pretty sure they need a server that responds to POST form requests with user-specified unchecked data from that form, in addition to secret data the attacker actually wants. And the more I think about it, the more I wonder how existing CSRF protections wouldn't block that already.
I really wish I could find more details on this attack; I like Ars Technica for general news but the technical details are lacking here.
Most forms preserve the data that was entered by the user when there is a validation error. Say the attacker is after a CSRF token: they could just use one of the fields in the form for entering their guess, and it would be included in the response.
You could do this more easily if the page included data from the GET query, something like a name or a search query or something like that, which gets echoed in the response.
For the attack to work you need to be able to eavesdrop on the ciphertext. Network protocols don't make any particular effort to prevent that, although the move from shared ethernet to switched ethernet has made it more difficult.
ARP spoofing works even over switched Ethernet. With it, you can convince other machines that yours is their gateway, and then you can not only see, but modify, all their network traffic.
No, I know that part, but I mean that this method can only sniff headers, pretty much, as long as they're being compressed. They own the body, since they're injecting data into it, the only thing they don't own is the headers, which is the only thing they can guess. If HTTPS compresses headers separately, or not at all, it's useless.
Yeah, so you'll have to be able to write to, but not read, the body of the request, observe the ciphertext on the wire and guess things in the body only. So this is only good for CSRF tokens and the like, and only if you can write to the plaintext you want to guess but not just outright read it.
That page suggests that guessing known text like an SSN or email address contained in the body would give them an opening to break the encryption entirely, but I don't know enough about crypto to know if that's actually a possibility.
I guess I thought they were guessing the POST data from a login form - but I see what you are saying. The POST request in URL-encoded form or multipart form data is not compressed anyway; only the response is affected by deflate. So I am curious too how a password would be in the gzip dictionary, since it should never be sent back to the browser in compressed form.
I think the idea is to find e.g. a query string parameter that gets reflected back in the response's HTML along with a target secret. The attacker can then spawn requests and monitor the size of the response.
This sounds like the most plausible theory, though it makes the attack pretty limited. It could bypass CSRF protections, which don't work with GETs.
So it sounds like an attacker needs an endpoint that contains sensitive information and puts information from a request's query parameters into the response body. I imagine tons of applications do this, but it's nowhere near as far reaching as CRIME.
In many web application frameworks, CSRF tokens are session scoped. So an attack might look like this:
1. User logs in to target web app (e.g. hapless.com).
2. User visits your malicious content (e.g. evil.com).
3. Your page does a BREACH attack, issuing tons of HTTPS GET requests (by creating images in JavaScript, maybe) to hapless.com with query string parameters that manipulate the response.
4. You observe the responses and recover the CSRF token.
5. You attack the user with a standard CSRF attack on evil.com or elsewhere.
Note that this attack requires the user to interact with attacker-controlled content via a MITM'd connection.
> I imagine tons of applications do this, but it's nowhere near as far reaching as CRIME.
This is very common in web applications. Echoing query string parameters in the server response is the basis for reflected cross-site scripting, for example. And in a complex web application, tons of pages will have CSRF tokens embedded somewhere.
I do agree that CRIME is a much more dangerous attack.
Yeah, the more I think about this, the worse it is. It's certainly not as general as the original CRIME TLS exploit, but that almost makes it more insidious.
The big problem is that there's no blanket solution to this like there was with the TLS break. Then you could just turn off TLS compression, which wasn't a huge deal. Now, turning off HTTP compression is a much bigger problem. You're going to take a huge performance hit. The alternative is auditing every route in your application to ensure that it won't leak attacker info into a response - a very daunting proposition.
They create these POST requests by tricking the user into opening a malicious web page or email that they control. From that page they use JavaScript, I presume, to send a POST to the target website and monitor the packets by eavesdropping on the encrypted data upstream from the user.
So I suspect that having JavaScript disabled by default on untrusted domains would be a decent safeguard against this.
Cute attack -- takes advantage of the deflate compression shrinking the size of repetitive strings as a way to test if a given set of bytes is in the encrypted data.
Has nothing to do with email addresses. That was just an illustration.
I think it is definitely interesting: if you can test for a string of encoded data, either form-encoded `&key={GOLD}&nextkey=` or JSON-encoded `"key":"{GOLD}","nextkey"`, you can get to almost anything.
Their other illustration, request_token, wasn't terribly useful either. That's supposed to change each time the form is generated, and meanwhile this attack takes thousands of runs against the server...