Cache Poisoned DoS Attack: Shutdown any CDN Website with One HTTP Request (cpdos.org)
283 points by ldmail on Oct 22, 2019 | 88 comments



So after the authors disclosed this issue to AWS it was fixed, and CloudFront no longer caches 400 Bad Request by default. Also, from the paper linked on the website [0]:

""" Amazon Web Services (AWS). We reported this issue to the AWSSecurity team. They confirmed the vulnerabilities on CloudFront. The AWS-Security team stopped caching error pages with the status code 400 Bad Request by default. However, they took over three months to fix our CPDoS reportings. Unfortunately, the overall disclosure process was characterized by a one-way communication. We periodically asked for the current state, without getting much information back from the AWS-Security team. They never contacted us to keep us up to date with the current process.

"""

[0] - https://cpdos.org/paper/Your_Cache_Has_Fallen__Cache_Poisone...


It seems like the unstated larger problem here (in the blog article at least; I haven't read the paper) is that cached pages are served that do not match the full HTTP request. The problem isn't just that error pages are being cached, but that certain HTTP headers that should change the page the server returns are being ignored when determining whether two requests are identical. Even if no one maliciously exploits this to get error pages cached, the cache is still breaking the website if it causes a different page to appear than would have been served directly from the server!


Actually, the origin is supposed to send a `Vary` header if it changes behavior based on any header.

So, if a client sends a 20 kB `X-Oversized-Header` and the server responds with a 400, it might be conceivable that the response should include `Vary: X-Oversized-Header`.

Is that "really" the right fix? Probably not. But the HTTP RFCs provide `Vary` for exact this kind of reason within HTTP caching: the origin is varying its response based on a subset of headers.


In RFC-world (which may not be the same as the real world), there's no need for `Vary` on a 400 response, because a 400 response isn't cacheable (unless it has a cache-control or expires header).


I feel like HTTP response codes are an attempt to squish several layers of Result monads together into one single many-valued one, where each layer of the original nested set of Results has different caching semantics.

Like, in this case, a 400 is really the origin saying that it's not even sending you a representation of the resource you requested, because you didn't make a request that can be parsed as referring to any particular resource. It's a Left RequestFormatError instead of a Right CachableResourceRepresentation.

And, annoyingly, the codes don't line up with which layer of the result went bad. 4XX is "client error", sure; but 404 isn't really an "error" at all, but an eminently cacheable representation of the non-existence of a resource.

It'd be neat to see the HTTP codes rearranged into layers by what caching semantics they require of the implementing UA, such that UAs could just attach behavior to status-code ranges. Maybe in HTTP/4?


Well, HTTP was probably not designed with caching in mind. You certainly know that the first digit of an HTTP response code indicates a grouping, for example:

    4xx (Client Error): The request contains bad syntax or cannot be fulfilled
I would argue that, since this is client-specific, 4xx responses should not be cached at a proxy/CDN at all, since the proxy is not the client. And even the client should not cache a 404; the resource could be created the next moment.


HTTP was very much designed to work well with caches.


Could be, but there’s a reason to want cachable, proxyable errors here: they’re often very expensive for the origin.


Actually, handling an error should be cheaper than handling a proper request, simply because an error most likely means an early exit in the handling server -- which means less time-to-answer, i.e. cheaper. (This does not cover any kind of DoS attack, which is always difficult to handle, regardless of whether the answer is an error or not.)

However, effectively we agree with derefr: the HTTP status code design did not have this peculiarity of cacheable vs. non-cacheable errors in mind. This is definitely a shortcoming.


The unstated larger problem is that HTTP is a content-delivery protocol that is (ab)used to also serve as an inter-process messaging protocol.

If the messaging stuff were factored out, or moved to be the primary protocol (with content delivery implemented on top), a lot of the issues we have with security, latency and caching would just disappear. And no, WebSockets (the way they work right now) are not going to solve this. Neither will QUIC, aka HTTP/3.


What would be the solution then?


Don't rely on HTTP codes for anything other than content-related status. The idea of REST (using HTTP codes and actions), in practice, has almost universally required customization/workarounds. I've never understood the insistence on REST as an API, full stop. It's just one aspect of an API which you necessarily have to augment to leverage within a more robust protocol.


Maybe the cache's upstream requests should be normalized with the same routine used to normalize the request into the key that gets looked up in the cache.

Basically the problem stems from the cache implementation not being DRY.
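
Something like this, as a rough sketch of that idea (all names and the header whitelist are made up for illustration, not taken from any particular cache implementation):

    # Hypothetical sketch: derive both the cache key and the upstream request
    # from one normalization routine, so the two can't drift out of sync.
    KEYED_HEADERS = {"host", "accept-encoding"}  # assumed whitelist

    def normalize(method, path, headers):
        kept = {k.lower(): v for k, v in headers.items()
                if k.lower() in KEYED_HEADERS}
        return method.upper(), path, tuple(sorted(kept.items()))

    def cache_key(method, path, headers):
        return normalize(method, path, headers)

    def upstream_request(method, path, headers):
        # Forward only the normalized headers: anything that isn't part of the
        # key (e.g. an oversized X- header) never reaches the origin.
        m, p, kept = normalize(method, path, headers)
        return m, p, dict(kept)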


> Many intermediate systems such as proxies, load balancers, caches, and firewalls, however, do only support GET and POST. This means that HTTP requests with DELETE and PUT are simply blocked. To circumvent this restriction many REST-based APIs or web frameworks such as the Play Framework 1, provide headers such as X-HTTP-Method-Override, X-HTTP-Method or X-Method-Override for tunnel blocked HTTP methods

This is so f*ing scary. Who in their right mind invents such crazy tricks that absolutely circumvent all we know about web API security, implements them in frameworks, and leaves them enabled by default?
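
For anyone who hasn't run into it, the tunneling described in the quote looks roughly like this (a sketch using the Python `requests` library and a made-up URL; whether the override is honored depends entirely on the framework and its configuration):

    import requests

    # A POST that asks the framework to dispatch it as a DELETE, for the benefit
    # of intermediaries that only let GET and POST through.
    requests.post("https://api.example.com/items/42",
                  headers={"X-HTTP-Method-Override": "DELETE"})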


People who have to deal with end users behind overly restrictive corporate firewalls. After you get a few hundred bad reviews telling you your website is broken, it looks very tempting to just fix it on your end.


They should at least put appropriate Vary header in the response, or make it non-cacheable altogether (you probably don't want to cache DELETE or POST anyway). Whoever implemented that functionality and didn't include that screwed up.


Maybe don't fix it and just accept that organization X doesn't want its users to access your resource?

Seems more reasonable than trying to circumvent a legitimate restriction. Not every block is an adversary that needs defeating.


Hah! I see that you've never been on the customer side of these issues. Try explaining "we can't fix cuz it's not our problem" or "we can't fix cuz our moral compass is just and we shouldn't do things that compromise security over usability"


How about: "It appears our services are blocked by some sort of company firewall or security policy on your end. This is an issue that you should bring up with your IT department. If you can advocate to get them to lift the block against our platform, we can offer you XXX"


Guess you haven't worked with any large companies... They're always the ones that cause these kinds of problems, not the small flexible ones. Those large companies are also the ones that will pay the most to solve their problems via external services (both due to number of users and how slow it would be to deal with internally).

"That is managed by X department, we can put in a change request and we'll probably get a timeline for the fix at the beginning of next quarter. We'll re-evaluate if you can help with our issue once they get back to us."

... and you'll never hear from them again.


You have to balance it against expecting to receive: "That's funny, because we tried a similar product from Slightly Unjust Moral Compasses, LLC and it worked fine. We'll just switch to them. Goodbye."


What is the legitimate restriction? You'll let "users" POST but not PUT? In what Mordacian fever-dream is that reasonable?


The PUT and DELETE methods can't be used with HTML forms.

https://softwareengineering.stackexchange.com/q/114156


Forms aren't the only sources of HTTP requests.


Guess what? Browsers still work this way.



Businesses, in fact, like money.


Right, and mine likes productivity. I'm sure the business that's expending resources to try and circumvent my firewalls has a firewall and a usage policy that their employees adhere to as well.

So if you circumvent my original blocks, as an administrator, I will use my home-field advantage to just start targeting your individual users and remote resources. Suddenly your users can't even get on Google. Then they'll have to come to me for "the talk."

So the decision quickly becomes a) circumvent enough firewalls that you blacklist yourself or b) get all your users blacklisted for using your service.


You're assuming those blocks are there for a reason. But blocking PUT makes no sense. Getting blacklisted here requires a very special mix of incompetence and pettiness that is thankfully rare.


First of all, when using HTTPS, isn't the method encrypted? That should prevent proxies from messing with things and make this less of an issue over time, I'd imagine.

But more to your point, one consideration is CORS performance. If you want to allow non-"simple" requests and still avoid CORS preflight requests, you pretty much have to tunnel everything through POSTs and handle cross-origin security on the server side. Yes, this is bad for security, but the spec folks haven't given us a performant alternative that I'm aware of.


> when using HTTPS isn't the method encrypted?

We're talking about caching proxies run by sites. These terminate SSL and work in the clear, because their whole job is to take the response one user got and efficiently distribute it to all the other users who want the same thing.


It's only "dangerous" because the cache server does not care about headers. I don't see how this could affect security without broken cache server in place. If you are using the framework you should know how the dispatching works. I agree it could be disabled by default tough.

It's just information for the routing system to dispatch to a different method on a controller. You could implement a way to use a query string to pass this override too. An API framework does not have to be RESTful. It could work with POST requests only and simulate deletes with something like ?method=delete.

Edit: I saw a GitHub comment that actually says it is possible to use a query string to override the method in Play 1.

https://github.com/playframework/play1/issues/1300#issuecomm...

> We've found that although the header is disabled, its still possible to use X-HTTP-Method-Override by passing as a query string


This is why I've struggled to understand why CORS is implemented the way it is. It's so easy for developers to circumvent, but the solution is always hacky and/or semantically diluted, so you're worse off in the end.


Some platforms have really terrible HTTP apis. Your choices usually end up as:

a) don't support that platform and lose out on that audience

b) embed your own HTTP library with your client

c) do everything with POST and always return HTTP 200, because every http library supports that.

Or I guess

d) use (c), but with headers to pretend that you're seeing an action other than POST.


Unity's built in web requests were like that a few years ago, they only supported GET (with no payload) or POST, so I and many other game platform developers had to implement (d). Most famously, Parse did it too.


It's been my job since 2005 to evaluate deployments of applications like these and I'm not sure I can call to mind any application I've ever seen that would have been insecure but for a perimeter filter on methods. URL-space, yes; I've seen things where "/admin" was perimeter-filtered. Methods, no: I've never seen an application where "PUT" was allowed on the internal network but perimeter-filtered so it wasn't allowed on the public network.

It's clearly happened, though; you can search for vulnerabilities tied to it. I just don't think it's all that common. Someone got a bounty from Google for a cloud.google.com service that used an internal proxy where XMO got them control over a PUT.


That doesn't make sense. Why would an ALB only support GET and POST and not the entire REST protocol?


Because it's old or has a whitelisted set of verbs that hasn't been updated since 1999.


Also, a firewall usually operates at layer 4, not 7.


You are right that many firewalls operate at layer 4, but there are also Web Application Firewalls like ModSecurity that inspect HTTP messages at layer 7.


It's pretty surprising to see an issue like this. I've spent some time in the past tuning HTTP caches/CDNs. One key takeaway I recall is the importance of your "cache key": that is, which fields matter when deciding whether requests match. If the cache keys of two requests match, the requests are considered identical and the desired behavior is to serve from cache.

Obviously, things like the request method, path and Host header matter a lot. Perhaps if you're A/B testing, the A/B cookie would make sense as part of the cache key too.

This seems like a simple misconfiguration at a very critical location. But an 'exploit' that warrants its own domain? Hardly. This is a promotion for the authors and their upcoming presentation. It's a very nice gotcha.


This is a very easy mistake to make: you configure your cache keys in one place, and process requests in another. With nothing linking them it's not at all surprising for them to be out of sync.

As for whether it warrants its own domain: domains are cheap, and publicity gets people to pay attention to problems and fix them.


A sane CDN doesn't cache error responses. There is no legitimate reason to cache a non 2xx/3xx response, unless you / your CDN is really pinching pennies.

Caching a 5xx kinda makes sense I guess, but a 4xx client error? That's nuts.


Caching a 404 can make sense, however.

For example, every browser requests /favicon.ico by convention. If you don't have one, you wouldn't want every single request reaching your backend.


Meh, it's subjective but I'd argue that's a bad idea. A 404 isn't exactly a heavy request; your origin should be able to handle plenty of those. The whole point of CDNs is to serve fat assets (images, videos, js libraries sometimes) not patch holes in your site or bandaid your raspberry-pi origin. Caching error responses, including 404s, can lead to all sorts of trouble and isn't worth it IMO.


Actually, generating heavy-to-process 404s is a fairly common attack for some kinds of sites.


Could you maybe provide a practical example of such a "heavy to process" 404? I'm having difficulty imagining how a "Not Found" could (or ever should) involve heavy processing. AFAIK, a 404 should only be given for a non-existing resource (e.g. a file). That should be straightforward enough. Granted, I've seen some horrible semantic resource pointers/paths in URIs over the years, some of which required processing, and some of which generated 404s. However, such contortions are mostly just a testament to horribly bad design. If the request involves processing on the server side (higher up than the web server or cache itself), resulting in a conclusion that a resource is unavailable, should that not return a 5xx response?


"We couldn't find /<requested-page>, were you maybe looking for one of these similar pages? <list>" is something that I've seen before


The origin server might be set up to serve everything through cgi. That would mean loading the entire framework just to spit out a 404 for the favicon. If all the other pages cache just fine, you could hit 100% load on favicons when you'd otherwise be at 10%.

It's not an ideal design but I wouldn't call it 'horrible'.


What is CGI in this context? I’ve only heard of it in a movie special effects context.


The web server having to run an external program to handle the request, typically because it's written in a scripting language.

Technically it stands for "common gateway interface" but that doesn't come up very much.



Apparently then many major CDNs (listed in the article table "CPDoS vulnerability overview") are not sane.


Based on their comparison chart it's literally just CloudFront (plus some edge cases if you're running IIS or ASP.NET on your origin with certain providers).


Yeah, in fact the CDN should either not cache error pages at all or cache them separately from HTTP 200 responses.


CDNs could whitelist HTTP header keys, constrain HTTP header values, and regenerate a full, clean set of headers to send to the web server, as part of their security services.

CDNs already parse headers, and CDN developers have the knowledge and know-how to correctly constrain header values.
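
A sketch of what that constraint step could look like (my own illustration with assumed limits and whitelist, not any CDN's actual filter):

    import re

    ALLOWED_HEADERS = {"host", "user-agent", "accept", "accept-encoding"}
    MAX_VALUE_LEN = 4096                         # assumed limit
    SAFE_VALUE = re.compile(r"^[\x20-\x7e]*$")   # printable ASCII only

    def clean_headers(headers):
        # Rebuild the header set from scratch: whitelist keys, cap value length,
        # and drop anything containing control characters (no CR/LF smuggling).
        cleaned = {}
        for key, value in headers.items():
            k = key.lower()
            if (k in ALLOWED_HEADERS
                    and len(value) <= MAX_VALUE_LEN
                    and SAFE_VALUE.match(value)):
                cleaned[k] = value
        return cleaned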

There are hundreds of different HTTP header attacks; e.g. I like this one, which splices a response from a different person's HTTP exchange into your HTTP response: https://portswigger.net/blog/http-desync-attacks-request-smu...


Well, this is the thing I'm confused about, having worked on some CloudFront stuff last month: documents like https://docs.aws.amazon.com/AmazonCloudFront/latest/Develope... gave me the clear impression that every header CloudFront passes to the origin will be considered part of the cache key. In particular, if you configure it to "forward all headers", which is what would surely be needed to pass this custom "X-Oversized-Header" onwards, it would effectively disable all caching.

I'm clearly going to have to spend another day staring at the cloudfront documentation.


And why weren't all headers considered in the cache key to begin with? The key computation is O(1) either way.

The only thing I can think of is that you'd get many cache entries with the same cached value, for cases where requests with different headers produce the same API response.

Which could lead to a large cache.


Why would you have a CDN configured to cache error responses from the origin? It shouldn't serve those at all.

It's also a good idea to whitelist headers and query params your app uses.

Better put: this sounds like an attack that only works on very poorly configured CDNs.


A main purpose of a CDN is to protect your resources by caching. Error pages still require server-side resources to create. Not caching error pages gives attackers an easy way to consume your server's resources, a la "DoS".


That is, if an attacker can find an endpoint that is erroring, which they shouldn't be able to if you've configured your CDN not to serve errors from the origin in the first place. There are obviously other good reasons not to serve error messages out to the public.

Anyway, if you're an attacker it's easier to DDoS URLs that will 404, as those typically must go back to the origin anyway and likely won't be safe to cache for long even if the CDN is configured to do so for 4xx responses. To protect against that, your CDN provider probably has some sort of DDoS protection feature as well.


If the CDN doesn't serve an error, what, pray tell, should it return when the request triggers an error on the origin?


Depends what kind of error you're referring to, but if it's a 5xx, CDNs can be set up to serve a static HTML page. That page can contain as much or as little info about the error as you like (including no info; if you think the endpoint is likely to come under attack, the page could be disguised as something legit).

Of course you should be monitoring for and logging these error responses from the origin and fixing them as soon as possible. The CDN response is just there to provide cover. Again, that is if you need or want it. If you want to expose errors to the public, go right ahead; nobody is going to stop you.


In the text it points out that adhering to the standard regarding what may be cached - i.e. not most kinds of error responses - is a key mitigation, so it's pretty noteworthy that Amazon CloudFront™ seems to be by far the most afflicted CDN type. Also nice to see that Good Old Squid fares exceptionally well.


Maybe I'm missing something, but is the exploit somehow generating a cache miss? Otherwise everything that's already in cache isn't vulnerable to this right? Not that it makes it in any way less scary, but slightly more complicated at least..


Yeah, that’s part of it. The attacker wants to hit a url that isn’t already cached by CDN. That might be easy or hard depending on the site. Like if the CDN just has a 30 minute TTL, the attacker will need to be the first request right after the 30 minutes expires.


Why would a CDN cache a 400?


A CDN that doesn’t follow HTTP standards

“One of the main reasons for HHO and HMC CPDoS attacks lies in the fact that a vulnerable cache illicitly stores responses containing error codes such as 400 Bad Request by default. This is not allowed according to the HTTP standard.“


It seems that it should be feasible to cache more kinds of errors if the request that populated the cache and the subsequent request are identical. These attacks all rely on that not being the case. However, "identity" is a more slippery concept than most might think. Generally it requires putting requests into some canonical form, but defining that canonical form (especially what it excludes) requires making exactly the same kinds of distinctions that were missed to make these attacks possible. It just shifts the problem around, and introduces new potential for breakage. In the end it's no better than just following the darn standard, whose authors probably defined what was cacheable with exactly these concerns in mind.


Depends why you're using the CDN. If it's purely for geographic access times, then passing 400s back through from the origin without caching them is reasonable behaviour. If the CDN also offers DDoS protection, and an attacker can just trigger a 400 and always hit the origin, then your DDoS protection is near-useless.


400 means the request is malformed. If you want to block for geographic reasons, 400 is certainly not the appropriate HTTP status code.


I think you misunderstand me; some people don't care about the CDN capabilities as much as the caching and DDoS protection. If my web server takes 10 seconds to complete a request (yes, I've had really terrible journalistic sites that do that) and I can send malformed requests and always hit the customer's webserver, then I could get accidentally DoS'd, or intentionally DDoS'd with very little effort.


Some customers also believe they want this: CDNs are often about both performance (content closer to the user) and reduced origin load. Caching a (too!) broad set of error codes "achieves this", with the caveat of a poisoned cache.


Many CDNs cache error responses for a shorter period of time. For example, an image might be cached forever, but only for 5 minutes if the request results in an error response.
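
The policy usually boils down to something like this (a sketch; the TTL numbers are illustrative, not any CDN's defaults):

    def ttl_for(status):
        if 200 <= status < 300:
            return 86400   # successful responses: cache for a day
        if status in (301, 308):
            return 3600    # permanent redirects: an hour
        if 500 <= status < 600:
            return 5       # brief caching of origin errors to shed load
        return 0           # 4xx: don't cache at all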


An origin error (i.e. 5XX) is okay to be cached for a while. But a 400 is an error in the request so it should never be cached, ever. I guess that's the point of the article, I'm very surprised that CloudFront caches 400s.

404s could be cached, I think, but it's a risky one.


Theoretically, if you key the cache on the full request (every header, everything except the request time) being identical, you could even get away with breaking the RFC and caching a 400 to save the backend generating it again. This specific problem only happens if you do all three of: 1. pass the troublesome header through, 2. cache the 400, 3. treat a new request to the same resource without the poisoned header as if it were the same as the one with the poisoned header.


You probably know this already, but if you cache 404s you should have a way to purge all cached 404 entries when you do a content push.


I don't understand the problem. If there's something wrong with the request, why expect a different response if the same request is sent again?


The problem is that they're caching the resource based on the URI only, or on the HTTP method and URI only.

If you're caching 400s (which, as many people have noted, is against the RFC and can be troublesome) then you need to make sure you're only caching them for a matching request. If I send a poisoned header and you cache a 400 for my full request, including that poisoned header, that's one thing. If I send a poisoned header and you cache a 400 for all accesses to that URI, whether or not they include the header that caused it to be a bad request, that's vulnerable to this BS.
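
To make the difference concrete, two toy keying functions (my own naming, not any vendor's API):

    def uri_only_key(req):
        # A poisoned 400 cached under this key is replayed to *every* later
        # client that asks for the same path.
        return req["method"], req["path"]

    def full_request_key(req):
        # Here the poisoned 400 only comes back to requests carrying the exact
        # same (poisoned) header set.
        return (req["method"], req["path"],
                tuple(sorted(req["headers"].items())))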


The error could be (and I'd say usually is) transient, i.e. you hit a page and something is broken, you reload and then it's fixed.

Real-life example I'm familiar with: Content being served from a busy NAS that buckled under load. You'll have some requests time out with 504s, some return 500s, some that make the files appear missing and so you get 404s. I know, braindead design that shouldn't happen, but it does happen.


Is that really a legitimate attack vector?

I would expect web servers to just discard weird headers and serve the regular page. And, at the same time, I would also expect a CDN not to cache 400 Bad Request or 500 Server Error responses.


Yes, it is. Many server/proxy/cache/ad-hoc-header-filter configurations are not handled properly: they don't expect that some clients will tamper with their headers, or the people who configure them don't take those headers into account, so the headers aren't discarded (a better approach is to use whitelists, but that's no panacea either). CDNs may cache or not, but I have seen a variety of behaviors even for the same CDN, so again I suspect many times this could be caused by misconfiguration.


I can think of only rare occasions where any status other than 200 is cached. Usually this is an edge case that has to be configured in the caching tier.


Not caching your error pages is a good way to bring down your servers once something goes wrong. We usually have a short cache for errors to make sure there is an upper limit on the number of requests coming through the cache.


Redirects are often cached as well, but caching error pages for a short while is usually done to mitigate DoS attempts; if the cache can be repeatedly bypassed for certain resources, those resources may serve as targets.


One more mitigation idea: set non-trivial expiration times and "prime" your caches. When you update a resource, hit all your CDN endpoints (at least ones which don't follow spec and are known to be exploitable) to force them to download and cache the good copy. Bake it into your CI deployment process.
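
A minimal sketch of that priming step (hypothetical edge hosts and paths; assumes the Python `requests` library):

    import requests

    EDGE_HOSTS = ["https://cdn-edge-eu.example.com",   # made-up endpoints
                  "https://cdn-edge-us.example.com"]
    PATHS = ["/", "/app.js", "/logo.png"]              # assets just deployed

    # Fetch each asset through every edge right after deploy, so the first
    # cached entry is a known-good response rather than an attacker's 400.
    for host in EDGE_HOSTS:
        for path in PATHS:
            requests.get(host + path).raise_for_status()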


I usually do not let any headers through the CDNs. I need to try this attack out on some of our infra; not sure if any of it is affected if you have a tight header policy.


The title is hyperbolic, but the article is clear and well researched. The issue seems to affect CloudFront more than other CDNs, judging from the matrix provided.



