Hacker News
Jerks on the Internet: what my first DDoS taught me (sergiomattei.com)
208 points by sergiomattei on April 3, 2019 | 91 comments



If you have some kind of expensive request, use fair queuing by IP address. If someone has a request pending, more requests from the same source go behind IP addresses with fewer requests. So each IP address competes with itself, not others.

For some reason, this isn't done much. I have it on a site of mine. I didn't notice for a week that someone was making a huge number of requests and not even waiting for the task to complete. It didn't hurt anything.
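A minimal sketch of that per-IP fair queuing idea (the structure and names are my own illustration, not the parent's actual implementation): each IP gets its own queue, and a worker services the IPs round-robin, so a client hammering the endpoint only delays itself.

```python
import threading
from collections import defaultdict, deque

class FairQueue:
    """Round-robin fair queue keyed by client IP: each IP's pending
    requests wait behind that IP's earlier requests, not behind other
    clients' backlogs."""

    def __init__(self):
        self.lock = threading.Lock()
        self.queues = defaultdict(deque)   # ip -> pending requests
        self.rotation = deque()            # IPs with work, serviced in turn

    def put(self, ip, request):
        with self.lock:
            if not self.queues[ip]:
                self.rotation.append(ip)   # first pending request for this IP
            self.queues[ip].append(request)

    def get(self):
        """Pop the next (ip, request), cycling across IPs; None when idle."""
        with self.lock:
            if not self.rotation:
                return None
            ip = self.rotation.popleft()
            request = self.queues[ip].popleft()
            if self.queues[ip]:
                self.rotation.append(ip)   # still has work; back of the line
            else:
                del self.queues[ip]
            return ip, request
```

An abusive IP that enqueues 1,000 requests still only gets one slot per rotation, so everyone else's latency stays flat.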


It’s not used because any serious attack is going to come from multiple unrelated sources, think a botnet full of compromised IoT devices hitting your server with 20TB a second worth of requests. So you might as well plan for that scenario instead.


That's an entirely different kind of attack. If your server is being flooded with 20TB/s of traffic, there's nothing you can do on the box itself to fix things. Whatever you do, the legitimate requests won't be able to get through.

If however your DOS attack is an attacker making lower-volume CPU-expensive requests on your site, there's plenty of things to help mitigate the assault.


That's the key difference between a DOS and a DDOS, yes. Since I can't load the OP article on this machine for some reason, I can't review the symptoms described. The title of the article does say DDOS, however, which is focused more on saturating bandwidth, not CPU.


The headline and article itself do indeed (incorrectly) call it a DDOS attack. The attack itself was very much non-distributed, and involved a single malicious actor `curl`ing an expensive API endpoint in a loop.


> The title of the article does say DDOS, however, which is focused more on saturating bandwidth, not CPU.

Sadly DDoS has at times been used as a blanket term that also includes DoS.


Those requests seldom get far enough to start significant server activity. It's the ones that look like legit requests that are the problem.


Only a small percentage of a volumetric attack has to get through to take you down. Also, depending on the attack, they probably all are "legit requests."


If the source IP is fake, the request can't get beyond the first packet. Those get filtered out easily. That's Cloudflare's main offering.


Seldom isn’t good enough. When it comes to security you have to be right 100% of the time. An attacker only has to be right once. Good luck.


That's not true. Security is always a trade-off between effort invested and probability of a possible breach.

There is no 100% secure system.


This is not true though. Security is about mitigating threats at some cost. There are some threats you can't mitigate cost-effectively, and some you can't mitigate at all.


It's about layers, which include real security and sometimes trickery. The statement is true, though, about 100% and the attacker only needing to succeed once, but some tact may be in order. There are procedures that can reduce attack vectors, minimizing damage and minimizing successes.


True, fair queuing won't solve the problem completely, but it will solve the elephant- vs. mice-flow problem, i.e. keep your interactive traffic from suffering queueing latency behind a simultaneous big transfer.


Has anyone seen a 20TB attack?


The biggest DDoS attack to date took place in February of 2018. This attack targeted GitHub, a popular online code management service used by millions of developers. At its peak, this attack saw incoming traffic at a rate of 1.3 terabits per second (Tbps), sending packets at a rate of 126.9 million per second.

https://www.cloudflare.com/learning/ddos/famous-ddos-attacks...


Never mind that the units would probably be 20Tb/s, it wouldn't be impossible. We saturated our gigabit Ethernet with 8 image-upload workers from a Flask app. We could have saturated a 20Tb/s line (if such a thing existed) with 160,000 workers. I've seen botnets for rent with something like 30k bots, which means you might be able to rent enough machines with enough bandwidth to saturate 20Tb/s for maybe $1k/hr; multiply by 8 if you really mean 20TB/s. I do think 20TB starts getting into big-player territory though, especially since anyone with that much bandwidth is going to have teams dedicated to mitigating these kinds of problems.


And even if you haven't had the time or knowledge to set up fair queuing, this attack was very light, and an awk log-parsing script that appends IPs to a firewall would have been more than sufficient.

Certainly no need for cloudflare
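A rough sketch of such an awk-based approach (the log path, threshold, and the iptables ban are illustrative assumptions, and the log format is assumed to have the client IP in the first field):

```shell
# Ban any IP with more than THRESHOLD hits in the access log.
LOG=/var/log/nginx/access.log   # assumed path; adjust for your server
THRESHOLD=1000

awk -v t="$THRESHOLD" \
    '{ hits[$1]++ } END { for (ip in hits) if (hits[ip] > t) print ip }' \
    "$LOG" 2>/dev/null |
while read -r ip; do
    # Drop further traffic from the offender (requires root).
    iptables -I INPUT -s "$ip" -j DROP
done
```

Run it from cron every few minutes and it covers the same ground as a very basic fail2ban jail.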


How have you set it up, technology-wise?


Hope 10% of this might be useful to you.

1) The three most important metrics for any endpoint are error rate, latency, and throughput. So I hope you've learned not to be surprised that abnormal throughput (whether from DDoS attacks or friendly n+1 queries) is a common error condition.

2) Banning IP addresses is useless and often counterproductive. If possible, short-circuit requests from an IP so you can isolate them and then determine how much damaging knowledge they have (while letting them think they're still hitting your real service).

3) Feature development is a great goal, but don't forget that the most important feature is availability. Spending an extra 30 minutes to consider things like pagination ends up being worth it if you think you might get attacked more than 0 times per year.


>short-circuit requests from an ip

What does that mean?


I’d guess serving them a static placeholder or cached result to prevent them from hammering the DB?


In an electrical circuit you have the defined path a current is supposed to take. A short circuit is when the current takes a shorter path to the ground. So A->B is the normal route a request would make. Short circuiting in this regard means A->C for that one ip. The goal is to reduce load on your server (IE you return a page that the attacker think is still valid, but has reduced load on your infra).
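As a hypothetical illustration of that A->C routing, here is a WSGI-style middleware sketch where flagged IPs get a cheap canned response instead of reaching the app and database (the flagged set and placeholder payload are made up for the example):

```python
import json

FLAGGED_IPS = {"203.0.113.7"}  # hypothetical; fed by whatever detects abuse

def short_circuit(app):
    """WSGI middleware sketch: flagged IPs receive a plausible-looking
    canned 200 instead of reaching the real app (and the DB behind it)."""
    def wrapper(environ, start_response):
        if environ.get("REMOTE_ADDR") in FLAGGED_IPS:
            # Looks like a normal response to the attacker, costs ~nothing.
            body = json.dumps({"results": [], "count": 0}).encode()
            start_response("200 OK", [("Content-Type", "application/json")])
            return [body]
        return app(environ, start_response)
    return wrapper
```

The attacker keeps seeing 200s, so they have no signal to rotate IPs, while your infrastructure load for their traffic drops to near zero.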


Good steps to take in the article. But I'd also add that Django REST Framework, which they seem to be using, has throttling capabilities built-in which I would have attempted before changing the API: https://www.django-rest-framework.org/api-guide/throttling/

Adding pagination seems reasonable, but it may have broken clients who didn't expect pagination to be there.
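For reference, the DRF throttling linked above is mostly a settings change; a sketch of the global configuration (the rates shown are arbitrary examples, not recommendations):

```python
# settings.py: enable DRF's built-in request throttling globally.
REST_FRAMEWORK = {
    "DEFAULT_THROTTLE_CLASSES": [
        "rest_framework.throttling.AnonRateThrottle",  # keyed by client IP
        "rest_framework.throttling.UserRateThrottle",  # keyed by user id
    ],
    "DEFAULT_THROTTLE_RATES": {
        "anon": "60/min",    # unauthenticated clients
        "user": "1000/day",  # authenticated clients
    },
}
```

Per-view `throttle_classes` can override this where one endpoint is notably more expensive than the rest.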


Hi! Thanks for the recommendation, I did go with that approach


Consider limiting HTTP access to Cloudflare's IP range. Looking up DNS history reveals the real IP address for direct attacks.

    curl -k https://134.209.46.107/products/ -H "Host: api.getmakerlog.com"
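One way to sketch that restriction with ufw (only two Cloudflare ranges are shown as examples; pull the full, current list from cloudflare.com/ips before relying on this):

```shell
# Allow web traffic only from Cloudflare's published ranges,
# then refuse direct hits on the origin IP.
for range in 173.245.48.0/20 103.21.244.0/22; do
    ufw allow proto tcp from "$range" to any port 80,443
done
ufw deny 80/tcp
ufw deny 443/tcp
```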


Seconded. Argo is also great and makes this even easier (for $5/month).

https://www.cloudflare.com/products/argo-tunnel/


>With growth also come the assholes

That pretty much describes the whole history of the internet.


... the whole history of our species.


> Therefore, when requesting the endpoint, a massive SQL request would be made, freezing the server while the items were fetched + serialized into JSON (a Django REST Framework performance weak point).

No caching?


I was surprised to not see that in the things I will do / fix list. Caching those JSON API endpoints would have dramatically changed the ability to absorb that DDoS. If merely adding pagination brought it back to functioning (from 100% CPU to ~60%), it wasn't a very large attack and caching would have trivially handled it. Either way, even with Cloudflare and pagination, they should prioritize adding caching on the API at some point in the near-term. The relief on the database will be considerable and it'll buy a lot of API usage growth runway at almost no cost.

Since they're already using Nginx, if they don't want to bother with learning anything else, it's a couple of hours of research to learn how to set up rock solid basic caching using Nginx. It'll quickly get you 85% of the way on caching, until you need something better. Set Nginx loose to do one of the things it's very good at.
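A minimal sketch of what that basic Nginx caching might look like (paths, zone sizes, TTLs, and the upstream address are placeholder values, not the site's actual config):

```nginx
# http context: define a small on-disk cache zone.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:10m
                 max_size=100m inactive=10m;

server {
    location /api/ {
        proxy_cache api_cache;
        proxy_cache_key "$scheme$request_method$host$request_uri";
        proxy_cache_valid 200 10s;   # even a 10s TTL absorbs a flood
        proxy_cache_use_stale error timeout updating;
        add_header X-Cache-Status $upstream_cache_status;
        proxy_pass http://127.0.0.1:8000;   # example upstream
    }
}
```

Microcaching with a short TTL like this keeps data effectively fresh while collapsing thousands of identical requests per second into one database hit.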


Can anyone shed some light on why someone would go out of their way to conduct an attack like this? Is DoSing production web applications just a hobby for black hat jackasses with nothing better to do?


Everyone has mentioned the destructive nature of some people. However I would also point out that some use a DoS as a means to cover up trails in logs and distract the admin from another type of attack. If you're paying attention to the DoS, you might not notice any logs or alerts about someone downloading 6GB of data from your database.


Lumping DDoS in with hacking has always seemed disingenuous. skiddies DDoS things. Why? Who knows. Boredom? lulz? Perceived personal slight? "Justice"?


Pretty sure it's jackasses with nothing better to do. Either for revenge or the lolz. Probably an impulsive decision since the attacker gave up so fast.

There's not much you can achieve for your own gains with a DDoS. I've heard of rare cases of extortion or underhanded business practices to hurt competitors.


It's a rush. It's why some people with money still shoplift.

I guess humans are closer to monkeys than we like to believe and still like throwing poop at each other.


I don't know much about "blackhats" but I know people, so the answer is yes.


I helped build a website for a cryptocurrency. We were DDoSed, with the attacker contacting us via Telegram to demand around $10,000 for him to stop. We told him to stop acting like a fucking child and implemented Cloudflare.


I've seen people with substantial IT skills who simply have destructive instincts. This one guy I know was hyperactive and made clearings in forested areas to hang out in, in his spare time.


In general, about 10% of people are just drizzleshits. Plain and simple.


It's fun. Or maybe you are angry. Or maybe you just want to test whether you can break it. Or maybe you just want the programmers to feel sad, maybe because they were boring.


Fun in a same way as trashing your neighbor's car just because you like to see sparks and glass shards flying all around. Rather an indication of a sad frustrated life


>> to see sparks and glass shards flying all around

Well, no. If you _only_ liked sparks and glass shards flying all around, you'd buy yourself a car and destroy it yourself.

Destroying the car of a neighbour implies something going between you and the neighbour.


It's usually kids doing it for the same reason they graffiti or vandalise other people's stuff.


>which hosts other in-development apps too

Don't do that

>I trust my users completely.

Don't do that

>Prioritize bugfixes over new features

while that is a nice thought, it is unlikely to be so simply followed. "the road to hell is paved with good intentions"

>people editing other’s tasks for example, haha

That's a very cavalier attitude to take. Quite frankly, that should have been baked in from the get-go, with tests to verify.

You see this situation as someone being a jerk. Someone could have accidentally done the same thing due to the lack of planning on multiple levels.


It was actually all unit tested - bugs happen though, and that one was a particularly nasty one.

It was patched though and there was no evidence that anyone ever used it.


Glad we at Cloudflare could help!


I will also say Cloudflare has saved sites I’ve managed numerous times. I’ve been null routed by major data centers for a few Gbps and they wouldn’t give us any options. It was always “wait.”

Most of the DDoS providers in this space are insanely expensive so I’m very glad Cloudflare has existed!


I hope another takeaway from this was to check at a lower level much sooner. At least the way it reads, you spent quite some time suspecting your app was at fault or the tech stack was goofing out. htop would have shown high CPU usage from the DB process right away, and traffic was probably higher than usual; access logs are always a good thing to check too.


The lesson here appears to be that the balance between feature development and fixing technical debt isn't obvious: these things are measured historically, and you only know that you've got it wrong after the fact. In fact, if you ship the exact same codebase and don't suffer the DDoS, did you get the balance right after all?


-A INPUT -p tcp --dport <port> -m limit --limit 25/minute --limit-burst 100 -j ACCEPT
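Worth noting: on its own, that ACCEPT rule just stops matching once the limit is exceeded, so the overflow falls through to whatever comes next in the chain. A fuller sketch with an explicit drop (port and rates are examples):

```shell
iptables -A INPUT -p tcp --dport 443 -m limit --limit 25/minute --limit-burst 100 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j DROP
# Caveat: -m limit is a single global packet rate; per-source limiting
# needs something like -m hashlimit with --hashlimit-mode srcip.
```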


Why not just block all visitors with user agent curl?

Let me just say, that's a really dumb DDOS attack.

Edit: one other thing, we spell it psych and not sike.


>we spell it psych and not sike.

I have been informed by someone a generation younger than me that "the kids" are intentionally spelling it "sike" these days.

Wikionary lists it as a variant of "psych."

https://en.wiktionary.org/wiki/sike


I guess that depends on who is doing the spelling. Seems to me that the 80's spelling of it was "sike". This website seems to agree:

http://www.inthe80s.com/glossary.shtml


oh..gosh. well, that makes sense, in a way.

I confess I did have to google how to spell 'psych' properly. started from 'pysch'.


I always was sure it was spelled psych, since it's basically short for "psyched out" and I had always seen it spelled that way.

But the topic recently came up in conversation with my teenaged niece, she claims it's definitely spelled "sike" even though she's fully aware of the etymology.

I chocked it up to a "kids these days" generational kinda thing, like how we got "phat" in the 70s and "kewl" in the 90s, both of which are now in the Oxford English Dictionary.


>I chocked it up

chalked it up ;)


I mostly learned how to spell it recently from looking up filming location for the show Psych.


Spelling it sike is leet.


Can anyone recommend monitoring solutions to help identify these issues?


1. CloudFlare. That's like the first thing I normally do for any new project. (which he then switched to).

2. Your clients should not be the first to tell you that your site is down. Pingdom does a great job of alerting you before your customers/users do.

3. The author brought up that some of the queries weren't paginated and ran expensive SQL, so there are a few options. Since he's using Django, there are some Django-specific options in this list.

A) Implement a backend cache that will return back the JSON query (throw it on a redis). Cache and return that from the backend.

B) Add a Django throttle to the view (can be done via IP / username).

C) Enable logged in users only to access endpoint (harder to do on the fly though, since you need to make changes to your frontend). If a logged in user is causing you hell, turn off signups and kick that user off.

D) Have CloudFlare cache a public response for you on endpoints and return it (you need to make sure the JSON should always be the same for every API call though, which is very very risky).

E) Author brought up DRF JSON serialization is slow. Another alternative is to use Serpy which sees a 50-100x speedup. I'd only recommend that for complex JSON payloads. Not because it's hard, but because it's additional complexity.

The author is also using Dokku which is fine for most projects, but you'd imagine at some point it'll probably be switched onto a load balancer + web machines. Alerting can also be set on the load balancer level if it goes above % threshold.

Since he's using Dokku (so by that definition docker), they could probably use a log aggregation service that would allow him to access his logs much faster to see what's going on. Papertrail, etc.

Monitoring CPU usage would also be helpful here, but I'm not sure if Dokku allows that.


Strongly recommend Cloudflare in combination with some uptime monitoring service. Cloudflare gives you so many options on ways to mitigate attacks, and you get CDN and other services for free. Pretty great.


I recently started reading the book "Release It", which includes a lot of great techniques to avoid problems like the one described in the article at "design time". [1]

1: https://pragprog.com/book/mnee/release-it


Website doesn't work with JS turned off. It's even worse: It's one of those sites that redirects you to another page to tell you to turn JS on but doesn't allow you to go back in your browser. So even if you do turn on JS you have to go get the link again. (Many scientific journals have the same problem but with cookies instead of JS.)


Why is the text for this article 2.5" wide in a normal browser? It makes it annoying to read :(


Was the site designed only for smart phones?


Thanks for the feedback! Will modify and enlarge the font a little bit.


The font size is fine. The width of the text block is absolutely ridiculous.

http://webtypography.net/2.1.2

https://practicaltypography.com/line-length.html


I opened the print dialog, and discovered that this short piece would take 65 pages to print. Yes, that is a sign that the width of the text block is absolutely ridiculous.

I couldn't figure out what's going on with my laptop, because the inspector short-circuited this goofy behavior. On a larger screen, I notice that the ".card-content" div has a ridiculous 150px of padding. It is nested in a ".card.blog-content-card" div, which has an atrocious 200px of margin. That in turn is nested in a ".blog-post-container.container" div, which has a merely unseemly 93px margin.

After 886px is used for white space, there's not much screen left for text. Might want to fix that?


Looks much better now, thanks for the update.


turning off #blog-post's padding and the .blog-content-card's left & right margins makes the layout nice on a desktop.


Hello, this is a response to an earlier comment regarding Thorne/Blandford’s book. If a reading group exists by now, please let me know.


Nginx rate limit the endpoint, log offenders, auto fail2ban.
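The nginx side of that is a couple of directives; a sketch (zone size, rate, endpoint path, and upstream are example values):

```nginx
# http context: per-IP token bucket, 10 requests/second.
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    location /products/ {
        limit_req zone=api burst=20 nodelay;  # absorb small spikes
        limit_req_status 429;                 # logged, easy to grep for fail2ban
        proxy_pass http://127.0.0.1:8000;
    }
}
```

The rejected requests show up in the error log with a distinctive message, which is what the fail2ban filter would key on.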


The UFW rule not having worked for you may have been Docker's fault.

If your gateway/webserver is running in a Docker container and you've published port 80/443, Docker will set up its own iptables rules, bypassing anything you've set up using UFW.
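For anyone hitting this: Docker documents the DOCKER-USER iptables chain as the supported place for user rules on container traffic, since Docker's own rules are evaluated after it (the address below is a placeholder):

```shell
# Blocks the source before Docker's NAT/forwarding rules ever see it.
iptables -I DOCKER-USER -s 203.0.113.7 -j DROP
```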


Good read. Some thoughts:

- Add throttling at nginx level

- Proactive monitoring and alerts needed

- Should fool the attacker into querying a fake endpoint


Hah. I remember the first time my server got DDoS'ed. I was scared shitless and I was stressing so much. Glad everything turned out OK in the end.


What doesn’t kill you makes you stronger.


My version of the saying is - what does not kill you cripples you.


“What doesn’t kill you makes you smaller” - Super Mario


get yourself a low orbit ion cannon & render all your enemies baseless


This is ironic. I'm just now getting a DDoS from a botnet in China.


That's not ironic.


Does anybody have experience of getting DDoS'd? All I see are 3 offending IP addresses in the screenshot, and it makes me wonder how many are typical.

I have never been DDoS'd, and all I ever receive are failed SSH attempts with simple passwords, pretty much the easiest thing to tackle. But I'd love to hear from DDoS'd people what their attacks looked like.

From the Cloudflare logs all I see is a single IP address being blocked (multiple times? Or is it their multiple actions being blocked?).

EDIT: Thanks all, these were helpful answers!


I've gotten a good number of DDoSes sent my way. For the ones I've noticed, there's usually pretty good IP diversity. Volumetric attacks are either like tcp syn floods, spoofed from everywhere, or udp reflection spoofed from you to reflecting hosts, which have pretty good diversity. If you want to survive these, you need to have either a big connection, or packet filtering by someone with a big connection. As of a few years ago, 10Gbps was enough to ignore casual attacks, as long as your IP stack is up for it -- you may need to do a bit of tuning and make sure you've got recent syn handling. On the other hand, if you're running a 10Gbps connection, be sure you're not a reflection target -- be extremely careful about running UDP servers that send significantly larger replies than the requests, if they're exposed on public ips.
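For the "bit of tuning" and "recent syn handling" mentioned above, the usual Linux knobs look something like this (illustrative values, not tuned recommendations):

```shell
sysctl -w net.ipv4.tcp_syncookies=1          # SYN cookies when the backlog overflows
sysctl -w net.ipv4.tcp_max_syn_backlog=65536 # room for half-open connections
sysctl -w net.core.somaxconn=65536           # accept-queue ceiling per socket
```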

Layer 7 attacks are different; you can't spoof those, so you don't get perfect distribution -- but there are lots of ways to distribute simple requests. If the requests are coming from a botnet, there's usually a lot of control about what the requests look like, but if they're coming in through tricking other software (which is unfortunately common), then at least you'll likely have some identifying information; it's dumb to block things by user-agent, but it can be pretty effective. The way to handle these is really to try to make sure the effort your server spends is roughly on par with the effort the client spends; and try to make sure you're running the best optimized TLS handshakes you can (ECC certs are easier on servers than RSA).


I don't know where the line goes between DOS and DDOS, but in my youth I had a virtual server attacked by an acquaintance for a few days. They still won't admit to doing it, but all available evidence at the time pointed to them.

It started as a TCP SYN flood, which I had never seen before -- but since it originated from only two addresses, I could block them. Once I did, more addresses joined in on the attack. I figured out what was happening and how to prevent the connection table from filling up completely under the SYN flood, and then the attack changed shape into a UDP flood, from yet more addresses.

(Many of the addresses corresponded to free shell hosts and managed webhosts with easily exploitable PHP scripts. I tried to inform the people running those hosts that they were being used for an attack, but the vast majority of them seemed to ignore that. A couple of people responded and we could figure out kinda what had happened and what scripts were involved. Remember, kids, that connecting a machine to the internet is a great responsibility.)


They tend to look like that in your Apache logs when they target at the web application level.

Once you're flooded and your process load spikes, the logs won't show you the requests your web server has run out of workers to handle.

Netstat will give you a better picture as far as IPs/connections once you get the process load down so you can actually run anything.


Anybody know why blocking the offending IP didn’t work?


Most likely because he was trying to block the IP at a point behind some other forwarding service. At that point, the offending IP only exists in the X-Forwarded-For HTTP header, not as the source address of the request.
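If the server behind the proxy is nginx, its real_ip module can restore the client address from that header so IP-level blocks work again; a sketch (the trusted proxy range is an example, and with Cloudflare specifically, their CF-Connecting-IP header is another option):

```nginx
set_real_ip_from 103.21.244.0/22;   # range(s) of your trusted proxy/CDN
real_ip_header X-Forwarded-For;
real_ip_recursive on;               # skip trusted hops, keep the client IP
```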


Yup! This was it.


I have had this happen before, in a situation similar to the OP's: they found a heavy page that hammered the database until it went down. Banning their IP at the firewall stopped them, but they came back a few days later using a bunch of proxies, about 1000 of them. I wrote a script like Fail2Ban that detected the IPs and blocked them, and it worked a treat.

I was also attacked by a single IP that sent 10Gbit of nonsense to Apache. I had to contact my ISP to get them to block the IP downstream from me; if they had used a botnet, I don't think I would have been able to stop it.


By definition, DDoS is distributed. From my experience working on a Layer 4 DDoS protection solution, a typical case often ranges from 1,000 to 100k flows.



