I wrote an HN post about it as well: https://news.ycombinator.com/item?id=26105890, but to spare you the irrelevant details and the digging through comments for updates, here is what worked for me. You can block all their IPs, even though they may have A LOT and can change them on each call:
1) I prepared a fake URL that no legitimate user will ever visit (like website_proxying_mine.com/search?search=proxy_mirroring_hacker_tag)
2) I loaded that URL like 30 thousand times
3) from my logs, I extracted all IPs that searched for "proxy_mirroring_hacker_tag" (which, from memory, was something like 4 or 5k unique IPs; see the sketch after this list)
4) I blocked all of them
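For anyone wanting to reproduce step 3, here is a minimal sketch of the log-extraction part, assuming an nginx/Apache-style access log where the client IP is the first field on each line (the log path and output file are placeholders, not the OP's actual setup):

```typescript
// honeypot-ips.ts - extract unique client IPs that requested the honeypot tag.
import { readFileSync, writeFileSync } from "fs";

const LOG_FILE = "/var/log/nginx/access.log";      // assumption: your access log path
const HONEYPOT_TAG = "proxy_mirroring_hacker_tag"; // the fake search term from step 1

const ips = new Set<string>();
for (const line of readFileSync(LOG_FILE, "utf8").split("\n")) {
  if (!line.includes(HONEYPOT_TAG)) continue;
  const ip = line.split(" ")[0]; // first field in common/combined log format
  if (ip) ips.add(ip);
}

// One IP per line, ready to feed into a firewall or web-server deny list.
writeFileSync("blocklist.txt", [...ips].join("\n"));
console.log(`Found ${ips.size} unique honeypot IPs`);
```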
After doing the above, the offending domains were showing errors for 2-3 days and then they switched to something else and left me alone.
I still go back and check them every few months or so ...
P.S. My advice is to remove their URL from your post here. Linking to it only helps search engines pick up their domain and rank it with your content ...
Might I suggest a spin on this: instead of blocking the IPs, consider serving up different content to those IPs.
You could make a page that shames their domain name for stealing content. You could make a redirect page that redirects people to your website. Or you could make a page with absolutely disgusting content. I think it would discourage them from playing the cat and mouse game with you and fixing it by getting new IPs.
One possibility: Serve different content, but only if the user agent is a search engine scraper. Wait a bit to poison their search rankings, then block them.
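A rough sketch of that cloaking idea, assuming a Node front server and the proxy IPs gathered via the honeypot (the IPs and bot pattern below are illustrative placeholders):

```typescript
// cloak.ts - serve a "stolen content" notice to search crawlers arriving
// through the proxy's IPs, and errors to everything else on those IPs.
import { createServer } from "http";

const proxyIps = new Set(["203.0.113.7"]); // assumption: IPs from the honeypot step
const crawlerPattern = /googlebot|bingbot|duckduckbot/i;

createServer((req, res) => {
  const ip = req.socket.remoteAddress ?? "";
  if (proxyIps.has(ip)) {
    if (crawlerPattern.test(req.headers["user-agent"] ?? "")) {
      // Poison what the proxy's domain ranks for, then block outright later.
      res.writeHead(200, { "Content-Type": "text/html" });
      res.end("<h1>This domain republishes stolen content.</h1>");
    } else {
      res.writeHead(403);
      res.end();
    }
    return;
  }
  res.writeHead(200, { "Content-Type": "text/html" });
  res.end("<h1>Normal page</h1>"); // real content would go here
}).listen(8080);
```

One caveat: since it's a proxy, the crawler's requests reach you from the proxy's IPs, and the crawler's user agent is only visible if the proxy forwards it unchanged.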
Assuming you've monetized your content with ads, depending on your ads provider, this may have deleterious effects on your account with that provider, as they may then assume you're trying to game ads revenue.
Zip bombs are files that expand to enormous sizes when unzipped. I'm not sure if OP put one up for download to kill the offender's disk space, or whether you could stream one, hoping the client browser/scraper would try to decompress it and crash from memory or disk exhaustion?
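The streaming variant is usually done via HTTP content encoding: send a small gzip body that inflates enormously when the client decompresses it. A minimal sketch, with illustrative sizes (real deployments often use much larger or nested payloads):

```typescript
// gzip-bomb.ts - serve a small gzip payload that inflates to ~100 MB
// when a client honours Content-Encoding and decompresses it.
import { createServer } from "http";
import { gzipSync } from "zlib";

// 100 MB of zeroes compresses to roughly 100 KB of gzip.
const bomb = gzipSync(Buffer.alloc(100 * 1024 * 1024));

createServer((_req, res) => {
  res.writeHead(200, {
    "Content-Type": "text/html",
    "Content-Encoding": "gzip", // the client inflates this automatically
    "Content-Length": bomb.length,
  });
  res.end(bomb);
}).listen(8080);
```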
I think you missed the point - if people show up at $PROXY expecting nice stuff but see junk, they won't move over to $REAL; they'll blame $REAL instead.
E.g. you'd want some way to redirect people from the $PROXY site to the $REAL site, and disgusting content on $PROXY won't do that - it'll reflect poorly on $REAL.
It's a proxy, so there's no "crawler". It's just an agent relaying to the user. Passing something to this proxy agent just passes it directly to the user.
As soon as you have a few of their IPs, look them up on ipinfo.io/1.2.3.4 and you'll find they probably belong to a handful of hosting firms. You can get each firm's entire IP list on that page and add all of those CIDRs to your block list. Saves you needing to make 30K web requests.
In most countries in the western world, there are 3-4 major ISPs, and that's where 99% of your legit traffic comes from. Regular people don't browse the web by proxying via hosting centres, as Cloudflare will treat them with suspicion on all the websites it protects.
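A sketch of that ipinfo lookup, assuming the free JSON endpoint (the input IPs are placeholders; unauthenticated requests are rate-limited, so a token may be needed for thousands of IPs):

```typescript
// whois-ips.ts - group honeypot IPs by owning organisation via ipinfo.io,
// so whole hosting ranges can be blocked instead of single addresses.
const ips = ["203.0.113.7", "198.51.100.12"]; // assumption: IPs from your honeypot logs

async function main() {
  const orgs = new Map<string, string[]>();
  for (const ip of ips) {
    const res = await fetch(`https://ipinfo.io/${ip}/json`);
    const info = (await res.json()) as { org?: string };
    const org = info.org ?? "unknown";
    orgs.set(org, [...(orgs.get(org) ?? []), ip]);
  }
  // Most IPs should cluster under a handful of hosting firms;
  // grab those firms' CIDR lists and block the whole ranges.
  for (const [org, members] of orgs) {
    console.log(`${org}: ${members.length} IP(s)`);
  }
}

main();
```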
Yes, I constructed the honeypot URL on the proxy site and called it (thousands of times) to get them to fetch it from my server through their IPs, so I could log them.
They literally proxy your website? I thought they'd cache it ... your point about hitting their website with a specially formatted URL makes more sense now. Since they pass that through to you, you can filter on it.
Also: since you say 4k-5k IPs ... are any of them from cloud providers? Any specific location?
There is also the potential for them to use it as a watering hole for more sophisticated or subversive measures, where they subtly change what you post to promote something you don't actually promote (so at some point they deviate from a pure proxy to a MITM).
I didn't do it as a super quick burst, but over the space of multiple hours.
First, because the proxy servers were super slow, and second, because I couldn't automate it: their servers had some kind of bot detection which would catch me calling the URLs through a script.
Instead, I installed a browser extension which automatically reloads a browser tab after a specified timeout (I set it to 10 seconds or so), opened something like 50 tabs of the honeypot URL, and left them to reload for hours ...
Presumably they'll tell the copyright holder that sues them where they got the content from, provide evidence of that, and then the copyright holder will (also) sue the original source.
You could have a "stolen content" pure HTML/CSS banner that gets removed by JavaScript. Only proxy-site visitors will see the banner, because the proxy strips the JavaScript.
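A minimal sketch of that banner trick; the element ID is arbitrary, and it assumes the proxy strips <script> tags but leaves the rest of the markup intact:

```typescript
// banner.ts - shipped as a plain <script> on the real site.
// The static HTML contains the banner:
//   <div id="stolen-banner">You are reading stolen content.
//     Visit the original site instead.</div>
// On the real site this script runs and removes the banner;
// on a proxy that strips JavaScript, the banner stays visible.
document.getElementById("stolen-banner")?.remove();
```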
Some people like me, who browse with JavaScript disabled, will see the "stolen content" banner on the original website. And attackers can trivially remove it as soon as they become aware of it.
Thanks for the advice. I will give some of these a go. P.S. I can't remove the URL as the post is not editable anymore. I'm just waking up ... in Australia.
8chan, like every forum ever, has dumb moderators who don't know how to do their job / overextend their hand (and the moderator position on web forums seems to attract people with certain mental disorders that make them seek out perceived micro-injustices, the definition of which changes from day to day).
There were a bunch of sites mirroring 8chan to steal content.
These were useful because they had a simpler / lighter / better user interface (aside from images being missing), and posts / threads that were deleted would stay up on the mirrors. Being able to see deleted posts / threads was highly useful, as the moderation on such sites tends to be utterly useless - essentially the output of a random number generator. It was hilarious reading "zigforum" instead of "8chan" in all the posts, as the mirror replaced certain words to thinly veil its operation. They even had a reply button that didn't seem to work, or was just fake.
tl;dr the web is broken and is only good when "abused" by proxies/mirrors.
Instead of blocking by IP, just check the SERVER_NAME/HTTP_HOST variables in your backend/web server (or even check window.location.hostname in the page's JavaScript), and if those contain anything but the original hostname, redirect to the original website (or serve different content with a warning to the visitor). With apache2/nginx this is easily achieved by creating a default virtualhost (which is not your website) in addition to your website's explicit virtualhost; the default virtualhost, which serves every other hostname, can then do the redirect.
Those variables are populated from the browser's Host header; unless the proxying server rewrites them, your web server will be able to detect the imposter and serve a redirect. If rewrites are indeed in place, then check on the frontend instead. Blocking by IP is the last resort if nothing else works.
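A minimal frontend sketch of that check, with example.com standing in for your canonical domain:

```typescript
// hostname-guard.ts - runs in the browser on every page load.
// If the page is served under any hostname other than the real one,
// bounce the visitor to the same path on the canonical domain.
const CANONICAL_HOST = "example.com"; // assumption: your real domain

if (window.location.hostname !== CANONICAL_HOST) {
  window.location.replace(
    `https://${CANONICAL_HOST}${window.location.pathname}${window.location.search}`
  );
}
```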
Presumably OP would only have to do this for a limited time, until the scammer gives up and moves on to an easier target. It's not the best, but I don't think it's as bad as you say.
Most websites these days already use JavaScript, and all modern browsers support it. Unless you're some really niche turbonerd website, nobody is going to notice or care ...
I've been surfing without javascript since 2015. Most websites continue to work fine without it (though some aesthetic breakage is pretty standard). About 25% of sites become unusable, usually due to some poorly implemented cookie consent popup. I don't feel like I'm missing out on anything by simply refusing to patronize these sites. I will selectively turn JS on in some specific cases where dynamic content is required to deliver the value prop.
Same; I even wrote a Chrome extension to enable JS on the current domain using a keyboard shortcut, but it has become more of a pain, especially on landing pages.
In my entirely casual understanding of English, "most" means the set that has more members than any other. When the comparison is binary (sites that work vs sites that don't), then "more than half" is both necessary and sufficient as a definition.
When comparing more than two options most could be significantly less than half (e.g. if I have two red balls, and one ball each of blue, purple, green, orange, pink, and yellow, then the color I have the most of is red, despite representing only one quarter of the total balls.)
That said, any attribute attaining more than half of the pie must be most.
In retrospect, the never in my previous comment was certainly an overstatement. While I agree with your reasoning, there is often a distinction between technically correct use of language, and what the hearer is likely to understand from what is said.
The other kind of problem is if the website is not really proxied but rather dumped, patched, and re-served. In that case the only option (if the JavaScript frontend redirect doesn't work) is blocking the dumping server by IP.
To identify the IPs, as pointed out in the root comment of this thread, you can create a one-pixel link to a dummy page, which dumping software would visit but a human wouldn't. Then you can see who visited that specific page and block those IPs for good.
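A server-side sketch of that trap, assuming a Node server and an in-memory blocklist (a real setup would persist the list or push rules to the firewall; the honeypot path is a made-up example):

```typescript
// honeypot-trap.ts - any client that requests the hidden honeypot path
// gets its IP recorded, and all of its further requests are refused.
import { createServer } from "http";

const HONEYPOT_PATH = "/assets/pixel-9f3a.html"; // hypothetical hidden link target
const blocked = new Set<string>();

createServer((req, res) => {
  const ip = req.socket.remoteAddress ?? "";
  if (req.url === HONEYPOT_PATH) {
    blocked.add(ip); // only a scraper would follow the invisible link
    console.log(`Blocked ${ip}`);
  }
  if (blocked.has(ip)) {
    res.writeHead(403);
    res.end();
    return;
  }
  res.writeHead(200, { "Content-Type": "text/html" });
  res.end("<h1>Normal page</h1>"); // real content would go here
}).listen(8080);
```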
I would think you'd want to be careful about search engines with that approach. Assuming the OP wants their site indexed, you could end up unintentionally blocking crawlers.
Yeah, if they're already rewriting content to serve ads (likely, since they're probably not doing this for altruistic reasons), you're just putting off the inevitable. While blocking or CAPTCHA'ing source IPs is also a cat-and-mouse game, it's much more effective over a longer period of time.
Maybe an HTML <meta> redirect tag that bounces through a tertiary domain before redirecting to your real one? If they noticed you were doing it they could mitigate it, but they might deem it too much effort and just go away.
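A sketch of that bounce page, served from the hypothetical decoy domain (the meta tag is standard HTML, wrapped here in a tiny Node handler):

```typescript
// bounce.ts - the tertiary/decoy domain serves nothing but an instant
// meta-refresh that forwards visitors on to the real site.
import { createServer } from "http";

const REAL_SITE = "https://example.com"; // assumption: your canonical domain

createServer((_req, res) => {
  res.writeHead(200, { "Content-Type": "text/html" });
  res.end(
    `<!doctype html><meta http-equiv="refresh" content="0; url=${REAL_SITE}">`
  );
}).listen(8080);
```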
You might also start with the hypothesis that they're using regex for JS removal and try various script injection tricks...
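For instance, here is a toy illustration of why regex-based stripping is fragile (the pattern below is a guess at what such a proxy might run, not anything confirmed):

```typescript
// regex-evasion.ts - a naive tag-stripping regex misses JavaScript
// that never appears inside a <script> tag at all.
const naiveStrip = (html: string) =>
  html.replace(/<script[\s\S]*?<\/script>/gi, "");

// Caught: an ordinary script tag.
console.log(naiveStrip(`<script>alert(1)</script>`)); // ""

// Missed: JavaScript in an event-handler attribute.
console.log(naiveStrip(`<img src="x" onerror="alert(1)">`)); // unchanged
```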