I wrote an HN post about it as well: https://news.ycombinator.com/item?id=26105890, but to spare you the irrelevant details and the digging through comments for updates, here is what worked for me. You can block all their IPs, even though they may have A LOT and can change them on each call:
1) I prepared a fake URL that no legitimate user will ever visit (like website_proxying_mine.com/search?search=proxy_mirroring_hacker_tag)
2) I loaded that URL like 30 thousand times
3) from my logs, I extracted all IPs that searched for "proxy_mirroring_hacker_tag" (which, from memory, was something like 4 or 5k unique IPs; see the sketch after this list)
4) I blocked all of them
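For anyone wanting to reproduce step 3, here is a minimal sketch of the log-extraction part, assuming an nginx/Apache-style access log where the client IP is the first field on each line (the log path and output file are placeholders, not the OP's actual setup):

```typescript
// honeypot-ips.ts - extract unique client IPs that requested the honeypot tag.
import { readFileSync, writeFileSync } from "fs";

const LOG_FILE = "/var/log/nginx/access.log";      // assumption: your access log path
const HONEYPOT_TAG = "proxy_mirroring_hacker_tag"; // the fake search term from step 1

const ips = new Set<string>();
for (const line of readFileSync(LOG_FILE, "utf8").split("\n")) {
  if (!line.includes(HONEYPOT_TAG)) continue;
  const ip = line.split(" ")[0]; // first field in common/combined log format
  if (ip) ips.add(ip);
}

// One IP per line, ready to feed into a firewall or web-server deny list.
writeFileSync("blocklist.txt", [...ips].join("\n"));
console.log(`Found ${ips.size} unique honeypot IPs`);
```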
After doing the above, the offending domains were showing errors for 2-3 days and then they switched to something else and left me alone.
I still go back and check them every few months or so ...
P.S. My advice is to remove their URL from your post here. Linking to it only helps search engines pick up their domain and rank it with your content ...
Might I suggest a spin on this: instead of blocking the IPs, consider serving up different content to those IPs.
You could make a page that shames their domain name for stealing content. You could make a redirect page that redirects people to your website. Or you could make a page with absolutely disgusting content. I think it would discourage them from playing the cat and mouse game with you and fixing it by getting new IPs.
One possibility: Serve different content, but only if the user agent is a search engine scraper. Wait a bit to poison their search rankings, then block them.
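A rough sketch of that cloaking idea, assuming a Node front server and the proxy IPs gathered via the honeypot (the IPs and bot pattern below are illustrative placeholders):

```typescript
// cloak.ts - serve a "stolen content" notice to search crawlers arriving
// through the proxy's IPs, and errors to everything else on those IPs.
import { createServer } from "http";

const proxyIps = new Set(["203.0.113.7"]); // assumption: IPs from the honeypot step
const crawlerPattern = /googlebot|bingbot|duckduckbot/i;

createServer((req, res) => {
  const ip = req.socket.remoteAddress ?? "";
  if (proxyIps.has(ip)) {
    if (crawlerPattern.test(req.headers["user-agent"] ?? "")) {
      // Poison what the proxy's domain ranks for, then block outright later.
      res.writeHead(200, { "Content-Type": "text/html" });
      res.end("<h1>This domain republishes stolen content.</h1>");
    } else {
      res.writeHead(403);
      res.end();
    }
    return;
  }
  res.writeHead(200, { "Content-Type": "text/html" });
  res.end("<h1>Normal page</h1>"); // real content would go here
}).listen(8080);
```

One caveat: since it's a proxy, the crawler's requests reach you from the proxy's IPs, and the crawler's user agent is only visible if the proxy forwards it unchanged.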
Assuming you've monetized your content with ads, depending on your ads provider, this may have deleterious effects on your account with that provider, as they may then assume you're trying to game ads revenue.
Zip bombs are files that expand to enormous sizes when unzipped. I'm not sure if OP put one up for download to kill the offender's disk space, or whether you could stream one, hoping the client browser/scraper would try to decompress it and crash from memory or disk exhaustion?
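The streaming variant is usually done via HTTP content encoding: send a small gzip body that inflates enormously when the client decompresses it. A minimal sketch, with illustrative sizes (real deployments often use much larger or nested payloads):

```typescript
// gzip-bomb.ts - serve a small gzip payload that inflates to ~100 MB
// when a client honours Content-Encoding and decompresses it.
import { createServer } from "http";
import { gzipSync } from "zlib";

// 100 MB of zeroes compresses to roughly 100 KB of gzip.
const bomb = gzipSync(Buffer.alloc(100 * 1024 * 1024));

createServer((_req, res) => {
  res.writeHead(200, {
    "Content-Type": "text/html",
    "Content-Encoding": "gzip", // the client inflates this automatically
    "Content-Length": bomb.length,
  });
  res.end(bomb);
}).listen(8080);
```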
I think you missed the point - if people show up at $PROXY expecting nice stuff but see junk, they won't move over to $REAL; they'll blame $REAL instead.
E.g. you'd want some way to redirect people from the $PROXY site to the $REAL site, and disgusting content on $PROXY won't do that - it'll reflect poorly on $REAL.
It's a proxy, so there's no "crawler". It's just an agent relaying to the user. Passing something to this proxy agent just passes it directly to the user.
As soon as you have a few of their IPs, look them up on ipinfo.io/1.2.3.4 and you'll find they probably belong to a handful of hosting firms. You can get each firm's entire IP list on that page and add all of those CIDRs to your block list. Saves you needing to make 30K web requests.
In most countries in the western world, there are 3-4 major ISPs, and that's where 99% of your legit traffic comes from. Regular people don't browse the web by proxying via hosting centres, as Cloudflare will treat them with suspicion on all the websites it protects.
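A sketch of that ipinfo lookup, assuming the free JSON endpoint (the input IPs are placeholders; unauthenticated requests are rate-limited, so a token may be needed for thousands of IPs):

```typescript
// whois-ips.ts - group honeypot IPs by owning organisation via ipinfo.io,
// so whole hosting ranges can be blocked instead of single addresses.
const ips = ["203.0.113.7", "198.51.100.12"]; // assumption: IPs from your honeypot logs

async function main() {
  const orgs = new Map<string, string[]>();
  for (const ip of ips) {
    const res = await fetch(`https://ipinfo.io/${ip}/json`);
    const info = (await res.json()) as { org?: string };
    const org = info.org ?? "unknown";
    orgs.set(org, [...(orgs.get(org) ?? []), ip]);
  }
  // Most IPs should cluster under a handful of hosting firms;
  // grab those firms' CIDR lists and block the whole ranges.
  for (const [org, members] of orgs) {
    console.log(`${org}: ${members.length} IP(s)`);
  }
}

main();
```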
Yes, I constructed the honeypot URL on the proxy site and called it (thousands of times) to get them to fetch it from my server through their IPs, so I could log them.
They literally proxy your website? I thought they'd cache it ... your point about hitting their website with a specially formatted URL makes more sense now. Since they pass that through to you, you can filter on it.
Also: since you say 4k-5k IPs ... are any of them from cloud providers? Any specific location?
There is also the potential for them to use it as a watering hole for more sophisticated or subversive measures, where they subtly change what you post to promote something you don't actually promote (so at some point they deviate from a pure proxy to a MITM).
I didn't do it as a super quick burst, but over the space of multiple hours.
First, because the proxy servers were super slow, and second, because I couldn't automate it: their servers had some kind of bot detection which would catch me calling the URLs through a script.
Instead, I installed a browser extension which automatically reloads a browser tab after a specified timeout (I set it to 10 seconds or so), opened something like 50 tabs of the honeypot URL, and left them to reload for hours ...
Presumably they'll tell the copyright holder that sues them where they got the content from, provide evidence of that, and then the copyright holder will (also) sue the original source.
You could have a "stolen content" pure HTML/CSS banner that gets removed by JavaScript. Only proxy-site visitors will see the banner, because the proxy strips the JavaScript.
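A minimal sketch of that banner trick; the element ID is arbitrary, and it assumes the proxy strips <script> tags but leaves the rest of the markup intact:

```typescript
// banner.ts - shipped as a plain <script> on the real site.
// The static HTML contains the banner:
//   <div id="stolen-banner">You are reading stolen content.
//     Visit the original site instead.</div>
// On the real site this script runs and removes the banner;
// on a proxy that strips JavaScript, the banner stays visible.
document.getElementById("stolen-banner")?.remove();
```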
Some people like me, who browse with JavaScript disabled, will see the "stolen content" banner on the original website. And attackers can trivially remove it as soon as they become aware of it.
Thanks for the advice. I will give some of these a go. P.S. I can't remove the URL as the post is not editable anymore. I'm just waking up ... in Australia.
8chan, like every forum ever, has dumb moderators who don't know how to do their job / overextend their hand (and the moderator position on web forums seems to attract people with certain mental disorders that make them seek out perceived micro-injustices, the definition of which changes from day to day).
There were a bunch of sites mirroring 8chan to steal content.
These were useful because they had a simpler / lighter / better user interface (aside from images being missing), and posts / threads that were deleted would stay up on the mirrors. Being able to see deleted posts / threads was highly useful, as the moderation on such sites tends to be utterly useless - essentially the output of a random number generator. It was hilarious reading "zigforum" instead of "8chan" in all the posts, as the mirror replaced certain words to thinly veil its operation. They even had a reply button that didn't seem to work, or was just fake.
tl;dr the web is broken and is only good when "abused" by proxies/mirrors.
Instead of blocking by IP, just check the SERVER_NAME/HTTP_HOST variables in your backend/web server (or even check window.location.hostname in the page's JavaScript), and if those contain anything but the original hostname, redirect to the original website (or serve different content with a warning to the visitor). With apache2/nginx this is easily achieved by creating a default virtualhost (which is not your website) in addition to your website's explicit virtualhost; the default virtualhost, which serves every other hostname, can then do the redirect.
Those variables are populated from the browser's Host header; unless the proxying server rewrites them, your web server will be able to detect the imposter and serve a redirect. If rewrites are indeed in place, then check on the frontend instead. Blocking by IP is the last resort if nothing else works.
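A minimal frontend sketch of that check, with example.com standing in for your canonical domain:

```typescript
// hostname-guard.ts - runs in the browser on every page load.
// If the page is served under any hostname other than the real one,
// bounce the visitor to the same path on the canonical domain.
const CANONICAL_HOST = "example.com"; // assumption: your real domain

if (window.location.hostname !== CANONICAL_HOST) {
  window.location.replace(
    `https://${CANONICAL_HOST}${window.location.pathname}${window.location.search}`
  );
}
```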
Presumably OP would only have to do this for a limited time, until the scammer gives up and moves on to an easier target. It's not the best, but I don't think it's as bad as you say.
Most websites these days already use JavaScript, and all modern browsers support it. Unless you're some really niche turbonerd website, nobody is going to notice or care ...
I've been surfing without javascript since 2015. Most websites continue to work fine without it (though some aesthetic breakage is pretty standard). About 25% of sites become unusable, usually due to some poorly implemented cookie consent popup. I don't feel like I'm missing out on anything by simply refusing to patronize these sites. I will selectively turn JS on in some specific cases where dynamic content is required to deliver the value prop.
Same; I even wrote a Chrome extension to enable JS on the current domain using a keyboard shortcut, but it has become more of a pain, especially on landing pages.
In my entirely casual understanding of English, "most" means the set that has more members than any other. When the comparison is binary (sites that work vs sites that don't), then "more than half" is both necessary and sufficient as a definition.
When comparing more than two options most could be significantly less than half (e.g. if I have two red balls, and one ball each of blue, purple, green, orange, pink, and yellow, then the color I have the most of is red, despite representing only one quarter of the total balls.)
That said, any attribute attaining more than half of the pie must be most.
In retrospect, the never in my previous comment was certainly an overstatement. While I agree with your reasoning, there is often a distinction between technically correct use of language, and what the hearer is likely to understand from what is said.
The other kind of problem is if the website is not really proxied but rather dumped, patched, and re-served. In that case the only option (if the JavaScript frontend redirect doesn't work) is blocking the dumping server by IP.
To identify the IPs, as pointed out in the root comment of this thread, you can create a one-pixel link to a dummy page, which dumping software would visit but a human wouldn't. Then you can see who visited that specific page and block those IPs for good.
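A server-side sketch of that trap, assuming a Node server and an in-memory blocklist (a real setup would persist the list or push rules to the firewall; the honeypot path is a made-up example):

```typescript
// honeypot-trap.ts - any client that requests the hidden honeypot path
// gets its IP recorded, and all of its further requests are refused.
import { createServer } from "http";

const HONEYPOT_PATH = "/assets/pixel-9f3a.html"; // hypothetical hidden link target
const blocked = new Set<string>();

createServer((req, res) => {
  const ip = req.socket.remoteAddress ?? "";
  if (req.url === HONEYPOT_PATH) {
    blocked.add(ip); // only a scraper would follow the invisible link
    console.log(`Blocked ${ip}`);
  }
  if (blocked.has(ip)) {
    res.writeHead(403);
    res.end();
    return;
  }
  res.writeHead(200, { "Content-Type": "text/html" });
  res.end("<h1>Normal page</h1>"); // real content would go here
}).listen(8080);
```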
I would think you'd want to be careful about search engines with that approach. Assuming the OP wants their site indexed, you could end up unintentionally blocking crawlers.
Yeah, if they're already rewriting content to serve ads (likely, since they're probably not doing this for altruistic reasons), you're just putting off the inevitable. While blocking or CAPTCHA'ing source IPs is also a cat-and-mouse game, it's much more effective over a longer period of time.
Maybe an HTML <meta> redirect tag that bounces through a tertiary domain before redirecting to your real one? If they noticed you were doing it they could mitigate it, but they might deem it too much effort and just go away.
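A sketch of that bounce page, served from the hypothetical decoy domain (the meta tag is standard HTML, wrapped here in a tiny Node handler):

```typescript
// bounce.ts - the tertiary/decoy domain serves nothing but an instant
// meta-refresh that forwards visitors on to the real site.
import { createServer } from "http";

const REAL_SITE = "https://example.com"; // assumption: your canonical domain

createServer((_req, res) => {
  res.writeHead(200, { "Content-Type": "text/html" });
  res.end(
    `<!doctype html><meta http-equiv="refresh" content="0; url=${REAL_SITE}">`
  );
}).listen(8080);
```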
You might also start with the hypothesis that they're using regex for JS removal and try various script injection tricks...
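For instance, here is a toy illustration of why regex-based stripping is fragile (the pattern below is a guess at what such a proxy might run, not anything confirmed):

```typescript
// regex-evasion.ts - a naive tag-stripping regex misses JavaScript
// that never appears inside a <script> tag at all.
const naiveStrip = (html: string) =>
  html.replace(/<script[\s\S]*?<\/script>/gi, "");

// Caught: an ordinary script tag.
console.log(naiveStrip(`<script>alert(1)</script>`)); // ""

// Missed: JavaScript in an event-handler attribute.
console.log(naiveStrip(`<img src="x" onerror="alert(1)">`)); // unchanged
```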