Trolling the search engines (dam.io)
98 points by Buetol on March 16, 2014 | 30 comments



This is a common grey/black-hat SEO trick: basically, create URL text that matches as many phrases as possible. As demonstrated, it's easy to do with pretty simple CGI scripts and a bit of link seeding (rough sketch at the end of this comment).

The theory behind it is that if your site has a lot of relevant "targets", then it must be more important than a site that has only a few. (Consider Wikipedia the poster child for this.)

When people write naive web spiders in an attempt to create their own crawler, sites like this 'trap' them in an infinite web of apparently unique links. Always a good idea to stop after a few hundred pages and kick it back to a human to see what's up with that :-)
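
For illustration, here's my own toy version of such a trap (not the author's code, and a tiny WSGI app rather than an actual CGI script; the word list and port are made up):

    # Toy spider trap: whatever path is requested, respond with a page full of
    # links to more made-up paths, so a naive crawler never runs out of "new"
    # URLs. Sketch only, not the author's code.
    import random
    from wsgiref.simple_server import make_server

    WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

    def app(environ, start_response):
        links = "".join(
            '<a href="/%s-%s-%d">%s %s</a> '
            % (random.choice(WORDS), random.choice(WORDS),
               random.randrange(10 ** 6),
               random.choice(WORDS), random.choice(WORDS))
            for _ in range(25)
        )
        body = ("<html><body><h1>%s</h1><p>%s</p></body></html>"
                % (environ.get("PATH_INFO", "/"), links)).encode()
        start_response("200 OK", [("Content-Type", "text/html")])
        return [body]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()

Link it from one seeded page somewhere and the crawlers do the rest.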


> Can’t believe that: No one did it before me (to my knowledge)

This has certainly been done before: http://en.wikipedia.org/wiki/Spider_trap


The Wikipedia article doesn't mention how long this practice has existed, but I know it goes back to at least the late 90s.

Circa '98, IIRC, there was a module available for a webserver I worked with that generated pages with a number of bogus email addresses each, plus a number of random URLs per page that, when followed, generated yet another page of bogus email addresses and links.

You hid the link somewhere on a legitimate page, added the base path as an exclude in robots.txt (snippet at the end of this comment), and any mail-harvesting spam bots would get sucked in.

The idea may well have been around longer still.
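
The robots.txt part looked something along these lines (the trap path here is made up):

    # robots.txt -- well-behaved crawlers stay out of the trap;
    # the harvesters you actually want to catch ignore it
    User-agent: *
    Disallow: /trap/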


Thanks, corrected it


I had the same "fun" idea 6 years ago. In 3 months Yahoo Site Explorer reported 150 million pages indexed. As of today Google still shows 55 million pages (in comparison, https://www.google.com/search?q=site:http://en.wikipedia.org... reports 34 million pages).

I had to kill the experiment (no more new "pages" crawled) because of the CPU load and bandwidth costs, even with the robots throttled.


This is nothing new; anyone in the SEO world knows that this has been possible for years. It's not particularly long-lasting as a strategy, but it can be leveraged for things like link spam via massive amounts of cloaking sites.


I concur it is known. However, it is still impressive that Google does not have a counter-measure against it. He says they indexed 140k pages of it. Phew. Consider having a few thousand of these pots set up and cross-linking to each other left and right: they'll be crippled... I sense a bug somewhere...


I think you're overestimating Google's capabilities a bit. How are they supposed to detect that the content is generated on the fly? If it's unique enough at generation time, they really can't tell the difference between this site and a site with legitimate content on it.

This is the core of why Google uses things like links as a ranking signal. If they based it entirely on the content of the site, they'd be easily duped. While it's somewhat trivial to manipulate your backlink profile to increase your rankings, it's rather hard to fake links from known high-quality sites like the New York Times, for instance. So they can "trust" links more than they can trust the content of the site they're crawling (toy sketch at the end of this comment).

So even though this site has a large amount of content indexed, the chances of it ranking for anything more than gibberish are so low that it doesn't even matter.
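
To make the "trust flows along links" idea concrete, here's a toy power-iteration PageRank over a made-up three-site graph. It has nothing to do with Google's actual implementation; the names and numbers are just for illustration:

    # Toy PageRank: rank flows along links, so a page only scores well if
    # already-ranked pages point at it. Graph and damping factor are made up.
    graph = {
        "nytimes": ["blog"],
        "blog": ["nytimes"],
        "spamfarm": ["nytimes"],  # links out aggressively, but nobody links back
    }
    d = 0.85                      # usual damping factor
    rank = {p: 1.0 / len(graph) for p in graph}
    for _ in range(50):           # power iteration
        new_rank = {}
        for p in graph:
            inbound = sum(rank[q] / len(graph[q]) for q in graph if p in graph[q])
            new_rank[p] = (1 - d) / len(graph) + d * inbound
        rank = new_rank
    print(rank)  # "spamfarm" bottoms out at the teleport term: no inbound trust

However much content "spamfarm" generates, its score stays pinned to the baseline until trusted sites start linking to it.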


I tested this a while ago by auto-generating websites with millions of keywords. Surprisingly, I was able to get the websites ranked for many long tail keywords and it started to bring over 100,000 visitors per day.

It didn't last very long, though: Google penalized the websites after a couple of months. I did it for testing, but it could be an effective strategy for spammers, who could rinse, repeat and scale.

There are many websites like that still ranking and generating traffic; some of them have Alexa rank below 1000.


I maintain web crawlers for a large internet portal, and this stuff is enough of a pain to deal with when people unintentionally make their websites recursive, much less INTENTIONALLY. *facepalm* At least Google's crawler had issues as well. Thanks a lot, troll. ;P


You need to update those crawlers to enforce some hard limits on crawl depth, methinks, especially for low-PageRank domains (if there's some easy way to estimate PageRank, that is).
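
Roughly what I have in mind, a sketch with made-up limits rather than production code:

    # BFS crawler that gives up on a domain after a fixed page budget or link
    # depth, so an infinite generated site can't hold it forever.
    from collections import deque
    from urllib.parse import urlparse

    MAX_PAGES_PER_DOMAIN = 500   # arbitrary budget
    MAX_DEPTH = 10               # arbitrary depth cap

    def crawl(start_url, fetch_links):
        """fetch_links(url) -> list of absolute URLs found on that page."""
        seen, per_domain = set(), {}
        queue = deque([(start_url, 0)])
        while queue:
            url, depth = queue.popleft()
            if url in seen or depth > MAX_DEPTH:
                continue
            domain = urlparse(url).netloc
            if per_domain.get(domain, 0) >= MAX_PAGES_PER_DOMAIN:
                continue  # budget blown: kick the domain back to a human
            seen.add(url)
            per_domain[domain] = per_domain.get(domain, 0) + 1
            for link in fetch_links(url):
                queue.append((link, depth + 1))
        return seen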


Creating an "infinite" website sounds kind of cool. You could generate all the content with a good language model. Then have humans correct where it makes mistakes. Maybe upvote or downvote pages that they like.


When you type in certain query strings, the backend just fails, for example:

http://inf.demos.dam.io/sdfsdfs

I have yet to find a pattern, but I guess the author applied some sort of hashing to the query string and used the hash code as the seed to generate random text. For certain strings, the failure probably has to do with how the hash code is used (e.g. use [some_variable_or_constant_here]/[hashcode - CONSTANT]; when hashcode == CONSTANT, the backend has no exception handling for it). Just a guess.
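
The deterministic part of that guess would look something like this (pure speculation on my part; none of these names come from the actual site):

    # Speculative sketch: hash the query string and use the hash as an RNG
    # seed, so the same URL always renders the same "random" page. If the
    # backend also divides by something like (hashcode - CONSTANT) somewhere,
    # that would explain why a few specific strings make it fail.
    import hashlib
    import random

    WORDS = ["recursive", "content", "generator", "page", "link"]

    def text_for(query_string, n_words=50):
        seed = int(hashlib.md5(query_string.encode()).hexdigest(), 16)
        rng = random.Random(seed)
        return " ".join(rng.choice(WORDS) for _ in range(n_words))

    print(text_for("sdfsdfs") == text_for("sdfsdfs"))  # True: stable per URL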


Makes sense; otherwise this would be a very easy way for spiders to realize they're in a trap. I'm pretty sure Google tests bogus query strings to see what the 404 page looks like.


Some legit websites, instead of displaying a 404 error, redirect (3xx) to a page like the main page or the about page.


Or return 200 with an error message. Something Google would want to know in any case.
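
Checking for that from the crawler side is cheap; something like this (my sketch, obviously not Google's actual heuristic):

    # Soft-404 probe: request a path that almost certainly doesn't exist and
    # see whether the site answers with an error status or happily serves 200
    # (urlopen follows redirects, so a 3xx to the homepage also ends up here).
    import uuid
    import urllib.error
    import urllib.request

    def probe_404_behaviour(base_url):
        bogus = base_url.rstrip("/") + "/" + uuid.uuid4().hex
        req = urllib.request.Request(bogus, headers={"User-Agent": "probe/0.1"})
        try:
            resp = urllib.request.urlopen(req, timeout=10)
            return "soft 404: bogus URL returned %d" % resp.getcode()
        except urllib.error.HTTPError as e:
            return "hard error status: %d" % e.code

    print(probe_404_behaviour("http://example.com"))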


As the rest have said, this is old news. Mass indexing is a simple hack in search, especially for nonsensical content abusing non-standard terms.

If you want to impress us, rank a few 100k pages for competitive terms with garbage content.


One fun implementation of this is directory.io: a site listing every single possible Bitcoin private key and its corresponding address. The theory is that eventually, millions of years from now, Google will have indexed all 10^74 pages and you can just google for a bitcoin address to steal its balance.

http://directory.io
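
The arithmetic behind a site like that is trivial; something like this maps a page number to the keys it shows (keys-per-page is a guess, and the private-key-to-address step, secp256k1 point multiplication plus hashing and Base58Check, is omitted):

    # Rough idea of how such a site can "have" that many pages without storing
    # anything: page number -> a deterministic slice of the private key space.
    KEYS_PER_PAGE = 128   # assumption
    KEYSPACE = 2 ** 256   # just above the real secp256k1 group order

    def keys_on_page(page_number):
        start = (page_number - 1) * KEYS_PER_PAGE + 1
        stop = min(start + KEYS_PER_PAGE, KEYSPACE)
        return ["%064x" % k for k in range(start, stop)]

    print(keys_on_page(1)[:3])        # first few "private keys" on page 1
    print(KEYSPACE // KEYS_PER_PAGE)  # roughly 10^74-10^75 pages in total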



I'm pretty sure this is done on a different scale for the sites that fake data to get traffic.


Just looking at the other results in Google: bestwordlist, zyzzyva, and other generated websites.


There's a whole industry in Eastern Europe that does just this and resells the link space. It's nothing new, but don't be surprised when your whole domain gets deindexed.


I am slow. Where is the recursion? There are a lot of hyperlinks on the page, but I can't see the recursion. I am trying to find duplicates.


Yeah, it will probably be blacklisted soon.


How much traffic in visits?


I put the access logs at the bottom of the post, so you can calculate them.

EDIT: Installed AWStats, just wait an hour: http://stats.demos.dam.io/

EDIT2: Can't manage to make AWStats parse the old logs...


For those interested, I did look at it a bit. Before I touched it, the log was ~116k lines. After I removed everything matching "bot" or "spider" (but not "google" or "mozilla" or anything that might be used by real user agents; and yes, I am aware Chrome doesn't use "google" and "mozilla" is in 99% of those strings, I just couldn't give better examples), it went down to ~1k. I scrolled through, and most of those requests came from 216.151.137.36, so I removed that too, which left ~500 that weren't obviously coming from bots. At that point I stopped, since the question I wondered about was "did this site get a significant amount of real visits?", and the answer was clearly no, or if any, fewer than 500.
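
Roughly what I did, in case anyone wants to redo it properly on the full logs (quick and dirty; the log path is a placeholder and it assumes the client IP comes first on each line):

    # Drop lines whose user agent contains "bot" or "spider" (case-insensitive),
    # drop the one noisy IP, and count what's left.
    BOT_MARKERS = ("bot", "spider")
    NOISY_IP = "216.151.137.36"

    kept = []
    with open("access.log", encoding="utf-8", errors="replace") as f:
        for line in f:
            lower = line.lower()
            if any(marker in lower for marker in BOT_MARKERS):
                continue
            if line.startswith(NOISY_IP):
                continue
            kept.append(line)

    print(len(kept), "requests not obviously from bots")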

Btw, does anyone know who this IP (216.151.137.36) belongs to? All I get is spam reports when I google it, and it doesn't seem to belong to any (major) spider. http://stopforumspam.com/ipcheck/216.151.137.36

Also, another question: how does Google deal with web apps that have unlimited pages (dynamic URLs based on GET parameters and so on)? As in, how does it decide "this is legit" and this site here is not? Sure, backlinks are one thing, but those can be "faked" too.


> Btw, does anyone know who this IP (216.151.137.36) belongs to

FWIW, I see this IP hitting my spider trap too. 14 requests in 30 seconds on March 14th, and 21 requests in 45 seconds on Feb 20th.


$ whois 216.151.137.36

http://pastebin.com/XJawtmnw


Wouldn't it be more effective to use your target keywords as the word list and create dozens or hundreds of slightly relevant pages vs. 148k?

Am I wrong in thinking that might actually move SERPs?



