Ask YC: How come spammers are not attacking YC News?
27 points by adityakothadiya on Oct 31, 2008 | 19 comments
I run a niche social news site part-time, and it has a very small community. It's growing slowly, but I'm fine with that. What I'm worried about is how to stop spammers from submitting irrelevant stories to my site.

Adding a CAPTCHA is one option, but then I noticed that YC News doesn't have any CAPTCHA protection either. So how come spammers don't submit advertising and other irrelevant links to YC News?

Does the YC News algorithm detect such links? Is there any manual intervention? Or is it just that the community is so good that nobody attacks it?

In any case, your input on how I can tackle this situation would be very helpful. Currently I manually go and delete all the irrelevant submissions (there are at least 5-10 such submissions daily).

-Aditya




They are. We currently get about 30-40 spam submissions a day. Turn on showdead in your profile and you'll see it all. The reason we don't get more is that we're very aggressive about killing spam. Most spammers give up eventually when they realize that submitting here generates near-zero traffic.


Thanks PG for your advice.

BTW, when you say "aggressive about killing spam", you mean killing it manually, right?


Some spam gets killed automatically. Some gets flagged, either by filters or by users (there is a flag button on stories after you get over a certain karma), and killed manually by editors. It's very rare now for a spam submission not to at least get flagged.
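Roughly, a flag-and-review workflow like the one described here might look like the Python sketch below. This is not HN's actual code; the karma and flag thresholds and the Story/User types are made up purely for illustration.

    from dataclasses import dataclass, field

    FLAG_KARMA_THRESHOLD = 30   # assumed: only established users see the flag button
    AUTO_REVIEW_FLAGS = 5       # assumed: this many flags sends a story to the editors

    @dataclass
    class User:
        name: str
        karma: int

    @dataclass
    class Story:
        url: str
        flags: set = field(default_factory=set)
        dead: bool = False       # dead stories are hidden unless "showdead" is on

    editor_queue = []            # stories awaiting a manual kill/approve decision

    def flag(user: User, story: Story) -> None:
        """Record a flag from a sufficiently trusted user and escalate if needed."""
        if user.karma < FLAG_KARMA_THRESHOLD:
            return                                   # new users don't get the button
        story.flags.add(user.name)
        if len(story.flags) >= AUTO_REVIEW_FLAGS and story not in editor_queue:
            editor_queue.append(story)               # an editor then kills it manually

    def editor_kill(story: Story) -> None:
        story.dead = True                            # visible only with showdead enabled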


I think there are algorithms in place that auto-flag spammy submissions to bring them to the attention of moderators more quickly, but as far as I know, actual deletions are manual.


30-40 a day? The OP has it right then. Reddit probably gets 30-40 a second.


Pretty much every site with user-generated content is overwhelmed with people trying to post spam. For my site, I use a combination of JavaScript human detection, Bayesian filtering, and aggressive human intervention (including single-click "spam this" links on every piece of content when logged in as an admin).

It's worth noting that since late 2007, a significant portion of comment spam has been human-powered. CAPTCHA-style bot filtering doesn't work against it, since it's not bots doing the posting. Bayesian filtering and good moderation tools are essential these days.
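For what it's worth, a bare-bones version of the kind of Bayesian word scoring I mean might look like the Python sketch below. It's not my actual filter; the tokenizer, the smoothing, and the 0.9 hold-for-review threshold are all assumptions for illustration.

    import math
    import re
    from collections import Counter

    def tokens(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(spam_docs, ham_docs):
        """Count word occurrences in known-spam and known-good submissions."""
        spam, ham = Counter(), Counter()
        for doc in spam_docs:
            spam.update(tokens(doc))
        for doc in ham_docs:
            ham.update(tokens(doc))
        return spam, ham, len(spam_docs), len(ham_docs)

    def spam_probability(text, spam, ham, n_spam, n_ham):
        """Naive-Bayes-style score in [0, 1]; Laplace smoothing keeps unseen words harmless."""
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        spam_total, ham_total = sum(spam.values()), sum(ham.values())
        for word in set(tokens(text)):
            p_spam = (spam[word] + 1) / (spam_total + 2)
            p_ham = (ham[word] + 1) / (ham_total + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(min(log_odds, 50.0), -50.0)   # keep exp() from overflowing
        return 1 / (1 + math.exp(-log_odds))

    # Usage: anything scoring above ~0.9 goes to a moderation queue instead of straight up.
    # spam, ham, ns, nh = train(known_spam, known_good)
    # if spam_probability(submission, spam, ham, ns, nh) > 0.9:
    #     hold_for_review(submission)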


Sounds like a good business opportunity.


The sites most vulnerable to spam are ones that a) have a critical mass of readership, especially dumb readership that will click on ridiculous spam links, and b) are not run by people who are active contributors to the field of spam filtering.


I can't see how not being an active contributor to the field of spam filtering makes your site vulnerable to spammers. Not being vigilant against spammers, yes, but you don't need to be in the industry to combat this problem. The dumb readership comment speaks for itself; let's get off the high horse, bro.


I think the point was merely that pg has spent a lot of time thinking about the problem of spam (it's one of the things he's famous for), and he also wrote the software that runs HN; when those two facts combine, you end up with software that has many mechanisms for automatically preventing spam. Being involved in the fight against spam means his site probably makes use of more cutting-edge techniques than sites built by folks who have never dealt with spam before. HN probably also has a much higher editor-to-submitter ratio than most sites, so a human with the privileges needed to kill spam usually sees it long before it hits the front page.


They're all afraid of Paul Graham.


I think a lot of it also has to do with the community itself. A site of this size would probably get 300-400 spam messages a day if it weren't for the fact that its audience would see right through them. Tech people are so conscious of spam that they ignore it on principle, which makes spamming a tech site pointless.

As for suggestions...

1. Obviously, CAPTCHA. It just makes sense.

2. I find keyword blocking very effective. For example, if I were running Hacker News I'd block any news item containing the word Viagra that was submitted by a user under a certain feedback level (like no feedback at all). With one caveat, which is to give them a way to manually verify it (say, an email sent to them that lets them verify they are an actual person and have the item approved).

3. Use email spam blocklists. Lists like the SBL, CBL, and XBL give IP addresses that generate massive amounts of email spam. Many of those same IP addresses generate web spam. (A rough sketch of points 2 and 3 follows below.)

4. I've never been a fan of this particular method, because I think it's discriminatory to an extent I'm uncomfortable with, but many places have special requirements for countries that are famous for spam generation (Russia, China, etc.), like making users from those IPs jump through special registration hoops.
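Here's a rough Python sketch of points 2 and 3: keyword blocking for low-reputation users plus a DNS blocklist lookup. The keyword list, the karma threshold, and the choice of zen.spamhaus.org as the zone are assumptions for illustration, not a complete policy.

    import socket

    SPAM_KEYWORDS = {"viagra", "cialis", "casino"}    # assumed high-certainty terms
    MIN_TRUSTED_KARMA = 10                            # assumed feedback threshold

    def hits_keyword_block(text, user_karma):
        """Only low-reputation users are subject to the keyword check."""
        if user_karma >= MIN_TRUSTED_KARMA:
            return False
        return bool(set(text.lower().split()) & SPAM_KEYWORDS)

    def listed_in_dnsbl(ip, zone="zen.spamhaus.org"):
        """True if an IPv4 address is listed in the given DNS blocklist."""
        reversed_ip = ".".join(reversed(ip.split(".")))
        try:
            socket.gethostbyname(f"{reversed_ip}.{zone}")
            return True                               # any A record means "listed"
        except socket.gaierror:
            return False                              # NXDOMAIN: not listed

    # A submission that trips either check gets held, and the user is emailed a
    # verification link (the caveat in point 2) rather than being silently dropped.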

Hope it helps!


I don't see how captchas "just make sense", especially in the most common image-based incarnation. I have worked with visually impaired people, and the most popular request was always "I want to do something on this website, but they have a captcha I can't see (and occasionally an audio captcha that makes no sense); can you sign me up/comment for me/do whatever task?"

As a sighted person, I've even run across captchas that were impossible to decipher, both from some third-party solutions and from something like reCAPTCHA, the latter of which bothers me to no end because sometimes both words are ambiguous.

Whether or not they make sense depends on your audience, your site, and your implementation.


Point 2 sounds powerful, but it would make the submission process less simple and maybe less user-friendly for new users.


Well, you'd only do it for words that are almost certainly spam, like Viagra or male impotence or... well, you get the picture. It works on the theory that "this word would almost never be used legitimately in a post, so it's almost certainly spam."

I use this on my mail server and with 200 users I've yet to ever get a false positive.


> I use this on my mail server and with 200 users I've yet to ever get a false positive.

How do you know? I don't see how you would measure that; if you can figure out that something is a false positive, you have discovered a better filter. You might get user complaints, but the absence of user complaints doesn't prove you have no false positives. (Although the presence of user complaints could prove that you do.)

Also: The assertion that everything to do with viagra is spam makes it very difficult to have a discussion about viagra or spam. For example, this posting would be rejected.


If you read my initial post, I said specifically that it can't be just a flat-out block. What you do is hold it and send an email to the person who posted it, asking them to verify they are an actual person.

That's both why it works even if you want to discuss viagra and how you can tell if you are getting too many false positives.
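A minimal Python sketch of that hold-and-verify flow is below; the mailer, the token store, and the example.com URL are placeholders rather than anything from an actual system.

    import secrets

    held_posts = {}   # token -> post awaiting human verification
    approved = []     # posts confirmed by a real person

    def hold_and_notify(post, author_email, send_email):
        """Hold a suspect post and mail the author a one-click confirmation link."""
        token = secrets.token_urlsafe(16)
        held_posts[token] = post
        link = f"https://example.com/verify/{token}"   # placeholder URL
        send_email(author_email, f"Confirm you're a real person to publish your post: {link}")

    def verify(token):
        """Called when the link is clicked; confirmations also double as a false-positive count."""
        post = held_posts.pop(token, None)
        if post is None:
            return False
        approved.append(post)
        return True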


Thanks a bunch for your excellent suggestions. I'll look into them.


We have a problem with comment spam on our site (a news and prediction market site, using Drupal). We introduced CAPTCHAs and activated nofollow, all to no avail: there are some very persistent spammers who will still go through the trouble of entering captchas just to have their stupid links show up at the bottom of comment threads. It's not a huge issue, but it's definitely an irritant and an added cost in terms of the staff time required to clear it out.





