It may not have been something you wanted to do, but I think it is a really interesting problem, and I bet it has been rewarding both for the business and in a pure engineering sense.
In some ways, I think about what it must have been like to create a fake identity in a less connected age, and I wonder at how it will continue to evolve.
I recall some Doctorow novel in which the increasing sophistication of spam turned into an escalating arms race against our ability to distinguish authentic interactions from those that were staged or generated (general sock puppetry).
I am curious about additional signals and information. In addition to fingerprinting and collecting as much information as possible about each of their implicit touch points, did you find yourselves increasingly relying on more traditional manifestations of identity, reputation, etc.?
edit: Or I wonder about a discount for new sign-ups with a one-time Facebook scan & score type mechanism :D
Thanks again for sharing more information, good food for thought!
Agreed, I'd rather spend time building some cool features for developers that everyone can use instead of confronting scammers trying to steal someone else's credit cards, so it's definitely not the fight we chose.
As for traditional IP, domain and complaint reputation: it helps a lot to identify and block ignorant senders who use questionable techniques to get their recipient lists, but it's pretty useless for fighting phishers. You need to act immediately and automatically, and reputation takes time to aggregate.
Really helpful information, thanks for including this.
I've been dealing with some non-email spam recently, and after reading this I count myself lucky -- most of the stuff I see is SEO-related, tends to come from distinct IP ranges, and can be surfaced with some simple rules. I'm sure that as time goes on, the spammers will become more wily.
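For what it's worth, a "simple rule" of that kind can be as small as the sketch below. The ranges are placeholder documentation blocks (RFC 5737), not real spam sources, and `SUSPECT_RANGES` and `is_suspect` are names made up for illustration.

```python
import ipaddress

# Hypothetical ranges standing in for the ones you would pull out of your own logs.
SUSPECT_RANGES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def is_suspect(ip: str) -> bool:
    """Return True if the address falls inside any known-bad range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in SUSPECT_RANGES)


print(is_suspect("203.0.113.77"))
print(is_suspect("192.0.2.10"))
```

This only works while the senders stay in a few recognizable ranges, which is exactly the assumption that tends to break as they get more wily.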
Yep, we were trying not to disclose too much information on how we catch them; however, I agree that how we fight them deserves a separate post.
Some things to share:
* Naive approaches (hey, just plug in a spam filter) don't work in most cases, as spammers tune and create new content specifically for our service.
* Feedback (complaints) from customers is a great signal, but by the time you start receiving complaints it may be too late.
* Bounce-based metrics (invalid addresses) are a great signal.
* As we've found, there's no silver bullet: you have to collect as many signals as you can.
* Rule-based systems don't work, as the rules change every day; you have to put some learning in place.
* Domain blacklists are also not very effective, as spammers use hijacked domains or services providing free subdomains to avoid them.
* IP blacklists are not very effective either, as a lot of people now use cloud services that share the same NATed IP.
* A lot of customers don't really realize they are spammers - "Hey, we've paid money for this mailing list, it's all fair"
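
A minimal sketch of the "collect signals, then learn" idea from the list above. This is not the actual system described in this thread. It assumes only two signals (bounce rate and complaint rate) and a tiny hand-rolled logistic regression; every name (`SenderStats`, `signals`, `score`, `train`) and all of the toy numbers are made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class SenderStats:
    sent: int        # messages attempted
    bounced: int     # hard bounces (invalid addresses)
    complaints: int  # recipient spam complaints


def signals(s):
    """Turn raw counters into features: [bias, bounce rate, complaint rate]."""
    sent = max(s.sent, 1)  # avoid dividing by zero for brand-new senders
    return [1.0, s.bounced / sent, s.complaints / sent]


def score(weights, s):
    """Logistic score in (0, 1); higher means more spam-like."""
    z = sum(w * x for w, x in zip(weights, signals(s)))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, lr=0.5):
    """Plain stochastic gradient steps for logistic regression."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for stats, label in examples:
            p = score(weights, stats)
            x = signals(stats)
            weights = [w + lr * (label - p) * xi for w, xi in zip(weights, x)]
    return weights


# Toy labeled history (assumed data): 0 = legitimate sender, 1 = spammer.
examples = [
    (SenderStats(sent=1000, bounced=5, complaints=0), 0),
    (SenderStats(sent=1000, bounced=300, complaints=20), 1),
]
weights = train(examples)

clean = SenderStats(sent=500, bounced=3, complaints=0)
risky = SenderStats(sent=500, bounced=120, complaints=6)
print(score(weights, clean), score(weights, risky))
```

The point is that the weights come from labeled history rather than hand-written thresholds, so they can be retrained as behavior shifts. A real setup would need many more signals and far more data than this.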