However, I wonder, shouldn't Twitter be able to pick these messages up automatically fairly fast, after (I assume) hundreds if not thousands of users have flagged them?
Also, the spammers can't have unlimited IPs. Twitter's anti-spam kinda seems to lag behind email's (subjectively).
Is there a reason the same techniques used in E-Mail aren't applicable to Twitter?
> Is there a reason the same techniques used in E-Mail aren't applicable to Twitter?
Twitter relies on very low latency - i.e., once you tweet something, if it takes a whole minute to appear in your friends' timelines, it could already have lost much of its value.
Lots of spam reduction techniques introduce latency to levels that are unacceptable to Twitter's use case.
I'm not sure why it didn't catch these, but I can imagine why the same techniques aren't applicable in general.
I doubt that's even true, because Twitter's tail latency on API calls is in the hundreds of seconds. They do not operate a low-latency service (and they don't operate a high-volume service either, by my standards).
I think the real problem is not latency, but simply they don't have the signals needed to differentiate spam from not.
Speaking as a user, when someone sends me a reply, I expect to see it in seconds via the streaming API so I can have a short IM-style conversation, and I consider this a nice benefit of using Twitter (rather than, say, email), if not totally essential. Hundreds of seconds in the tail sounds bad, but if that were the norm, it would be a serious problem.
Sure, but couldn't they run something that cleans up already posted tweets? That wouldn't introduce any latency while posting but would still (eventually) get rid of them automatically and hopefully pretty fast.
If my Twitter account were compromised and used to send spam, I think the way I'd discover this is by seeing the record of tweets sent by the false "me". If those are being culled, then how will I know that my account has been compromised?
So I can't speak for twitter, but I work on anti-spam at Facebook, and imagine the problems we face are relatively similar. It's worth noting that there's a constant barrage of people trying to send varying degrees of spam. It's not like there's An Attack all of a Sudden - just occasionally people close to the HN social network happen to be targeted by something and it's magnified by the media / hive mind local to us.
> shouldn't Twitter be able to pick these messages up automatically fairly fast
Theoretically, sure. As a human looking at an attack, it's usually pretty easy to pick out "obvious" attributes that they should have been able to catch. But when you're operating at a scale like ours or Twitter's, even stuff that looks obviously-indicative-of-badness often has false positives (posts flagged as spam that are not). The long tail of weird stuff that a billion users do can be pretty crazy.
At the same time, the "obvious" attributes of an attack are often very cheap for an attacker to change. Instead, we try to go after more expensive resources (domains, source IPs, etc).
> after (I assume) hundreds if not thousands of users have flagged them
Sadly, looking at flags of content is not a silver bullet. The signal is very sparse (a given spam post is rarely flagged), and non-spam posts are frequently flagged (religious and political speech are great examples - and they are the worst kind of false positive if you delete them as spam). These problems can be somewhat mitigated if you aggregate flags over a dimension that's expensive for the attacker (the posted domain, the IP that posted the content, text shingles), but even then the recall isn't necessarily great, and you could still catch e.g. controversial political domains.
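To make the shingle idea concrete: here's a minimal sketch (not Facebook's actual system - all names and thresholds are made up) of pooling flags over word shingles, so that lightly-reworded variants of the same spam text accumulate shared evidence.

```python
from collections import defaultdict

def shingles(text, w=4):
    """Return the set of w-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

class FlagAggregator:
    """Pool user flags across near-duplicate posts via shared shingles."""
    def __init__(self):
        self.flags_per_shingle = defaultdict(int)

    def record_flag(self, post_text):
        for s in shingles(post_text):
            self.flags_per_shingle[s] += 1

    def score(self, post_text):
        # A post inherits the maximum flag count seen on any of its
        # shingles, so reworded variants of the same spam pool evidence.
        return max((self.flags_per_shingle[s] for s in shingles(post_text)),
                   default=0)

agg = FlagAggregator()
agg.record_flag("win a free iphone click here now today")
agg.record_flag("win a free iphone click here now friends")
print(agg.score("hello win a free iphone click here now"))  # prints 2
```

The third post shares the shingle "win a free iphone" with both flagged variants, so it scores 2 even though it was never flagged itself - which is exactly why recall improves but politically controversial text that many users flag can still get swept up.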
> the spammers can't have unlimited IPs
True, though you can rent space on a botnet that has many geographically diverse, real-user IPs. Also, I imagine a significant chunk of posts to Twitter come from apps, many of which each use a single IP to post tons of content.
> Is there a reason the same techniques used in E-Mail aren't applicable to Twitter?
There's definitely some overlap. I'm not an expert at email anti-spam, but in general it's a relatively different problem. "Traditional" email spam is sent from some random email address on/via a compromised machine or open relay, and seems to be relatively well solved. But it sounds like this Twitter attack was driven by compromised accounts, and at least anecdotally, email vendors are also not great at detecting that kind of attack. For example, my Gmail account (with arguably the best spam protection in the industry?) gets a message every few weeks from some compromised friend's account - i.e., someone had their email password stolen, and the attacker is using it to "legitimately" send mail after authenticating to that email service with the correct password.
Could you not identify higher-than-normal viral scores and run some automatic checks on the link's content to look for dodgy behavior (i.e., pages that execute like/share actions)? That would still leave you in a cat-and-mouse game of obscured dodginess, but it's a start.
Have you guys looked into the sharing likelihoods of affected users? In other words, I rarely share stuff I click on. If a higher-than-normal number of high-view, low-share users like myself are sharing something, it's either extremely popular or it's spam. (Worth flagging for a manual check.)
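The heuristic I have in mind could be sketched like this (purely illustrative - the function name and the "shares per click" baseline are my own invention, not any platform's real signal):

```python
def share_anomaly_score(sharer_baselines):
    """sharer_baselines: historical share rates (shares / clicks) of the
    users who shared a given link.

    Every user in this list *did* share the link (observed rate 1.0), so
    the gap between 1.0 and their mean historical rate measures how
    surprising the sharing behavior is. A link shared mostly by habitual
    non-sharers scores near 1.0 and is worth a manual check.
    """
    if not sharer_baselines:
        return 0.0
    expected = sum(sharer_baselines) / len(sharer_baselines)
    return 1.0 - expected

# Habitual non-sharers suddenly all sharing one link -> high anomaly score
print(share_anomaly_score([0.01, 0.02, 0.01]))
# Habitual sharers sharing yet another link -> low score
print(share_anomaly_score([0.8, 0.9, 0.7]))
```

A score like this couldn't delete content on its own (it can't tell spam from a genuinely viral hit, as the comment notes), but it could cheaply rank links for human review.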
Yes - I admin an FB group that deals with sockpuppeting spammers daily, and it took me a while to understand why the "report account" facility doesn't offer a "spam" option.
Perhaps you could weight the flags of users - e.g., users who have flagged non-spam content in the past count less, and people who have flagged lots of actual spam count more.
Yeah, we've played with that idea a bit. It doesn't help the sparseness problem (actually makes it worse), and if we took action as a direct result, it would give people the power to DoS content they disagree with.
For sites operating at a smaller scale, this could be a good way to surface content for manual review though.
Could you hire people just to review/flag stuff on a part-time basis, ala Mechanical Turk? Or is the problem just not big enough to warrant the time/money investment?
I imagine that while spam is annoying, it probably doesn't impact your bottom line in a big way.
I don't know. Part of the normal use case of Twitter is people mindlessly retweeting something or posting the exact same thing as everyone else. That's just trending stuff. With something thousands and thousands of people are sharing, it would take quite a few flagging it to raise any red flags.
Appears to have started on March 31st and has affected > 100K Tweets. It also appears to run between the hours of 2PM and 10PM PST, peaking at 7PM each day.