Probably just takes tweets, canonicalizes them, and then hashes them based on a 26-length vector of character counts. For every new tweet, it looks for old tweets with the same character count.
I'm a bit surprised that there are anagrams to be found. It's easy to find them if they exist, but there's no guarantee at all that there actually should be collisions.
Anagrams are just sentences with the same letter counts. The anagrams they're posting have 25ish letters... how many ways are there to distribute 25 balls into 26 bins? (25+26)!/25!/26! is ~250 trillion. The birthday paradox square roots that down to ~10 million, and the fact that we prefer some bins (fewer Zs, more Es) probably cuts it down even further to ~1 million.
So one anagram per million short tweets; hundreds per day. Doesn't seem too unreasonable.
> What's surprising to me is the niceness of the found anagrams.
That's because they are manually curated [0]
Q: Is this manually curated?
A: Mostly for issues of volume ( there are a lot of variations
of 'goooood mooornnniinng!', there are a lot of spam bots
posting subtely different versions of the same message, etc)
the bot doesn't automatically post every anagram it finds.
Essentially there's an iphone client that reviews matches,
which are manually approved or rejected.
It's actually extremely likely. The chance that any two statuses are anagrams is miniscule, and even the chance that a particular status has an anagram among all other statuses is probably small, but the chances that there are no collisions at all is tiny.
See a description of the Birthday Paradox[1] for the mathematics behind this. For example, if you put 70 people in a room, there is a 99.9% chance that two people share a Birthday.
I find it interesting that they're manually approving the hits, because, as they indicate, most hits are (nearly) identical.
It shouldn't be too difficult to solve this automatically though. Identical hits can be discarded very easily. The ones that only have a few words or letters reversed can be detected with some kind of similarity algorithm.
I had a look at the source code, and it does quite a bit of filtering, particularly around making sure the words are unique, and there is a primitive character comparison algorithm.
The code could be simplified by using Python's set() and improved by doing a copy'n'paste on a Levenshtein function.
hey, author here: It's english only mostly because of volume, and because I review results. Making a french language version would mostly just be a matter of hosting. If you're interested let me know, I'd love to help you out.
This bot is pretty funny.
Anyone know how it works? I'm assuming it just sorts the string and puts it in a hashmap/table and looks for collisions.