I'm in the middle of writing a web-based anagram server and there's some similarities in both these problems. You have a dictionary that you iterate through recursively to find a solution. For anagram, the solution is usually found with 3-5 recursions (100 word list, 3 recursions: 100^3 calculations - not a big deal for desktop software, too slow for web-app). For Sling Blade Runner, your goal is to keep increasing the depth of the recursion as much as possible so it's actually impossible to go through 100^200 calculations.
Does anyone have any idea if these problems could be better tackled by converting them to map-reduce problems? I'd like to play with Amazon's mapreduce service.
trie = {
'dracula': {'': None,
'dead': {'and': {'loving': {'it': "Dracula: Dead and Loving It!"}}}},
'the': {'brides': {'of': {'dracula': "The Brides of Dracula"}}},
# ...
}
Then, for each word in each movie title, walk the trie from the beginning until you run out of matches (no match), or run out of words (at least one match).
Your anagram server is similar, but since they're not ordered like movie titles, you have to walk your scrambled word(s) and only generate permutations that have a matching trie path.
It's also quite possible to do it for phrases, not just words.
You can cheat a little by representing your anagram as a 26-key multiset containing the number of times each letter appears (in Python, use a dict; in C, use an array of 26 integers).
For each letter that appears in your anagram multiset one or more times, make a copy of the original multiset, and recursively descend the Dictionary trie, removing each matching letter from a new copy of the multiset (they're tiny, so don't worry). For each complete word you find, add it to a list of result candidates, along with the multiset representing the remaining letters in the anagrammatic phrase.
Repeat the process for each result candidate, growing your list into a tree. Stop when you run out of matches, or when you run out of letters in your multisets. (You can set a maximum number of leftover letters to accept, if you're having trouble matching a phrase).
If you test the letters in alphabetical order, your result phrases will already be sorted.
> Then, for each word in each movie title, walk the trie from the beginning until you run out of matches (no match), or run out of words (at least one match).
I think that's the challenge ITA wants you to solve - recursing 300 levels deep into a list of 1000 movies will literally take forever. As you can see with the 2nd linked article, brute-force just doesn't work.
My original algo for the multi-word anagrams was nearly identical to what you described. The problem was when you entered a long phrase. You can form 8000 unique words from the letters in "William Jefferson Clinton." Even when I removed duplicates words with the same characters (art = tar = rat), it was still 7000+ long. Add in another 'r' or 's' to the original phrase and you're over 10,000 unique words. If I'm allowing my algorithm to make 5 word anagrams, that's 10^5 calculations for a single web query - absolutely impossible.
And even if I do magically return the 100k possible results in 0.1s, users can't parse through 100k results to find the ONE anagram that's funny or creative. I do have an idea of what I want to do. Maybe in a week or so, I'll have a demo working.
I think that's the challenge ITA wants you to solve - recursing 300 levels deep into a list of 1000 movies will literally take forever. As you can see with the 2nd linked article, brute-force just doesn't work.
You're not recursing 300 levels deep; you abandon a subpath as soon as it doesn't match.
As for the anagrams, that sounds about right (if you install an anagram generator like wordplay, you can see that it generates a similar amount of matches).
You may want to devise a compact representation for a partly calculated results list, and just show n per page, with "prev" and "next" links that encode the original phrase and the first or last phrase on the current page.
When the user clicks "next", you run the algorithm as usual, but at each step of the trie search, you only search letters that are greater or equal to the current letter in the last known phrase.
You may also train a classifier like CRM114 to recognize "funny" and "boring" phrases, and omit the boring ones (except in your "prev" and "next" links!)
What do you mean by anagram solver? Find anagrams for a given word? Isn't that trivial by indexing words by the sorted version of the word? E.g. {"aep" -> ["ape", "pea"], ...}
I'm trying to built a better multi-word anagram solver. If you input a 32 character string, you'll end up with hundreds of thousands of potential anagrams and most of them are pretty useless. For example, "Florence Nightingale" returns 52k results on:
And most of them are pretty lame e.g. "A Fleecer Tingling Hon", "A Fleecer Longing Hint" etc. It'll take hours before you come across "Flit on, cheering angel" - I want to make this process faster/easier.
It's a fun project! I built one several years ago, after playing around with a bunch of different approaches. I wound up with a human-guided tool that seems to work pretty well for coming up with interesting ones:
Ah, interesting! This problem is NP-complete even in if you don't specifically want anagrams that are interesting to a human.
Suppose that you only have a single letter alphabet (say "a"). Now the input string can only contain a's, so it can be represented by a number n. The dictionary also consists only of words of a's, so the dictionary can be represented as a set of numbers D a subset of the natural numbers. Now the problem of finding an anagram is finding a subset of A of D such that sum(A) = n. But this is the subset sum problem, which is NP-complete!
How are you determining the interestingness of an anagram?
I agree, this is compelling for a lot of the same reasons as the anagrams problem. I spent weeks playing with anagram code at one point, and like you I felt an immediate compulsion to start writing code for Sling Blade Runner. So far my (ruby) code just does the obvious: builds the graph connecting up the movies and then uses brute force to find all chains of length two, then three, etc. Bogs down at chain length of about 11. Added some quick and dirty optimizations and pruning heuristics, and I'm geting chains in the mid-hundreds. I think this approach can be extended further, but isn't particularly clever :-/.
You can use map-reduce on your search by partitioning up the starting keyspace alphabetically (26/num nodes), although with such an even partitioning the search will probably be unbalanced. But with some simple metrics on the starting letter frequency of words you could define a more balanced partitioning.
* http://www.itasoftware.com/careers/puzzle_archive.html?catid...
* http://stuffthathappens.com/blog/2007/10/03/sling-blade-runn...
I'm in the middle of writing a web-based anagram server and there's some similarities in both these problems. You have a dictionary that you iterate through recursively to find a solution. For anagram, the solution is usually found with 3-5 recursions (100 word list, 3 recursions: 100^3 calculations - not a big deal for desktop software, too slow for web-app). For Sling Blade Runner, your goal is to keep increasing the depth of the recursion as much as possible so it's actually impossible to go through 100^200 calculations.
Does anyone have any idea if these problems could be better tackled by converting them to map-reduce problems? I'd like to play with Amazon's mapreduce service.
Edit: Neat. I came across: http://blog.xebia.com/2009/07/02/thinking-mapreduce-with-had...