Hacker News

Yesterday someone mentioned the Sling Blade Runner problem from their archives and I can't stop thinking about it.

* http://www.itasoftware.com/careers/puzzle_archive.html?catid...

* http://stuffthathappens.com/blog/2007/10/03/sling-blade-runn...

I'm in the middle of writing a web-based anagram server, and there are some similarities between these problems. You have a dictionary that you iterate through recursively to find a solution. For anagrams, the solution is usually found within 3-5 levels of recursion (a 100-word list at 3 levels means 100^3 combinations - not a big deal for desktop software, but too slow for a web app). For Sling Blade Runner, your goal is to keep increasing the recursion depth as much as possible, so it's actually impossible to grind through 100^200 combinations.

Does anyone have any idea whether these problems could be better tackled by converting them to map-reduce problems? I'd like to play with Amazon's MapReduce service.

Edit: Neat. I came across: http://blog.xebia.com/2009/07/02/thinking-mapreduce-with-had...




You're on the right track with the recursive dictionary search.

For the Sling Blade Runner problem, I would make a trie (http://en.wikipedia.org/wiki/Trie) out of hash tables, like this:

    trie = {
     'dracula': {'': "Dracula",
                 'dead': {'and': {'loving': {'it': {'': "Dracula: Dead and Loving It!"}}}}},
     'the': {'brides': {'of': {'dracula': {'': "The Brides of Dracula"}}}},
    # ...
    }
Then, for each word in each movie title, walk the trie from the beginning until you run out of matches (no match), or run out of words (at least one match).
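That walk can be sketched roughly like this (the helper names and the empty-string end-marker convention are my own, not the commenter's exact code):

```python
import string

END = ''  # assumption: an empty-string key marks where a stored title ends

def title_words(title):
    """Lowercase a title and strip punctuation from each word."""
    return [w.strip(string.punctuation).lower() for w in title.split()]

def build_trie(titles):
    trie = {}
    for title in titles:
        node = trie
        for word in title_words(title):
            node = node.setdefault(word, {})
        node[END] = title
    return trie

def walk(words, trie):
    """Yield titles found along `words`; stop at the first non-match."""
    node = trie
    for word in words:
        if word not in node:
            return                # ran out of matches
        node = node[word]
        if END in node:
            yield node[END]       # at least one match so far

titles = ["Dracula", "Dracula: Dead and Loving It", "The Brides of Dracula"]
trie = build_trie(titles)
found = list(walk(title_words("Dracula: Dead and Loving It"), trie))
```

Here `found` collects both "Dracula" and the full title, since the shorter title is passed on the way to the longer one.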

Your anagram server is similar, but since your letters aren't ordered the way the words in a movie title are, you have to walk your scrambled word(s) and generate only the permutations that have a matching trie path.

It's also quite possible to do it for phrases, not just words.

You can cheat a little by representing your anagram as a 26-key multiset containing the number of times each letter appears (in Python, use a dict; in C, use an array of 26 integers).

For each letter that appears one or more times in your anagram multiset, make a copy of the original multiset and recursively descend the dictionary trie, removing each matching letter from a new copy of the multiset (they're tiny, so don't worry). For each complete word you find, add it to a list of result candidates, along with the multiset representing the remaining letters of the anagrammatic phrase.

Repeat the process for each result candidate, growing your list into a tree. Stop when you run out of matches, or when you run out of letters in your multisets. (You can set a maximum number of leftover letters to accept, if you're having trouble matching a phrase).

If you test the letters in alphabetical order, your result phrases will already be sorted.
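A minimal sketch of that multiset-plus-trie descent, under my own conventions (a `Counter` for the multiset, `'$'` marking a complete word):

```python
from collections import Counter

END = '$'  # assumption: sentinel key marking a complete dictionary word

def build_letter_trie(words):
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = w
    return trie

def find_words(node, letters):
    """Yield (word, leftover multiset) for every dictionary word
    spellable from the letters remaining in `letters`."""
    if END in node:
        yield node[END], letters
    # testing letters in alphabetical order keeps the results sorted
    for ch in sorted(k for k in node if k != END):
        if letters[ch] > 0:
            remaining = letters.copy()   # the multisets are tiny
            remaining[ch] -= 1
            yield from find_words(node[ch], remaining)

trie = build_letter_trie(["art", "rat", "tar", "ton"])
found = [word for word, leftover in find_words(trie, Counter("rattan"))]
```

With the letters of "rattan", this finds "art", "rat", and "tar" (already in alphabetical order), but not "ton", since there's no 'o' in the multiset.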


> Then, for each word in each movie title, walk the trie from the beginning until you run out of matches (no match), or run out of words (at least one match).

I think that's the challenge ITA wants you to solve - recursing 300 levels deep into a list of 1000 movies will literally take forever. As you can see with the 2nd linked article, brute-force just doesn't work.

My original algo for the multi-word anagrams was nearly identical to what you described. The problem was long phrases: you can form 8,000 unique words from the letters in "William Jefferson Clinton." Even after removing duplicate words with the same characters (art = tar = rat), the list was still 7,000+ long. Add another 'r' or 's' to the original phrase and you're over 10,000 unique words. If I allow my algorithm to make 5-word anagrams, that's 10,000^5 (= 10^20) combinations for a single web query - absolutely impossible.

And even if I do magically return the 100k possible results in 0.1s, users can't parse through 100k results to find the ONE anagram that's funny or creative. I do have an idea of what I want to do. Maybe in a week or so, I'll have a demo working.


> I think that's the challenge ITA wants you to solve - recursing 300 levels deep into a list of 1000 movies will literally take forever. As you can see with the 2nd linked article, brute-force just doesn't work.

You're not recursing 300 levels deep; you abandon a subpath as soon as it doesn't match.

As for the anagrams, that sounds about right (if you install an anagram generator like wordplay, you can see that it generates a similar number of matches).

You may want to devise a compact representation for a partly calculated results list, and just show n per page, with "prev" and "next" links that encode the original phrase and the first or last phrase on the current page.

When the user clicks "next", you run the algorithm as usual, but at each step of the trie search, you only search letters that are greater or equal to the current letter in the last known phrase.
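A hedged sketch of that stateless "next page" idea, reduced to single words (the trie layout and cursor comparison are my own assumptions): re-run the search, but prune every branch that can only produce results at or before the cursor.

```python
from collections import Counter

END = '$'  # sentinel key marking a complete dictionary word

def build_letter_trie(words):
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = w
    return trie

def words_after(node, letters, cursor, prefix=''):
    """Yield spellable words strictly after `cursor`, in sorted order."""
    if END in node and prefix > cursor:
        yield node[END]
    for ch in sorted(k for k in node if k != END):
        # skip branches that sort entirely at or before the cursor
        if letters[ch] > 0 and prefix + ch >= cursor[:len(prefix) + 1]:
            remaining = letters.copy()
            remaining[ch] -= 1
            yield from words_after(node[ch], remaining, cursor, prefix + ch)

trie = build_letter_trie(["art", "rat", "tar", "ton"])
page2 = list(words_after(trie, Counter("rattan"), cursor="rat"))
```

With "rat" as the last word of the previous page, the resumed search skips the "a" and early "r" branches entirely and yields only "tar"; an empty cursor reproduces the full first page.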

You may also train a classifier like CRM114 to recognize "funny" and "boring" phrases, and omit the boring ones (except in your "prev" and "next" links!)


What do you mean by anagram solver? Find anagrams for a given word? Isn't that trivial by indexing words by the sorted version of the word? E.g. {"aep" -> ["ape", "pea"], ...}
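The indexing trick described above really is a few lines (this is my own sketch, not the commenter's code): file every word under its letters sorted alphabetically, then answer a query by sorting the query's letters and looking up the key.

```python
from collections import defaultdict

def build_index(words):
    """Map each word's sorted letters to the list of matching words."""
    index = defaultdict(list)
    for w in words:
        index[''.join(sorted(w))].append(w)
    return index

def anagrams_of(word, index):
    return index[''.join(sorted(word))]

index = build_index(["ape", "pea", "nap", "pan", "pet"])
```

A lookup for "ape" or "pea" hits the same "aep" bucket, so each query is a single hash lookup.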


I'm trying to build a better multi-word anagram solver. If you input a 32-character string, you end up with hundreds of thousands of potential anagrams, and most of them are pretty useless. For example, "Florence Nightingale" returns 52k results on:

http://wordsmith.org/anagram/anagram.cgi?anagram=Florence+Ni...

And most of them are pretty lame e.g. "A Fleecer Tingling Hon", "A Fleecer Longing Hint" etc. It'll take hours before you come across "Flit on, cheering angel" - I want to make this process faster/easier.


It's a fun project! I built one several years ago, after playing around with a bunch of different approaches. I wound up with a human-guided tool that seems to work pretty well for coming up with interesting ones:

http://anagramlogic.com/


Ah, interesting! This problem is NP-complete even if you don't specifically want anagrams that are interesting to a human.

Suppose that you only have a single-letter alphabet (say "a"). Now the input string can only contain a's, so it can be represented by a number n. The dictionary also consists only of words of a's, so it can be represented as a set D of natural numbers. Now the problem of finding an anagram is finding a subset A of D such that sum(A) = n. But this is the subset sum problem, which is NP-complete!
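A toy illustration of that reduction (my own example): over the one-letter alphabet, a dictionary word is just its length, so finding a multi-word anagram of "a" * n (each word used at most once) is exactly subset sum.

```python
def anagram_exists(n, word_lengths):
    """Naive subset-sum check: can some subset of lengths total n?"""
    reachable = {0}
    for length in word_lengths:
        reachable |= {r + length for r in reachable}
    return n in reachable

# dictionary {"aa", "aaa", "aaaaaaa"} -> lengths [2, 3, 7]
print(anagram_exists(5, [2, 3, 7]))   # "aa" + "aaa" covers "aaaaa"
print(anagram_exists(1, [2, 3, 7]))   # nothing spells a lone "a"
```

(The naive check above is exponential in the worst case, which is the point of the reduction.)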

How are you determining the interestingness of an anagram?


Use random partitioning? If you have more than 10 letters, randomly split the input string into roughly equal parts, and repeat that 1000 times.


I wrote this snippet ~10 years ago when learning Perl:

  for(<>){chop;$s{join'',sort split'',lc}.=" $_"}for(sort%s){/. /&&print"$_\n"}
Running against wordlist.txt was fun (anemic cinema, bedroom boredom, etc.)


Yes.


Until you get to multiple-word anagrammatic phrases, when it becomes a search problem.

(You could index a list of phrases the same way, but users wouldn't be able to type in their name and find out what phrases it jumbles into.)


I agree - this is compelling for a lot of the same reasons as the anagrams problem. I spent weeks playing with anagram code at one point, and like you I felt an immediate compulsion to start writing code for Sling Blade Runner. So far my (Ruby) code just does the obvious: it builds the graph connecting up the movies, then uses brute force to find all chains of length two, then three, etc. It bogs down at a chain length of about 11. With some quick-and-dirty optimizations and pruning heuristics, I'm getting chains in the mid-hundreds. I think this approach can be extended further, but it isn't particularly clever :-/
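The graph build described above can be sketched like this (the parent's actual code is Ruby; this Python version and its names are my own): title A gets an edge to title B when A's last word matches B's first word, and a depth-first search extends chains.

```python
from collections import defaultdict

def build_graph(titles):
    by_first = defaultdict(list)
    for t in titles:
        by_first[t.lower().split()[0]].append(t)
    # edge A -> B when A's last word equals B's first word
    return {t: by_first[t.lower().split()[-1]] for t in titles}

def longest_chain(graph, title, seen):
    """Exhaustive DFS -- this is the brute force that bogs down."""
    best = [title]
    for nxt in graph[title]:
        if nxt not in seen:
            candidate = [title] + longest_chain(graph, nxt, seen | {nxt})
            if len(candidate) > len(best):
                best = candidate
    return best

titles = ["Sling Blade", "Blade Runner", "Runner Runner"]
graph = build_graph(titles)
chain = longest_chain(graph, "Sling Blade", {"Sling Blade"})
```

On the real 1000-title list the DFS explodes combinatorially, which is why the pruning heuristics matter.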


Are you looking to pre-generate the results or generate them dynamically? I think MapReduce will only be effective for the former.


Aren't there map-reduce implementations that work in real time? I thought Google search worked like that. I do want it to work dynamically.


MapReduce is used to build the search index (constantly), and your searches are run against the index.

You might find some of these links useful:

http://wiki.apache.org/hadoop/ProjectDescription

http://wiki.apache.org/hadoop/HadoopIsNot


You can use map-reduce on your search by partitioning the starting keyspace alphabetically (26 letters / num nodes), although such an even partitioning will probably leave the search unbalanced. With some simple metrics on the starting-letter frequency of words, you could define a more balanced partitioning.



