My Python Code for the Netflix Prize

Nogwater · on Aug 25, 2012

Here's mine if anyone is interested. I wrote it in D and haven't looked at it in years. I'm sure it's not usable as-is, but it might be fun anyway.

https://github.com/nogwater/NetflixPrizeD

The algorithm is based on Simon Funk's blog post here: http://sifter.org/~simon/journal/20061211.html

For me, the best part was squeezing the data and indexes into memory. :)

alexbw · on Aug 25, 2012

@tuananh I've got the dataset stored away, but I don't know if I'm legally allowed to post it. Would love if someone could produce proof one way or the other.

@viraj_shah I spent about 6 months working on the project before I had to stop to concentrate on my schoolwork (I was a senior in collge at the time). I think it would have been impossible to do this for myself without Cython. If it were to happen today, I would probably be writing in PyCuda, or with Numba, and it would be much, much, MUCH more succinct.

viraj_shah · on Aug 25, 2012

@alexbw Thanks for the info, great work!

arekp · on Aug 25, 2012

I wrote a 195-page monograph on the Netflix Prize, for people interested in that sort of stuff: http://arek-paterek.com/book

rabidsnail · on Aug 25, 2012

Comic Sans?

arekp · on Aug 25, 2012

right on :)

the beauty of self-publishing

richardlblair · on Aug 25, 2012

You indent by 8 characters.... I wanted to read your code but this will make my eyes bleed.

From pep 8: "Use 4 spaces per indentation level."

beaumartinez · on Aug 25, 2012

Close but no cigar. He's indenting with tabs.

skystorm · on Aug 25, 2012

Very nice. It might be helpful to (briefly) describe the actual techniques you tried in the readme file? At least that's the first thing I looked for...

viraj_shah · on Aug 25, 2012

This was incredibly kind of you to post up. It is great to see it public domain as many can learn from it. The Cython code looks scary though - 18k lines! May I ask how long you spent on this?

andreasvc · on Aug 25, 2012

Nitpick: binary blobs like .pyc and .so don't belong in a code repository. Instead you would put a makefile or setup.py to compile the .pyx files.

jdleesmiller · on Aug 25, 2012

I also worked on this at uni and had lots of fun -- those lessons certainly look familiar! We were trying to mine Wikipedia for more information on the movies. The code's here:

http://code.google.com/p/wikipedia-netflix/wiki/WikipediaNet...

It includes Wikipedia parsing stuff and a fairly fast C++ implementation of the very cool BellKor kNN algorithm.

textminer · on Aug 25, 2012

Love each of the individual BellKor approaches (http://www2.research.att.com/~volinsky/netflix/ProgressPrize...) for finding recommendations in the space of movies or users-- an MDS embedding, a PCA whitening, an NMF factorization by alternating least squares. Each of those hunches seems like the true art in these problems. The blending 100 of them together is far less interesting to me, though.

Yet that seems to be the sort of jockeying and tweaking these problems (seen now in Kaggle contests) seem to require. Is there an art or science then to the subsequent blending? Does one develop a better intuition for the problem at that point, or am I entirely missing the point of most ensemble methods (predictiveness over parsimonious understanding)?

beagle3 · on Aug 26, 2012

> Is there an art or science then to the subsequent blending?

You could regard this as an application of the "Smoothed expectation theorem", Saying E[X] = E[E[X|Y]]. That is, if you are trying to compute the expectation of something, you can make it depend on anything else, and compute the inner expectation with respect to that. Might seem trivial or useless, but it is wildly applicable and often significantly simplifies computations.

One of the practical implications is that if you're not sure about something (underlying model, specific parameters), just apply some prior distribution and compute the expectation over that -- it is essentially guaranteed* to provide a better result than trying to pick the correct setup.

Although I'm not sure what the interpretation here would be.

* - so long as the entropy of your prior is not more wrong than the entropy of your hyper-parameters. This is often the case.

textminer · on Aug 26, 2012

Yeah, the tower property! That made my day. Thanks for cleanly giving motivation and mathematical beauty to something that irked me up until now. Which is probably the problem of having your aesthetics drive you in the first place.

jdleesmiller · on Aug 25, 2012

I don't know of any deeper interpretation of blending. I guess one could look to the 'wisdom of crowds' for anecdotal evidence. There might be connections with consensus and voting systems, but those are mostly discrete AFAIK. It is at least a pragmatic way of exploiting the different biases of the members in the ensemble. In our case, the Netflix predictions we made based on Wikipedia data scored worse on their own than ratings-only predictions, but they attracted some weight in the blend and made the overall score (marginally) better.

bjano · on Aug 25, 2012

I also tried mining Wikipedia, but I found it's much more reliable to use the categories of the movies instead of the links.

The links are very often unrelated: "unlike in the movie X", "who by then had already became famous playing X in movie Z", etc. Instead of using the entire article, I only used links from the very first sentence. (" X is an Italian 1984 drama starring Z, etc...") These links could be connected to categories.

The problem I didn't expect was that the "hierarchy" of the categories is a mess. Seems easy in concept: "comedy-drama films" is a subcategory of both "comedy films" and "drama films", so you can use all the parent categories. In practice you get to completely unrelated categories in a few steps. The biggest challenge was cleaning this mess up algorithmically. Once that was done, I had a very nice taxonomy of the movies.

The result didn't come close anywhere to the top contestants, but it was a very interesting learning experience.

JacobiX · on Aug 25, 2012

Thanks for posting this. Your code combines two successful approachs : a latent factor model (SVD) and a neighborhood model :)

Here's my implementation of a recommender algorithm in C if someone is interested : https://github.com/GHamrouni/Recommender

surine · on Aug 25, 2012

Mine is at https://github.com/hbcdev/Netflix in C++, got to the top 500. It runs just inside my 8GB PC :)

tlocke · on Aug 25, 2012

I had a look though this and was confused by this line:

ratings[ratings<1.0] = 1.0

in

https://github.com/alexbw/Netflix-Prize/blob/master/src/pred...

Is it specific to NumPy? Or perhaps a Python trick I haven't seen before?

malkarouri · on Aug 25, 2012

Specific to NumPy. Boolean indexing.

http://www.scipy.org/Cookbook/Indexing#head-86055279f6592d36...

beambot · on Aug 25, 2012

Yep. The indexing and slicing are quite powerful. For those familiar with MATLAB, this is a fantastic resource: http://www.scipy.org/NumPy_for_Matlab_Users

tlocke · on Aug 25, 2012

Got it, thanks for the info and pointers everyone.

jdleesmiller · on Aug 25, 2012

I believe that's an example of 'logical indexing', which is a useful feature in many vector-oriented languages / packages:

http://blogs.mathworks.com/steve/2008/01/28/logical-indexing...

tuananh · on Aug 25, 2012

great for ref. however i can't found the dataset anywhere on the internet :(

alexholehouse · on Aug 25, 2012

Netflix had to pull the dataset after a researchers at U Texas were able to de-anonymize the dataset, shown in their paper Robust De-anonymization of Large Sparse Datasets.[http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf]

raheemm · on Aug 25, 2012

Good readme file. Liked the lessons learned.

marklit · on Aug 25, 2012

Does anyone have any thoughts on the lack of conformity to PEP8? His code works and it's valid python but I feel that it's difficult to read.