The students used a simple algorithm and got nearly the same results as the BellKor team. So the extra data isn't redundant if it enables a simpler algorithm to perform as well as a more complicated one, even if the complicated algorithm gets no benefit from the extra data.
bingo... I've talked to Yehuda and he does know what he's talking about. Given that he works for Bell Labs, it might be within the realm of possibility that he actually tried using movie metadata.
If those students have actually come close to BellKor's score, they should submit their qualifying set - Yehuda already has a 9.05% improvement.
Would you, as a human, prefer a model that gives you accurate predictions with less data, or a simpler model that needs more data to be accurate?
Well, put it this way: getting good predictions without a lot of data takes a whole team of researchers at Bell Labs working on it for over a year; with the data, it's a student project.
That's a red herring, and here's why: simpler models take less data to train.
Every orthogonal parameter that you add to a statistical model expands the probability space of that model exponentially. If you keep the amount of training data fixed, but keep adding parameters, you'll very quickly arrive at a point where your model will fit your training data exceptionally well, but the result will be meaningless, because you've over-fit your data.
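To make that concrete, here's a minimal sketch (my own illustration, not from this thread) using polynomial regression: with the training set held fixed, adding parameters keeps pushing training error down while held-out error eventually gets worse, i.e. the model is fitting noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed, small training set: a noisy linear relationship.
x_train = rng.uniform(-1, 1, 15)
y_train = 2 * x_train + rng.normal(0, 0.3, 15)

# Held-out data drawn from the same process.
x_test = rng.uniform(-1, 1, 200)
y_test = 2 * x_test + rng.normal(0, 0.3, 200)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Keep the training data fixed, but keep adding parameters (polynomial degree).
for degree in (1, 3, 6, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = rmse(y_train, np.polyval(coeffs, x_train))
    test_err = rmse(y_test, np.polyval(coeffs, x_test))
    print(f"degree {degree:2d}: train RMSE {train_err:.3f}, test RMSE {test_err:.3f}")
```

Training RMSE keeps shrinking as the degree grows, but test RMSE stops improving and starts climbing: the extra parameters buy you nothing without extra data.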
Now, I suppose you could make the case that new models are occasionally discovered that allow for more efficient uses of existing data. However, this is a decidedly rare event. Most of the time, gains are made in exactly the way the author suggests -- by adding new sources of data to well-established techniques.
Well - that's what Google keeps saying (specifically, I think Peter Norvig has said it over and over - that he's never had access to this much data before and it's fascinating to him).
Ah, Peter Norvig, responsible for hours of my time whiled away on his web site with all sorts of knowledge porn.
One talk in particular in which he focuses on this idea, including with facts and figures from research literature, is a talk he gave a while ago called "Theorizing from Data": http://www.youtube.com/watch?v=nU8DcBF-qo4 . You should definitely watch it if you haven't seen it before.
Norvig states his opinion slightly differently: of course being smart about algorithms is good, but until you have a lot of data, you often can't even fairly evaluate different algorithms. It's only when you're no longer getting significant gains from more data that you should start thinking about being an algorithm smartypants.
"Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB)."
More like bad/insufficient data defeats even good algorithms. When we recommend movies to friends, we are often using very different and more useful information than what's in the Netflix database.
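For what it's worth, here's a hedged guess at what "very simple algorithm plus IMDB genres" could look like - the names and numbers are made up and this is not Team B's actual method - a per-user genre-average predictor:

```python
from collections import defaultdict

# Toy data standing in for Netflix ratings joined with IMDB genres.
# (user, movie, rating); movie -> set of genres. Purely illustrative.
ratings = [
    ("alice", "Heat", 5), ("alice", "Ronin", 4), ("alice", "Clueless", 2),
    ("bob", "Clueless", 5), ("bob", "Amelie", 4), ("bob", "Heat", 2),
]
genres = {
    "Heat": {"Action", "Crime"}, "Ronin": {"Action", "Thriller"},
    "Clueless": {"Comedy", "Romance"}, "Amelie": {"Comedy", "Romance"},
}

# Accumulate each user's rating sum and count per genre, plus an overall fallback.
user_genre = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
user_overall = defaultdict(lambda: [0.0, 0])
for user, movie, r in ratings:
    user_overall[user][0] += r
    user_overall[user][1] += 1
    for g in genres.get(movie, ()):
        user_genre[user][g][0] += r
        user_genre[user][g][1] += 1

def predict(user, movie):
    """Average of the user's mean rating in each of the movie's genres,
    falling back to the user's overall mean (or a global prior) otherwise."""
    means = []
    for g in genres.get(movie, ()):
        total, n = user_genre[user][g]
        if n:
            means.append(total / n)
    if means:
        return sum(means) / len(means)
    total, n = user_overall[user]
    return total / n if n else 3.0  # global prior for cold start

print(predict("alice", "Amelie"))  # low: Alice's comedy/romance ratings are low
print(predict("bob", "Ronin"))     # low: Bob's action ratings are low
```

The point isn't that this particular predictor is good; it's that even something this crude can exploit information (genre affinity) that a ratings-only model has to infer indirectly.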
I did the same thing last semester, inspired by a class at Cornell. We came up with a very, very simple graph-based algorithm that gets above the competition's baseline (not quite BellKor level, but there's lots of tweaking left to be done).
Now -- I was under the impression that using extra proprietary data (like IMDB) is beyond the bounds of the competition. Can anyone shed some light on this? Maybe I should pick up the project again!
Other datasets provide it. Cinematch doesn’t currently use this data. Use it if you want.
As it happens we provided years of release for all but a few movies in the dataset; those seven movies have NULL as the "year" of release. Sorry about that.
"Why not provide other data about the movies, like genres, directors, or actors?"
We know others do. Again, Cinematch doesn’t currently use any of this data. Use it if you want.
Our experiments clearly show that once you have strong CF models, such extra data is redundant and cannot improve accuracy on the Netflix dataset.
http://glinden.blogspot.com/2008/03/using-imdb-data-for-netf...