More data usually beats better algorithms (anand.typepad.com)
64 points by toffer on March 31, 2008 | 15 comments



Not true - from the current leaders of the Netflix Prize:

Our experiments clearly show that once you have strong CF models, such extra data is redundant and cannot improve accuracy on the Netflix dataset.

http://glinden.blogspot.com/2008/03/using-imdb-data-for-netf...


The students used a simple algorithm and got nearly the same results as the BellKor team. So the extra data isn't redundant if it enables a simpler algorithm to perform as well as a more complicated one, even if the complicated algorithm gets no benefit from the extra data.


What's the name of the student team and what was their result?


Bingo... I've talked to Yehuda, and he does know what he's talking about. Given that he works for Bell Labs, it might be within the realm of possibility that he actually tried using movie metadata.

If those students have actually come close to BellKor's score, they should submit their qualifying set - Yehuda already has a 9.05% improvement.


Would you, as a human, prefer a model that gives you accurate predictions with less data, or a simpler model that needs more data to be accurate?

How is an algorithm any different?


Well, put it this way: getting good predictions without a lot of data takes a whole team of researchers at Bell Labs working on it for over a year; with the data, it's a student project.


That's a red herring, and here's why: simpler models take less data to train.

Every orthogonal parameter that you add to a statistical model expands the probability space of that model exponentially. If you keep the amount of training data fixed, but keep adding parameters, you'll very quickly arrive at a point where your model will fit your training data exceptionally well, but the result will be meaningless, because you've over-fit your data.
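
To make that concrete, here's a minimal sketch of the effect (the target curve, noise level, and polynomial degrees are made up for illustration): keep the training data fixed, keep adding parameters, and watch training error fall while held-out error blows up.

    import numpy as np

    rng = np.random.default_rng(0)

    # Fixed amount of training data: 10 noisy samples of a simple curve.
    x_train = np.linspace(0, 1, 10)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)

    # Held-out points from the same underlying curve.
    x_test = np.linspace(0, 1, 100)
    y_test = np.sin(2 * np.pi * x_test)

    # Same data, more and more parameters (polynomial degree).
    for degree in (1, 3, 9):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

    # Typical outcome: the degree-9 fit drives training error toward zero
    # while held-out error explodes, i.e. the over-fitting described above.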

Now, I suppose you could make the case that new models are occasionally discovered that allow for more efficient uses of existing data. However, this is a decidedly rare event. Most of the time, gains are made in exactly the way the author suggests -- by adding new sources of data to well-established techniques.


Well, that's what Google keeps saying. Specifically, I think Peter Norvig has said, over and over, that he's never had access to this much data before and that it's fascinating to him.

Ah, Peter Norvig, responsible for hours of my time whiled away on his web site with all sorts of knowledge porn.


One talk in particular in which he focuses on this idea, complete with facts and figures from the research literature, is one he gave a while ago called "Theorizing from Data": http://www.youtube.com/watch?v=nU8DcBF-qo4 . You should definitely watch it if you haven't seen it before.

Norvig states his opinion slightly differently: of course being smart about algorithms is good. But until you have a lot of data, you often can't even fairly evaluate different algorithms. It's only when you're no longer getting significant gains from more data that you should start thinking about being an algorithm smartypants.
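
A toy illustration of that point, assuming scikit-learn and synthetic data (the two models and the crossover are my own example, not figures from the talk):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a real dataset.
    X, y = make_classification(n_samples=25000, n_features=30,
                               n_informative=10, random_state=0)
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=5000, random_state=0)

    # Evaluate the same two algorithms at increasing training-set sizes.
    for n in (100, 1000, 10000):
        for model in (LogisticRegression(max_iter=1000),
                      RandomForestClassifier(random_state=0)):
            model.fit(X_pool[:n], y_pool[:n])
            acc = model.score(X_test, y_test)
            print(f"n={n:>6}  {type(model).__name__:<24} accuracy={acc:.3f}")

    # The winner at n=100 is not necessarily the winner at n=10000, so a
    # comparison made on too little data can pick the wrong algorithm.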


Now that is just a marvelous restatement of the canonical optimization principle!


"Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB)."

I had no idea IMDB had so much genre data. E.g., a "keywords" page for every movie [http://www.imdb.com/title/tt0062622/keywords] and, for every keyword, maps of (a) related keywords and (b) movies that mention it [cf. http://www.imdb.com/keyword/metaphysical/].

Very cool.
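
For a sense of how that keyword data could feed a simple recommender, here's a rough sketch (the Jaccard similarity and the toy keyword sets are my own illustration, not Team B's actual method):

    # Hypothetical movie -> keyword sets, as scraped from IMDB keyword pages.
    keywords = {
        "2001: A Space Odyssey": {"metaphysical", "space", "artificial-intelligence"},
        "Solaris":               {"metaphysical", "space", "memory"},
        "Airplane!":             {"parody", "disaster-film"},
    }

    def jaccard(a, b):
        # Overlap of two keyword sets: 0.0 (disjoint) to 1.0 (identical).
        return len(a & b) / len(a | b)

    def most_similar(title, catalog):
        # Rank every other movie by keyword overlap with `title`.
        others = ((other, jaccard(catalog[title], kws))
                  for other, kws in catalog.items() if other != title)
        return max(others, key=lambda pair: pair[1])

    print(most_similar("2001: A Space Odyssey", keywords))
    # ('Solaris', 0.5): two shared keywords out of four distinct ones.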


More like bad/insufficient data defeats even good algorithms. When we recommend movies to friends, we are often using very different and more useful information than what's in the Netflix database.


I did the same thing last semester, inspired by a class at Cornell. We came up with a very, very simple graph-based algorithm that gets above the competition's baseline (not quite BellKor level, but there's lots of tweaking left to be done).
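
Roughly, the flavor of algorithm I mean is a one-hop walk over the user-movie graph, something like this (toy data and weighting are just for illustration, not our actual project code):

    # Toy ratings graph: user -> {movie: stars}.
    ratings = {
        "alice": {"Heat": 5, "Fargo": 4},
        "bob":   {"Heat": 5, "Fargo": 5, "Clue": 2},
        "carol": {"Fargo": 1, "Clue": 4},
    }

    def predict(user, movie):
        # One hop on the bipartite user-movie graph: average what co-raters
        # gave `movie`, weighted by how many movies they share with `user`.
        total = weight = 0.0
        for other, their in ratings.items():
            if other == user or movie not in their:
                continue
            shared = len(ratings[user].keys() & their.keys())
            if shared:
                total += shared * their[movie]
                weight += shared
        return total / weight if weight else None

    print(predict("alice", "Clue"))  # (2*2 + 1*4) / 3, about 2.67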

Now -- I was under the impression that using extra proprietary data (like IMDB) is beyond the bounds of the competition. Can anyone shed some light on this? Maybe I should pick up the project again!


"Why provide rating dates and movie names?"

Other datasets provide it. Cinematch doesn’t currently use this data. Use it if you want.

As it happens we provided years of release for all but a few movies in the dataset; those seven movies have NULL as the "year" of release. Sorry about that.

"Why not provide other data about the movies, like genres, directors, or actors?"

We know others do. Again, Cinematch doesn’t currently use any of this data. Use it if you want.

http://www.netflixprize.com/faq


> I was under the impression that using extra proprietary data (like imdb) is beyond the bounds of the competition.

Kobayashi Maru Maneuver[1] time. Really though, I'm actually curious about this too. Anyone know if it's definitely against the rules?

1. http://en.wikipedia.org/wiki/Kobayashi_Maru#Other_test-taker...



