Any such demonstrations would be totally anecdotal though.
There's no way to say its recommendations are quantitatively better; you just have to try it and see whether their machine learning stuff beats Apple's ML stuff.
Yes, you could definitely gather non-anecdotal, statistical evidence, but I think you're missing dschobel's main point. I think he's making a point about the inherently subjective nature of "quality of recommendation", not suggesting the obviously stupid idea that it's impossible to gather statistical data on a recommendation engine.
And to that point, I have to agree with him. At least, most of the obvious choices for a quality measurement seem quite flawed. Here are the first few I thought of:
- Apps downloaded (this is more about how good you are at getting people to download an app, and recommendations are only a small part of that)
- Subjective direct rating (averaging subjective experience doesn't tell you what's good, it just tells you what people perceive as good)
- Length of usage (measures how much fun you made the rating "game", not how good your recommendations are)
This isn't an accident. Recommendation quality is inherently tricky, because if you knew how to figure out what you should recommend to any given person...that would be your recommendation engine! So you're always using a proxy for goodness, rather than a direct measurement, and that abstraction tends to leak.
Presumably you already know how Netflix does that and you're asking this rhetorically. If you are actually interested in how Netflix evaluated submissions, you can read all about it on their contest page.
I thought about the Netflix example, which is obviously related, while writing my post. Ultimately the Netflix competition isn't for a recommendation engine; it's for a prediction engine. They use that prediction engine to produce recommendations, but that's a separate phase. The top N predicted ratings != the best N things to recommend at this moment, though obviously it's useful to know predicted ratings for movies if you're writing a recommender.
A recommendation engine should not just recommend the highest predicted rated apps given that the user downloads the app. As an extremely simple example, even if you habitually rate games much higher than other apps, it shouldn't recommend you 100% games. You probably want to see other things sometimes too. The perfect set of apps on your phone would not be entirely games; you do like to twitter after all, even if you're not entirely fond of any of the twitter apps out right now.
The perfect recommendation engine would be psychic; it would know the set of apps on your phone that would make you maximally happy. That's obviously not the same as predicting exactly what you'd rate each app (though presumably you could do that easily).
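To make the prediction-vs-recommendation distinction concrete, here's a toy sketch (all data and the category-penalty heuristic are hypothetical, not anything Apple or Netflix actually does): given predicted ratings, a recommender might deliberately pass over some high-predicted apps to keep the list from being 100% games.

```python
def recommend(predictions, categories, n, penalty=0.5):
    """Greedily pick n apps, discounting each candidate's predicted
    rating by how many apps of its category were already picked.
    predictions: {app: predicted_rating}, categories: {app: category}."""
    picked = []
    counts = {}  # how many picks so far, per category
    candidates = set(predictions)
    for _ in range(min(n, len(candidates))):
        best = max(
            candidates - set(picked),
            key=lambda a: predictions[a] - penalty * counts.get(categories[a], 0),
        )
        picked.append(best)
        counts[categories[best]] = counts.get(categories[best], 0) + 1
    return picked

# Hypothetical user who rates games highest, but also uses twitter:
preds = {"GameA": 4.8, "GameB": 4.7, "GameC": 4.6, "TwitterX": 4.3}
cats = {"GameA": "games", "GameB": "games", "GameC": "games", "TwitterX": "social"}
print(recommend(preds, cats, 3))  # → ['GameA', 'TwitterX', 'GameB']
```

Ranking purely by predicted rating would return three games here; the penalized version surfaces a twitter app, even though its predicted rating is lower.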
So if I take the history of apps rated/downloaded by a user and give the app recommender only half of it to base its recommendations on, is that a prediction engine or a recommendation engine?
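That holdout setup can be sketched as follows (a minimal illustration with made-up data and a trivial baseline predictor, not any real system): hide half a user's history, predict ratings for the hidden half, and score with RMSE. Note that this grades prediction accuracy; it doesn't tell you whether actually surfacing those apps would have made the user happy.

```python
import random

def rmse_holdout(history, predict, seed=0):
    """Split one user's history in half, predict ratings for the
    hidden half from the visible half, and return the RMSE.
    history: {app: actual_rating}; predict: fn(visible, app) -> rating."""
    apps = sorted(history)
    random.Random(seed).shuffle(apps)  # fixed seed for a repeatable split
    visible = {a: history[a] for a in apps[: len(apps) // 2]}
    hidden = {a: history[a] for a in apps[len(apps) // 2 :]}
    errs = [(predict(visible, a) - r) ** 2 for a, r in hidden.items()]
    return (sum(errs) / len(errs)) ** 0.5

# Trivial baseline: predict the mean of the visible ratings.
mean_predict = lambda visible, app: sum(visible.values()) / len(visible)
hist = {"AppA": 5, "AppB": 3, "AppC": 4, "AppD": 2}
print(round(rmse_holdout(hist, mean_predict), 3))
```

This is essentially the Netflix Prize protocol in miniature, which is why the evaluation measures prediction quality rather than recommendation quality.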
I'm surprised the DirectEdge folks haven't done anything on app recommendations yet: not only will it be a huge hit if it works well, it'll bring in some great PR as well.