How to Build a Popularity Algorithm You can be Proud of (linkibol.com)
133 points by 8plot on Sept 1, 2009 | 9 comments



Interesting overview; however, the author fails to address scalability properly. Some of the algorithms presented need to periodically recompute every item's score, which is a drawback if scalability is what you're after. A scalable algorithm computes each score once, on write, and never requires batch updates of older items.

See: http://code.google.com/appengine/articles/overheard.html
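
For anyone who doesn't click through, here's a minimal sketch of the compute-on-write idea. I'm using Reddit's published "hot" formula here, which may differ from what that article does, so the constants below are Reddit's, not the article's. The time bonus is measured from a fixed epoch rather than from "now", so each score is written once per vote and never re-aged:

    import math
    from datetime import datetime, timezone

    EPOCH = datetime(2005, 12, 8, 7, 46, 43, tzinfo=timezone.utc)  # any fixed reference point

    def hot_score(ups, downs, created_at):
        # Computed once, on write. Newer items simply start out with a
        # larger time bonus, so nothing ever needs batch recomputation.
        s = ups - downs
        order = math.log10(max(abs(s), 1))
        sign = 1 if s > 0 else -1 if s < 0 else 0
        seconds = (created_at - EPOCH).total_seconds()
        return round(sign * order + seconds / 45000, 7)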


After looking at a number of these algorithms, it seems like you really need to take each situation on its own terms. I like that Google example, since it's simple and scales well, as you said. But for more complicated situations you can do batch updates using Hadoop/MapReduce, assuming you don't have popular items whose scores need to be computed in real time.
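
A toy sketch of that batch route, assuming the raw input is a log of (item_id, timestamp) votes and an exponential half-life I picked arbitrarily; a real deployment would run the same two phases as a Hadoop job:

    import time
    from collections import defaultdict

    HALF_LIFE = 12 * 3600  # assumed decay half-life, in seconds

    def map_votes(vote_log):
        # map phase: emit one (item_id, timestamp) pair per vote
        for item_id, voted_at in vote_log:
            yield item_id, voted_at

    def reduce_scores(mapped, now=None):
        # reduce phase: sum exponentially decayed votes per item
        now = now or time.time()
        buckets = defaultdict(list)
        for item_id, voted_at in mapped:
            buckets[item_id].append(voted_at)
        return {item_id: sum(0.5 ** ((now - t) / HALF_LIFE) for t in stamps)
                for item_id, stamps in buckets.items()}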


I really liked this article. I've been playing around with some social-news-rating algorithms of my own, which are quite different from any of the ones listed here. One of these days I'll find the time to sit down and code a site around them...

Also... I'm pretty sure his argument in the "Dampening The Weighted Votes By Record Age" section is wrong. If you assume that every vote has the same weight (as on HN, Digg, etc.), then you can rearrange the terms so that it's possible to update a story's rating on the fly.
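
To make that rearrangement concrete (assuming the dampening curve is exponential, which is my assumption, not necessarily the article's): the decayed score at time t is N * exp(-(t - created)/tau), and since the exp(-t/tau) factor is shared by every story, it cancels out of the ordering. What's left is a sort key you update once per vote:

    import math

    TAU = 45000  # assumed decay time constant, in seconds

    def stored_rank(vote_count, created_epoch):
        # Decayed score: N * exp(-(t - created) / TAU). Taking logs,
        # ordering by it equals ordering by log(N) + created/TAU,
        # which is independent of t, so it can be updated in place
        # whenever a vote arrives and never batch-recomputed.
        return math.log(max(vote_count, 1)) + created_epoch / TAU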


> I've been playing around with some social-news-rating algorithms of my own

Where do you get the data for this? It's something I'd like to toy around with a bit, but I have no idea how to get data.


Some discussion on a similar, but less comprehensive, post: http://news.ycombinator.com/item?id=478632


What you are likely to have is something like this: 1,000 users, each of whom voted between 0 and 100 times, with 10 votes on average. Yet with this approach all you have is a bag of 10,000 votes. It doesn't matter what you do with that bag: all the information about how individual users voted is lost.
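
A sketch of the alternative, under a hypothetical schema of my own: store the individual (user_id, item_id) pairs instead of a per-item counter, so per-user signals survive aggregation and votes can later be weighted by what you know about each voter:

    from collections import defaultdict

    # toy data; in practice this would be a table of individual votes
    votes = [("alice", 1), ("bob", 1), ("alice", 2)]

    user_weight = defaultdict(lambda: 1.0)  # e.g. reputation or spam score per user

    def weighted_score(item_id):
        # possible only because the per-user detail was kept
        return sum(user_weight[user] for user, item in votes if item == item_id)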


None of these algorithms takes the number of views for each item into account. Wouldn't you define popularity by starting with the number of votes divided by the number of people who viewed the item?

This would also solve the reported issue of late-night news being ignored because no one is around to vote.
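
A sketch of that ratio, with made-up smoothing priors so an item with three views and one lucky vote doesn't rocket to the top:

    PRIOR_VOTES = 5     # assumed pseudo-counts encoding a 5% baseline
    PRIOR_VIEWS = 100   # vote rate; tune to your site's actual rate

    def popularity(votes, views):
        # fraction of viewers who voted, pulled toward the baseline
        # until the item has accumulated enough views
        return (votes + PRIOR_VOTES) / (views + PRIOR_VIEWS)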

Anyway, great article.


I think that would be a helpful factor in theory, but in practice it would be too inconsistent to rely on. Using Reddit as an example (because I have no familiarity with HN's source): there is client-side JavaScript that intercepts clicks on links and appends story IDs to one of the Reddit cookies. The next time a request goes to Reddit, they get a list of recent clicks.

This can fail in a lot of ways, most obviously if JavaScript or cookies are turned off. Also, the cookie isn't sent to Reddit until I load another page, so if I read an entire page of links and then close the browser without refreshing, the cookie never gets sent. Plus, the script clips the list at around 20 elements, so even if I did refresh Reddit, it wouldn't know I'd clicked on more than 20.

My point isn't the numerous weaknesses of Reddit's particular approach; it's that view counts are self-reported information, which must necessarily be treated as suspect and incomplete. If an article on an obscure programming language pops up here, and every single person who reads it uses Lynx with cookies turned off for security, there might be no opportunity to record any views at all.


I would think that would bias your algorithm towards articles with great headlines but not really great content.



