How Not to Sort by Average Rating (2009) (evanmiller.org)
150 points by jmilloy on July 9, 2015 | 59 comments



Also discussed in Cameron Davidson-Pilon's Bayesian methods for Hackers in the context of Reddit ups/downs: http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabil...

For Amazon, though, which is the example in Evan Miller's post, I don't really get why you'd first dichotomize the five-star rating into positive vs. negative and then use Wilson intervals. Just construct a run-of-the-mill 95% confidence interval for the mean of a continuous distribution and sort by the (still plausible) worst case scenario a.k.a. the lower bound of that: `mean - 1.96 * SE`, where the standard error is `SE = stddev(scores)/sqrt(n)`.

Because of the central limit theorem, you can do this even if the scores themselves are not normally distributed, and it will still work out.
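A minimal sketch of that ranking in Python, with made-up rating lists purely for illustration:

```
# Rank items by the lower bound of a 95% normal confidence interval
# on the mean star rating (mean - 1.96 * SE), as described above.
from statistics import mean, stdev
from math import sqrt

def ci_lower_bound(scores, z=1.96):
    n = len(scores)
    if n < 2:
        return 0.0  # not enough data to estimate a standard error
    se = stdev(scores) / sqrt(n)
    return mean(scores) - z * se

# Hypothetical items with invented ratings, just to show the ordering.
items = {
    "widget_a": [5, 4, 5, 3, 5, 4, 5],
    "widget_b": [5, 5],                      # high mean, but little evidence
    "widget_c": [3, 4, 3, 4, 3, 4, 3, 4],
}

for name in sorted(items, key=lambda k: ci_lower_bound(items[k]), reverse=True):
    print(name, round(ci_lower_bound(items[name]), 3))
```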


For better accuracy with small samples you could use the multinomial distribution instead. The covariance matrix for the rating probabilities can be found here for example: http://www.math.wsu.edu/faculty/genz/papers/mvnsing/node8.ht... Then the variance for the expected rating can be calculated as a weighted sum of the values in the covariance matrix.
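A sketch of that calculation, assuming a 1-5 star scale and invented counts: under the multinomial model the covariance of the estimated probabilities is (diag(p) - p pᵀ)/n, and the variance of the expected rating is the weighted sum sᵀ Σ s over the star values s.

```
# Treat the observed star counts as a multinomial sample, build the
# covariance matrix of the estimated rating probabilities, and propagate
# it to the expected rating. Example counts are invented.
import numpy as np

def rating_ci_lower(counts, stars=(1, 2, 3, 4, 5), z=1.96):
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts / n                              # estimated category probabilities
    cov = (np.diag(p) - np.outer(p, p)) / n     # multinomial covariance of p-hat
    s = np.asarray(stars, dtype=float)
    expected = s @ p                            # expected star rating
    var = s @ cov @ s                           # weighted sum over the covariance matrix
    return expected - z * np.sqrt(var)

print(rating_ci_lower([2, 1, 5, 10, 30]))   # mostly 5-star reviews
print(rating_ci_lower([0, 0, 0, 1, 2]))     # good score, but only 3 reviews
```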

These companies really should be hiring statistics consultants instead of relying on the intuitions of their programmers.


I'd prefer to just treat scores as continuous and correct using `t_ppf(.975, n-1)` instead of the normal approximation (1.96) but I suppose working from a multinomial distribution would give pretty similar results.
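A small sketch of that correction, assuming SciPy is available to provide the t quantile the parent calls `t_ppf`, with an invented score list:

```
# Use a Student-t quantile instead of the fixed 1.96 normal quantile.
from scipy.stats import t
import numpy as np

scores = [5, 4, 5, 3, 4]                 # invented ratings
n = len(scores)
se = np.std(scores, ddof=1) / np.sqrt(n)
t_crit = t.ppf(0.975, df=n - 1)          # ~2.78 for n=5, vs 1.96 for the normal
lower = np.mean(scores) - t_crit * se
print(round(lower, 3))
```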


You're still relying on the central limit theorem (i.e. a reasonable amount of data) : using t instead of z just corrects for the fact that you only have sample variances instead of population variances. However, I suppose it's not unreasonable to assume that the ratings are likely to have a bell shaped distribution (which could be checked), so the normal/t approximation is probably going to be OK.


Ah yes, true. Let's call it a bias/variance tradeoff ;-)


I have a fundamental problem with democratic voting systems: whatever the general view likes tends to come out on top, hence cat pictures on Reddit. The most philosophically elegant solution I've encountered so far is "quadratic voting" (see https://news.ycombinator.com/item?id=9477747), where every user has a limited number of credits to spend per time period and every vote has a quadratic cost.

Assume a user obtains 1,000 karma points a month. If he merely likes a post, he gives it 1 vote, which costs him 1 karma. If he strongly wants one post to go up, he can spend a maximum of 31 votes on it. This way minorities also get extra influence on the voting process.
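A toy sketch of that cost rule; the 31-vote cap follows directly from the quadratic cost (31² = 961 ≤ 1000 < 32² = 1024):

```
# Quadratic voting cost: casting v votes on one post costs v**2 karma,
# so a 1,000-karma budget caps a user at 31 votes on a single post.
from math import isqrt

def vote_cost(votes):
    return votes ** 2

def max_votes(budget):
    return isqrt(budget)       # largest v with v**2 <= budget

print(vote_cost(1), vote_cost(31), max_votes(1000))   # 1 961 31
```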

The requirement is that each user has only one account, enforced e.g. by some form of payment for the 1,000 karma to prevent vote fraud through fake accounts. Maybe using Bitcoin, if it picks up, to avoid privacy problems.

Do you think this will hinder the workings of sites like reddit?


What I always thought was that there really should be some user-based weighting system. Like, if a user upvotes 90% of the things he sees, his upvotes are probably worth less than upvotes by someone who upvotes only 1% of the posts he sees.

Same thing applies to things like Yelp reviews. Maybe a user with close to a 5-star lifetime rating average should have his reviews "renormalized" to 3's because his standards are probably just lower than the guy with a 1-star average.
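A rough sketch of one reading of that renormalization idea, recentring each reviewer's ratings around the 3-star midpoint; the user histories below are invented:

```
# Shift each user's ratings so their personal average sits at the scale
# midpoint (3 stars). This is one naive interpretation, for illustration.
def renormalize(user_ratings, midpoint=3.0):
    user_mean = sum(user_ratings.values()) / len(user_ratings)
    shift = midpoint - user_mean
    return {item: rating + shift for item, rating in user_ratings.items()}

generous = {"cafe": 5, "diner": 5, "bar": 4}    # lifetime average ~4.7
harsh = {"cafe": 2, "diner": 1, "bar": 1}       # lifetime average ~1.3

print(renormalize(generous))   # the 4-star review now reads as below this user's norm
print(renormalize(harsh))      # the 2-star review now reads as above this user's norm
```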

The problem is that there are so many other factors here (maybe the 5-star average person only visits (? or just reviews) really good places). Maybe the crazy upvoter just spends more time reading each page on Reddit. These are complicating factors that are hard to predict and if the simple case is working, why try? If there were a simple, clearly better way of voting/rating, it would be done.


Yes, Bayesian weighting. I feel that every kind of poll needs Bayesian correction to calculate how much information an opinion really carries.

The problem of people only visiting good places isn't really a problem. If a person only visits good places, he'll be perfectly able to differentiate those good places from each other, and can say one of them is bad. Meanwhile, somebody who visits both good and bad places has a say on which places are good and which are bad.


A hybrid variant: 1,000 points/month, 100/day (yes, higher than the monthly average). Exceeding either one starts reducing the weight of your total votes.

The periods should probably be rolling averages and apply to weights for CURRENT votes. Since early voting activity has undue influence, often in the first few minutes of a piece of content's existence, retroactively deflating month-old ratings doesn't do much.

The idea is to enable reasonable inputs for a time, then start washing them out.

The deflation factor might be applied more broadly across other indicators (IP blocks, etc.).

Or ratings factored for conformance with stated site moderation goals. See my longer top-level comment.


> Same thing applies to things like Yelp reviews. Maybe a user with close to a 5-star lifetime rating average should have his reviews "renormalized" to 3's because his standards are probably just lower than the guy with a 1-star average.

Another problem is the perception of star ratings. It seems like 5 (maybe 4 also) is the only "positive" rating for many people. Anything less and it might as well be a 1.


Of course, there's many areas that I didn't even mention. Another thing in the same vein for upvotes that I sometimes think about is, what is the meaning of upvoting? Does it mean, "I like this," or can it also mean "I think this submission should be higher?" Maybe I think too much but I've refrained from upvoting posts I like because I don't think they should be higher than their current position.


Could the design tell the user the meaning of an upvote ("I like this" / "I think this submission should be higher")? The significance of an upvote could be either of the two depending on the site. Then one could also sort accordingly.


Criticker does this well. But as you say, I try not to waste time watching movies I don't think I will like, so there is a good reason my ratings are biased towards the high end.


But this encourages downvoting, which is something you definitely do not want.


Why do you not want downvoting? Isn't that a desirable part of gathering opinions?


> Do you think this will hinder the workings of sites like reddit?

I do. The user will think "Well, given that I only have X to spend, I'd better be careful what I vote for." That will lead to fewer votes being placed, and thus reduced interaction / community activity.

It's the same reason you don't want to cap the amount of submissions or comments someone makes. A flood control (max X per Y time period) to stop bots and spammers, sure, but no hard cap - or an inhumanly high one at least, bearing in mind that some people are indeed inhuman when it comes to voting/commenting.


I agree that showing the user that they are expending some sort of resource by voting will cause them to act differently or be more conservative with it.

An idea I've had is handling this behind the scenes by weighing the votes accordingly. Let's say each user only has 1,000 "points" but infinite votes. Votes 1,000 times? Each vote == 1 point. Just vote once? That vote is worth 1,000x as much as the other guy's. Spam a million votes on everything, and each one is basically worthless.

Of course, handling it exactly like that is way too simple; it would need to be balanced more carefully. In that naive example, someone who reddits for 5 minutes per month and votes once has as much power as someone online 24/7. Users would also be able to figure out how the system works if scores are visible in some way.
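A naive sketch of that weighting, with illustrative numbers only:

```
# Each user has a fixed budget of 1,000 "points" spread evenly over however
# many votes they cast: vote weight = 1000 / total_votes_by_that_user.
BUDGET = 1000

def vote_weight(total_votes_by_user):
    return BUDGET / max(total_votes_by_user, 1)

print(vote_weight(1))          # 1000.0 -> one very heavy vote
print(vote_weight(1000))       # 1.0
print(vote_weight(1_000_000))  # 0.001  -> spam votes become nearly worthless
```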

The specific use I played with this for (but never got around to finishing the project) was for movie reviews. There are people who will rate 50% of movies 5 stars and 50% of movies 1 star, but never in between. And some people are much more selective and may give a 5-star rating to only a few movies ever. A 5-star rating from both of these users should not be valued the same.


If people are upvoting "the right things" then limiting votes is counter-productive. If people are upvoting cat-pictures, then the problem is that the site is not selecting the right group of users and better filters are needed so that the community has the desired structure.


Beyond the concerns you raise, vote caps etc. inhibit problematic behaviors mechanistically rather than via better social structures. Limiting the number of posts a person can make per day is one way of addressing potential flame wars when the underlying problem is that flaming is deemed acceptable behavior [and consequently such sites attract people who engage in flaming and encourage some people to post in ways that get all their digs in at once].

The alternative of building community norms is harder. Mechanism can support it, but success depends on the over-arching social structure.


> I do. The user will think "Well, given that I only have X to spend, I'd better be careful what I vote for." That will lead to fewer votes being placed, and thus reduced interaction / community activity.

It's worth pointing out that Slashdot has run successfully for decades with users getting limited moderation points.


Slashdot's moderation applies to comments only, not stories, and leaves a great deal of good discussion on the floor.

And other issues: https://news.ycombinator.com/item?id=9854240


I think sites like reddit just want engagement, even over quality. It's been known forever that reddit's algorithm leaves much to be desired but it's too risky to change.

E.g. the following (since-patched) bug: http://technotes.iangreenleaf.com/posts/2013-12-09-reddits-e...


I find this to be a strange comment. Quadratic voting is much better if we have to reach a consensus and want to protect minorities, so it has its applications (and is woefully under-used). But surely it's not always the best method. I think you often want the most popular comment at the top of the list, because it's what appeals to the most people.

Secondly, one of the big problems that quadratic voting addresses is when an ambivalent majority outweighs a minority simply because everyone votes. But not everyone has to vote on every comment on a site like reddit, so the ambivalent majority is already ignored.


It might work if there were a nominal cost to creating an account.

It would be better to take the recommendation engine of Netflix, throw it on top of Reddit, and charge a fee for tailored recommendations. Now that is something I would pay money for.


Gresham's Law.

Fundamentally: assessing quality of complex products, including information goods, is hard.


Wrong solution #1 sounds like it could work quite well for UrbanDictionary, since it would tend to reward posts that have a lot of engagement. It's probably a good solution for a lot of sites.


The problem here is feedback: the higher-rated posts get shown more, so more people see them, so they get rated higher still. That opens up a whole extra can of worms you don't want to deal with.


This is exactly true. You gotta balance freshness, quality, and uncertainty.


I agree. It's a good solution for all cases where the intent is to have a negative vote exactly cancel out a positive.

The method in the article combines a quality rating with a quantity rating, but it's a bit unwieldy and difficult to tune intuitively. It seems to me that for a lot of purposes you might get a sufficiently similar effect by using method #1 and then multiplying the result by a sigmoid function applied to the ratio. The advantage of this would be that the only magic numbers in the formula are tuning factors you put in yourself.

This seems more appealing to me than "((positive + 1.9208) / (positive + negative) - 1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604) / (positive + negative)) / (1 + 3.8416 / (positive + negative))".
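For reference, that quoted expression is the lower bound of the Wilson score interval with z = 1.96 (so z² = 3.8416, z²/2 = 1.9208, z²/4 = 0.9604). Below is a sketch of both it and the sigmoid alternative, assuming "the ratio" means positive / (positive + negative); the centring and steepness are arbitrary tuning choices, not anything from the comment:

```
from math import sqrt, exp

def wilson_lower_bound(positive, negative, z=1.96):
    # Lower bound of the Wilson score interval for the positive fraction.
    n = positive + negative
    if n == 0:
        return 0.0
    phat = positive / n
    return ((phat + z*z/(2*n) - z * sqrt((phat*(1 - phat) + z*z/(4*n)) / n))
            / (1 + z*z/n))

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def sigmoid_score(positive, negative, steepness=10.0):
    # Method #1 (net votes) multiplied by a sigmoid of the positive ratio;
    # centring at 0.5 and the steepness are made-up tuning factors.
    n = positive + negative
    ratio = positive / n if n else 0.5
    return (positive - negative) * sigmoid(steepness * (ratio - 0.5))

# Different scales; only the ordering within each scheme matters.
print(wilson_lower_bound(600, 400))   # ~0.57
print(sigmoid_score(600, 400))        # ~146
```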


YMMV, but I'd rather use an elementary and mathematically sound statistical technique than MacGyver something myself. (Though I do understand that, as some people describe in this thread, there can be different purposes to ratings and hence a need for different sorting mechanisms.)

You're right that confidence intervals depend on both quality and quantity, but the reason for this is to account for uncertainty. As n goes up, the standard error goes down to practically 0, and so e.g. a 4.2/5 movie with 100 reviews is still likely to be sorted higher than a 4.1 movie with 200 reviews. Quantity only comes into play when there is very little information to go on, after that quality becomes the driving factor.
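A back-of-the-envelope check of that example, assuming (purely for illustration) a standard deviation of 1.0 star for both movies:

```
# Lower bound of the 95% normal CI on the mean; the sd of 1.0 is an
# arbitrary assumption for this illustration.
from math import sqrt

def lower_bound(mean, sd, n, z=1.96):
    return mean - z * sd / sqrt(n)

print(lower_bound(4.2, 1.0, 100))   # ~4.004
print(lower_bound(4.1, 1.0, 200))   # ~3.961 -> the 4.2 movie still ranks higher
```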


Using internet points this way is not mathematically sound.

You're using a highly biased sample, to begin with. The mathematics here start with the assumption that you have a random sample. You don't. The assumption is invalid; and there's no reason to believe this calculation is a good one compared with any other.

A particular type of person votes on particular things: it's commonly observed that new movies rate higher than they should on IMDB, because super-fans are the first to vote. (This is convenient for your example)


Inappropriate for obtaining unbiased confidence intervals, yes, but this doesn't matter for ranking if bias is uniform across everything that's being rated.

You make a good point that there might be differential bias depending on when the movie came out, but I don't think the solution is then to say "well, now all bets are off, might as well concoct our own techniques and assume they're just as good or better." Statistical techniques are not either fully valid or fully invalid. Simulate the bias and look at exactly how it influences the results of a particular technique, and then perhaps use that to make an adjustment based on data rather than intuition.


It's not about MacGyvering something; it's about reflecting on how well different solutions stack up against what you actually want to do. There is no one-size-fits-all ranking formula, at least in my opinion.


I’ve also seen a variation of this that gives different weights to positive and negative votes, the theory being that the likelihood of you actually voting isn't the same when you like something as when you don't.


This is true, especially for things like app store ratings - people are much more likely to leave a rating if they really don't like something. Or if they really do like something for that matter.

Nowadays some apps more actively encourage users to leave a rating, though. I was under the impression that Apple did not approve of those practices, although I might be confusing that with another rule stating that apps should not suggest 5-star ratings.

The Play Store, iirc, and probably Steam too, have a smart algorithm that displays the 'most helpful' reviews at the top, so that as a potential customer you can determine the pros and cons of the app according to other users. Ratings are often very polarized - like up/downvotes.


Absolutely agree. The UrbanDictionary ordering seems nearly perfect for most internet purposes.

If this article wants to make its point, it should show cases where its ordering differs from UrbanDictionary.



It was also discussed about 3 years ago: https://news.ycombinator.com/item?id=3792627

Some of the comments from that posting give concrete examples where the formula fails. Such as: an item with 1000 upvotes and 2000 downvotes will get ranked above one with 1 upvote and 2 downvotes. This is because the formula uses the lower bound of the Wilson interval.
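A quick numeric check of that example with the Wilson lower bound at z = 1.96:

```
# The large sample pins the positive fraction close to 1/3; the tiny one
# could still plausibly be much lower, so its lower bound is far smaller.
from math import sqrt

def wilson_lower_bound(positive, negative, z=1.96):
    n = positive + negative
    if n == 0:
        return 0.0
    phat = positive / n
    return ((phat + z*z/(2*n) - z * sqrt((phat*(1 - phat) + z*z/(4*n)) / n))
            / (1 + z*z/n))

print(wilson_lower_bound(1000, 2000))   # ~0.317
print(wilson_lower_bound(1, 2))         # ~0.061 -> ranked below, as described
```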


I'm ranking movies by critics' ratings. Most of them have too few ratings, so naive Bayesian ranking by the average does not work. IMDB gets away with it, but I cannot. And you should be able to see a good preview of the expected ranking even with low numbers.

So you need to check the confidence interval with Wilson, but you also need to check the quality of the reviewer. There are some in the 90% range, but there are also often outliers, i.e. extreme ratings. Mostly French, btw.

I updated the C and Perl versions, compiled and pure Perl, here: https://github.com/rurban/confidence_interval


The first two points are great, but why then do we see this:

"Given the ratings I have, there is a 95% chance that the "real" fraction of positive ratings is at least what?"

What normal person thinks in terms of confidence intervals?

The obvious answer is that people want the product with the highest "real" rating, that is, the rating the product would get if it had arbitrarily many ratings.

To get this, you just take the mean of your posterior probability distribution. For just positive and negative reviews that's basically (positive+a)/(total+b), where a and b depend on your prior.
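A sketch of that posterior-mean ranking, writing the prior as a Beta(a, b) (so the "b" in the parent's form corresponds to a + b here); a = b = 1 is a uniform prior, i.e. Laplace's rule of succession:

```
# Posterior mean of the positive-rating probability under a Beta(a, b) prior.
def posterior_mean(positive, negative, a=1, b=1):
    return (positive + a) / (positive + negative + a + b)

print(posterior_mean(0, 0))      # 0.5   -> a product with no reviews
print(posterior_mean(1, 0))      # 0.667
print(posterior_mean(47, 3))     # ~0.923
```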

His proposal would mean that a product with zero reviews would be rated below a product with 1 positive review. This may deal with spam and vote manipulation since things with less information are penalized more but that is a separate issue.


I have always wondered what Amazon was thinking with that way of sorting. Perhaps it's a deliberate way to spread purchases out over a range of products instead of just the top two?


I think it's about product discovery. If we always sort this way, new products don't have a chance.

And I don't think Amazon would sort like this; it would make more sense for them to use the HN/Reddit way of sorting, which gives new items a chance to get to the top.


There is a much simpler and more elegant method: just rank posts by their probability of getting an upvote, which is (upvotes+1)/(upvotes+downvotes+2).


This gives an advantage to new posts for which the probability is much more uncertain: it's easier to get 1 upvote and 0 downvotes (rank 2/3) than to get 1999 upvotes and 999 downvotes (also rank 2/3). Maybe that's what you want, but the post is exactly about those cases when this is not what you want.
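The two cases above under the rule-of-succession estimate, and how the article's Wilson lower bound would separate the tie (a small self-contained check):

```
from math import sqrt

def laplace(up, down):
    return (up + 1) / (up + down + 2)

def wilson_lower(up, down, z=1.96):
    n = up + down
    p = up / n
    return (p + z*z/(2*n) - z*sqrt((p*(1-p) + z*z/(4*n))/n)) / (1 + z*z/n)

print(laplace(1, 0), laplace(1999, 999))            # 0.667 vs 0.667 -> a tie
print(wilson_lower(1, 0), wilson_lower(1999, 999))  # ~0.21 vs ~0.65 -> no tie
```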


New posts frequently start with a disadvantage in existing systems, though. Temporarily biasing them favourably increases the odds of any moderation at all. Alternatively, you could present them only to a subset of the readership. I've suggested this as a solution to HN's new-submissions queue problem.

Increase the presentation as ratings increase.


You can of course use a better prior if you don't think new posts are 50% likely to receive an upvote next. Otherwise I don't see how that behavior is undesirable or against the goal of the article.


How about fn(upvotes, downvotes)/fn(pageviews, prominence)?


That is true for a uniform prior but not a general prior.


Interestingly, I _think_ the Reddit algorithm basically makes this mistake too -- although embedded in a more complicated algorithm that combines it with 'newest first', adjusted by positives minus negatives.

I don't think the HN algorithm is public, but wouldn't be surprised if it does the same.

Perhaps the generally much smaller number of 'votes' on a HN/reddit post makes it less significant.


For posts, I'm not sure what the algorithm is (I think it's deliberately more complicated, and has to take into account time of posting?), but after this article [the op] was written, reddit implemented the method for comments, as explained by Randall Munroe: http://www.redditblog.com/2009/10/reddits-new-comment-sortin...

You only get this ranking method if you sort the comments by 'best', though.


Not the default 'top' though, I think.


Awesome if you are using only "up" and "down" ...


It should work for star ratings, and generalize to non-discrete rating systems too.


How well does this work when you don't have a binary (+/-) rating system, but a multi-valued one (1 - 5 stars) ?


See the discussion elsewhere in this thread for confidence intervals in multinomial distributions. https://news.ycombinator.com/reply?id=9856607&goto=item%3Fid...


You could replace p/n in the formula with the average star rating and still get a reasonable ranking order.


First off, for anyone looking at web reputation systems, read the book on the subject: Randy Farmer and Bryce Glass, Building Web Reputation Systems:

Book: http://shop.oreilly.com/product/9780596159801.do

Wiki: http://buildingreputation.com/doku.php

Blog: http://buildingreputation.com/

I can pretty much guarantee there are elements of this you're not considering which are addressed there (though there are also elements which Farmer and Glass don't hit either). But it's an excellent foundation.

Second: If you're going to have a quality classification system, you need to determine what you are ranking for. As the Cheshire Cat said, if you don't know where you're going, it doesn't much matter how you get there. Rating for popularity, sales revenue maximization, quality or truth, optimal experience, ideological purity, etc., are all different.

Beyond that I've compiled some thoughts of my own from 20+ years of using (and occasionally building) reputation systems myself:

"Content rating, moderation, and ranking systems: some non-brief thoughts" http://redd.it/28jfk4

⚫ Long version: Moderation, Quality Assessment, & Reporting are Hard

⚫ Simple vote counts or sums are largely meaningless.

⚫ Indicating levels of agreement / disagreement can be useful.

⚫ Likert scale moderation can be useful.

⚫ There's a single-metric rating that combines many of these fairly well -- yes, Evan Miller's lower-bound Wilson score.

⚫ Rating for "popularity" vs. "truth" is very, very different.

⚫ Reporting independent statistics for popularity (n), rating (mean), and variance or controversiality (standard deviation) is more informative than a single statistic.

⚫ Indirect quality measures also matter. I should add: a LOT.

⚫ There almost certainly isn't a single "best" ranking. Fuzzing scores with randomness can help.

⚫ Not all rating actions are equally valuable. Not everyone's ratings carry the same weight.

⚫ There are things which don't work well.

Showing scores and score components can be counterproductive and leads to various perverse incentives.

I'm also increasingly leaning toward a multi-part system, one which rates:

1. Overall favorability.

2. Any flaggable aspects. Ultimately, "ToS" is probably the best bucket, comprising spam, harassment, illegal activity, NSFW/NSFL content (or improperly labeled same), etc.

3. A truth or validity rating. Likely rolled up in #2, but worth mentioning separately.

4. Long-term author reputation.

There's also the general problem associated with Gresham's Law, which I'm increasingly convinced is a general and quite serious challenge to market-based and popularity-based systems. Assessment of complex products, especially information products, is difficult, which is to say, expensive.

I'm increasingly in favour of presenting newer / unrated content to subsets of the total audience, and increasing its reach as positive approval rolls in. This seems like a behavior HN's "New" page could benefit from. Decrease the exposure for any one rater, but spread ratings over more submissions, for longer.

And there are other problems. Limiting individuals to a single vote (or negating the negative effects of vote gaming) is key. Watching the watchmen. Regression toward mean intelligence / content. The "evaporative cooling" effect (http://blog.bumblebeelabs.com/social-software-sundays-2-the-...).


The most important criterion for sorting should be 'engagement'.


That's a useful element to consider, and I do favour implicit ranking inputs, but "best" requires you know your goal. What are you selecting for?



