Elo sucks – better multiplayer rating systems for smaller games (2019) (medium.com/acolytefight)
161 points by brownbat on July 21, 2020 | 155 comments



The author didn't benchmark to see whether this system is actually any better at predicting outcomes than vanilla Elo. That's how you determine whether your implied win probabilities are accurately derived from rating differences. The author seems to be under the impression that there's something fixed and concrete about an 1800 rating, but when you change the system, you also change what an 1800 rating means in the first place.
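
A benchmark like that is cheap to run. Here's a minimal sketch (Python; the names and data layout are my own assumptions, not anything from the article): score each system's implied win probabilities against the same held-out games with log loss, and the lower score wins.

```python
import math

def elo_win_prob(rating_diff, scale=400.0):
    # Elo's implied probability that the player with +rating_diff wins.
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / scale))

def log_loss(probs, outcomes):
    # Mean negative log-likelihood; lower = better-calibrated predictions.
    eps = 1e-12  # guard against log(0)
    return -sum(o * math.log(max(p, eps)) + (1 - o) * math.log(max(1 - p, eps))
                for p, o in zip(probs, outcomes)) / len(outcomes)

# matches: (rating_diff_before_game, outcome) pairs, outcome 1 if the
# higher-rated side won; run both systems over the same held-out games.
def benchmark(matches, win_prob_fn):
    return log_loss([win_prob_fn(d) for d, _ in matches],
                    [o for _, o in matches])
```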

Some of these complaints are solved by existing systems, namely Glicko. For example, rating deviation helps with experienced players (low RD) losing points to newer players (high RD). It also has a built-in way to discourage inactivity: players' RDs increase over periods of inactivity, so they can be excluded from the leaderboard after reaching a certain point. That allows us to maintain their rating without decreasing it. After all, that's our best guess of the player's skill; it's just a less reliable guess over time.
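
A minimal sketch of that inactivity mechanic (step 1 of Glickman's Glicko paper; c = 34.6 is the example value from that paper, and the leaderboard threshold is my own assumption):

```python
import math

def inflate_rd(rd, periods_inactive, c=34.6, rd_max=350.0):
    # Glicko step 1: uncertainty grows with inactivity, capped at the
    # RD assigned to a brand-new player.
    return min(math.sqrt(rd ** 2 + c ** 2 * periods_inactive), rd_max)

# An established player (RD 50) idle for 20 rating periods: the rating
# itself stays put, but confidence in it decays, so the player can be
# hidden from the leaderboard once RD crosses some threshold.
print(inflate_rd(50, 20))  # ~162
```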


This game has had four rating systems, including Glicko and TrueSkill, and we received lots and lots of complaints about both of those. This new system receives few complaints, tested across 135,000 players. If the players had not complained so much, we would still be on Glicko. Those are the facts. The theories as to why that is are up to you.


Optimizing a rating system for minimal complaints, maximum player engagement, or some similar metric is of course totally valid. It reminds me of Sirlin's story of being hired to design a rating system for Starcraft 2, and optimizing for totally different things than Blizzard wanted [0].

If the author (you?) had just described it in those terms, it'd be hard to object. But the article goes further, and makes claims about the system being more accurate due to a different rating curve. That's the claim that would need to be justified by actually comparing whether the predictions the new system makes really are better.

[0] http://sirlingames.squarespace.com/blog/2010/7/24/analyzing-...


Part of a matchmaking algorithm that increases user engagement is telling a story about how it's more fair though.

Like, our tolerance for losing is acquired. Most normal people who lose in League for the first time stop playing, usually forever. Just randomly visit your friends' match histories in League: frequent players have many days of long losing streaks.

If you're just conditioned to play despite losing, great, in a Darwinian way (surviving, being around to be measured) you will be representative of the average player in League. And there are so many League players with such long retention you cannot possibly argue that skill-based matchmaking is the core component of user engagement.

His dataset is interesting because it will necessarily overrepresent people who kept playing despite the old system. That sort of refutes its importance - I mean sure people complain but they keep playing, so was it really that important? So what if complaints go down?

Those are important goals, and also, it's still an interesting twist in multiplayer game design. You just gotta interpret it as a commentary on a whole system even if it doesn't narrowly talk about a scientific objective like performance prediction.


Player retention was significantly lower before the new rating system was introduced. It wasn't just complaints; complaints are just a more direct metric, because multiple changes may have affected retention.


The funny thing about SC2 is that the player's MMR (matchmaking rating) is decoupled from rank, due to design decisions such as demotions not occurring midseason. So a gold league player with a low enough MMR may get matched against bronze league players, despite ostensibly being ranked higher. Actually for the longest time after release, player MMR was not visible. It took 6 years and 2 expansions before it was finally displayed in-game.


I like the league system in SC2; it allows you to see progress and to get to know other people's styles (because you meet the same players several times in quick succession). People build narratives around that (I remember that guy, I lost to him last time, and now I got my revenge).

If somebody won against you with a particularly dirty strategy you can get back at him and you won't be fooled the next time.

It's better than the Dota system, where you're most likely never going to play against the same people again (or at least it feels like that).

Also, the whole 5v5, hour-long game format is inherently very toxic: most of the time you know you've lost after 10 minutes, and yet you can't disconnect and have to keep playing for the next 30-50 minutes while everybody blames everybody else on your team.

In SC2, if you know you've lost after 5 minutes, it's disrespectful to continue playing :)


Oh, I love those guys who "know you lost after 10 minutes" in Dota...


So there's a 10% chance I'm wrong. That's a 10% chance of losing some MMR vs a 90% chance of losing 30 minutes of my life. I don't have much time to play, so in SC2 I would just go to the next game.

Sadly in a team game you can't make that choice because disconnecting too much will prevent you from playing all the fun modes (and will put you into a low priority queue where you play with people who constantly insult everybody or kill their own allies).

That's the root cause of all the toxicity in dota.


I've played Dota for way more hours than I want to admit, and unless you're playing at the highest of levels (Immortal), where people have a very strong feel for their win conditions and for when those become very hard to reach, I call BS on your statement. There are very strong reasons why pro players almost never call GG before at least one rax is down, and often wait to do so until after at least one last hail-mary fight (it depends a bit on the patch, though).

The chance that the enemy throws a fight or makes other mistakes that open up new win conditions (or sometimes that simply one guy on the other team tilts and throws) is just too high in Dota to be able to call it quits after 10 minutes.

One of the main reasons for the toxicity in Dota in my opinion is that people don't understand that there are 10 variables in the field, but they can only control one of them. Don't even bother trying to influence the other 9 (apart from positive feedback/encouragement/communication to the team), it's a waste of time and only contributes to the toxicity.


I'm at 2.5k, almost always playing as pos 5 winter wyvern. I'm under no impression that I'm great at dota :)

At this level, with the same position, and first-picking WW, I get a small subset of heroes I play with and against, so it's much easier to call. I have Pudge and Sniper every other game, for example, and the enemy pos 5 is very often CM or Ogre :) It's a great day when I get to play vs Meepo or Oracle, but it almost never happens. The best I can hope for is Axe or Legion.

Also half the fancy strats don't work because people don't talk. Split-pushing is just split-feeding. Nobody does 3-lanes. Jungling is a last resort of carry that lost the lane and when he's fat the barracks are already destroyed.

So if we lost laning hard enough and have earlier-game cores than the enemy - we lost, it just takes a lot of time to play out.

But even if I was wrong 30% of the time instead of 10%, I'd still happily take that trade in MMR if I could. Obviously pro players have different motivations, so for them it's "never surrender", but almost nobody is a pro.


Not to mention that the bonus pool feature has long been removed.

More importantly, though, your MMR has now been split up across the three factions that you can play as. If your Terran gameplay is much better than your Zerg gameplay, you'll be facing higher-ranked opponents when you play Terran than when you play Zerg.

This has made it easier for players to switch their factions up, without being punished for it.


What you said makes me wonder about a totally different way of using a metric of how likely the person is to win a given match.

That is, perhaps the system could be engineered to maintain a more even win/loss ratio so that people don't go on super-long win (or loss) streaks in general by adjusting who they get matched with.

It probably wouldn't work that well towards the edges, but around the middle it might work well enough.


This is what Dota 2 does. Every win/loss is worth the same points, the games are extremely close in terms of player rank, and since it's a team game you get more variance than just your individual skill. Players eventually settle close to a 50 percent win rate.


The "different rating curve" appears to be the actual historical probability data, not a formula. I think. If that's right then it this estimate of the probability of winning is not a new discovery.

On the other hand I suspect the historical data really is the "best fit" to the historical data.


You have to decide what the purpose of the rating system is. Using it as a reward system for players to feel accomplishment is a different use case than trying to correctly predict the likely outcome of a game.

Personally, if I was designing a rating system, I would use two separate systems.

One would be like the one in this game; publicly viewable, pleases the players, and gives a sense of accomplishment.

Then, I would have a second, internal only rating system that players can't see but is used for matchmaking to make sure people are matched up to players with as close to equivalent skill as possible.


I find similar solutions experientially two-faced and frustrating. Imagine the matchmaking engine was a person behind a desk that you interact with. If he consistently told you one thing and then did something else you'd be displeased.


Some games do this already, and players are very displeased by the results.


Probably because they get matched to tougher opponents if they're better?


It's kinda disheartening being matched to plat players in your silver promos.


That's what xp is in sc2. For several years xp was the only number that was visible - your true mmr was hidden.

People knew this and ignored xp altogether.


My understanding was that the system consists of using the historical odds of winning (given the rating difference). If you benchmark that using only past data, I think it is by definition the most accurate system. (The data is always a better fit to itself than a theoretical fit is.)

Naturally, future data is much harder to deal with than past data. But even for future data, it's not obvious that Elo (or any other theoretical fit to the odds of winning) will be more accurate than the historical odds.
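
A sketch of what "using the historical odds" could look like in practice, assuming matches are stored as (rating difference, outcome) pairs; this is illustrative, not the article's actual code:

```python
from collections import defaultdict

def build_win_table(matches, bucket=25):
    # matches: (rating_diff, won) pairs from a rolling window of past games,
    # won = 1 if the higher-rated player won. Returns the empirical P(win)
    # per rating-difference bucket.
    wins, totals = defaultdict(int), defaultdict(int)
    for diff, won in matches:
        b = int(diff // bucket)
        totals[b] += 1
        wins[b] += won
    return {b: wins[b] / totals[b] for b in totals}

def historical_win_prob(table, rating_diff, bucket=25, default=0.5):
    # Look up the observed win rate for this rating gap; fall back to 50%
    # for gaps never seen in the window.
    return table.get(int(rating_diff // bucket), default)
```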


Yes, the best fit for the data is the data itself, it's a tautology. Nothing wrong with Elo's exponential curve, it just can't beat the actual data.

You raise a good point in that I could've created a training set and a test set, that probably would be a better validation. But I don't know, I'm not doing science, I'm making a game.

On the topic of whether the future matches the past: the predictions were based on a rolling database of the past 100,000 matches, which is approximately the number of matches played every 7 days. So my theory is that the data is quite recent and up-to-date and so should match, in general.

Of course, I never tested this. In the end, I'm not doing science, I'm making a game. If retention goes up and complaints go down, then I can't keep working on the rating system; there are 1,000 other things to do.


Yeah, I'm not giving advice on how you should do it. I was just unsure whether critics here had understood that measured data is probably better than any theoretical fit, even the revered Elo.


> I think it is by definition the most accurate system

By gum, an opportunity to quibble semantics on the internet. That is true if 'benchmark using' means 'only admit to knowing' and 'accuracy' means 'must be numerically quantifiable given existing data'. It is false otherwise, especially if accuracy means 'conforming to truth' and we have a model for how the numbers are being generated.

Obviously if I generate a set of numbers by sampling a normal distribution then the most accurate model is a normal distribution, no matter what empirical data I use for benchmarking.

That is to say, if we know how the data was generated (sans noise) we can reject empirical distributions as the most accurate, because we can directly know the distribution of the data.


Ok, that is a legitimate ... quibble. Let's assume that we don't already know the correct distribution. In that case we're going to judge each theoretical fit by how close it comes to the historical data. (Or else we're going to get that wrong, which is another common approach.) Elo is much more prestigious and credible than some guy who made a game, but it is less credible than the data, for some number of data points N. (Although I think a theory can be more prestigious than data almost independent of N.)


Well, it’s like the question of what is better: a restaurant with 4.5 stars on 4 reviews or one with 4.2 stars on 1,500 reviews?


Sure. If there's enough data, then the data becomes more credible than even the most popular theoretical fit. If I have four games I played with my nephew, then people should probably go with Elo.


Is the point really predicting outcomes? FIDE (chess) Elo is useful because I can compare machines to humans who have never matched each other.

Generally speaking the "rating structure" is a lattice where you can, for any two players A and B, tell whether A is a better player than B or the other way around. Elo, Glicko, etc. are embeddings of this lattice on the real line (much like the utility functions of microeconomics are real embeddings of preference lattices).


> Generally speaking the "rating structure" is a lattice where you can, for any two players A and B, tell whether A is a better player than B or the other way around.

Not really. You tend to have cycles, where person A can beat person B who beats person C who beats person A.

There was a guy who was even with Go players 2 or 3 stones weaker than me, but who would tend to beat me because of some of the unorthodox things he did. (Eventually I strengthened my game against those things.)

Considering ratings to be a total ordering is a useful approximation.


> Is the point really predicting outcomes?

Others have pointed out how there is a psychological aspect of rating systems, and no developer wants to constantly field complaints. That said, I believe the answer is yes. A rating system derives meaningfulness from its predictive power. In other words, people want to know how good they actually are compared to one another.


I think that for most game players, outside of the top-N group who just want to be at the head of the list, rankings are largely a mechanism to facilitate playing good games, where good is generally defined as close games where both players feel like they could have won.

There's an interesting question about how you rate players who use fundamentally different strategies. For instance in RTS games, should you match boom vs blitz players of otherwise equivalent elo? Or should you instead (try to) construct a classifier to determine which type of player someone is, and then have a rank against each other type of player and match them according to that rank?


As long as you let players play several games vs each other it's fine. If you met this cannon rushing guy already you'll know what to expect.

That's why people use barcode names (llll1l1l1l1l111l) in starcraft :)


For a lot of online games, I think matchmaking based on pushing you towards a 50/50 win rate is kind of missing the point of games. It gives you fair odds of winning, but it doesn't necessarily give you even odds of having a fun or competitive game.

At high skill levels players are skilled enough that it might, but in most online multiplayer games the overwhelming majority of players are lacking in basic fundamentals to varying degrees. At that level, Elo-based matchmaking mostly just results in one person getting rolled or doing the rolling. They're not really competitive games, in my experience.


If two players of similar skill generally roll one another, that's a game design problem, not a rankings problem.


Elo is great for what it was built for: ranking chess players. Chess is (1) extremely low-variance, (2) has an extremely high skill ceiling, and (3) is 1-on-1. Elo works great for chess, but it would never work for something like Poker. Let's briefly go over these three points.

Most games aren't chess -- where the only variance is picking who's black and who's white -- in fact, they might include dozens of RNG mechanics (from critical strikes to ability rolls, to spawn points). These mechanics (while fun and well-designed) might pollute your "idealized" model. There's also the problem of RPS (rock-paper-scissors) mechanics or pick-counter-pick mechanics which will also heavily skew win rates. For instance, given a slow combo Magic deck, you will most likely auto-concede to mono red aggro (regardless of skill level). If you're using Elo, this will pollute your model. (Hint: you shouldn't be using Elo.)

Most games also don't have chess' high skill ceiling. Chess has such a high skill ceiling for a number of reasons -- it's one of the oldest games still being actively played, for one. Suppose your "game" is simply the flip of a coin (everyone wins 50% of the time). Zero skill involved. Trying to model win-loss-ratios using a sigmoid curve is silly. Obviously, no game is going to be a coin flip, but there's a world of difference between chess and DOTA.

TrueSkill attempts to fix (3) by using clever Bayesian updating on a player-by-player basis[1], but in reality, it's a shit-show. Using Elo (or variants thereof) for team-based games where the team isn't really a team (more like 3-5 random people plopped together for one match) is incredibly misguided, but continues to be implemented in just about every modern multiplayer game (to the players' frustration). Of course, mixing and matching pre-made groups with non pre-made groups creates as many issues as you might imagine.

In short, it's a bit bizarre that so many game devs are enamored with Elo when it comes to ranking.

[1] https://www.microsoft.com/en-us/research/wp-content/uploads/...
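
For reference in this subthread, vanilla Elo is small enough to state in full; the 400-point scale and K = 32 are the conventional chess defaults:

```python
def expected_score(r_a, r_b, scale=400.0):
    # Probability the Elo model assigns to A beating B.
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))

def elo_update(r_a, r_b, score_a, k=32.0):
    # One game: score_a is 1 for a win, 0.5 for a draw, 0 for a loss.
    # The update is zero-sum: A gains exactly what B loses.
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta
```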


My wife was a champion table tennis player. This sport uses Elo as well, and I know from watching the sport over time that the rating system has real problems. It doesn't suffer from the weaknesses that you cite, but even so, the problem of "rating inflation" is widely discussed.

It seems that much of the problem comes from rating points brought in by newbie players (and note that, contra TFA, the problem isn't with experienced players losing to newbies, but the opposite).

A newbie is started off with some nominal rating; I forget the number, but let's say it's 800. Most likely that newbie is going to lose his first matches, and some proportion of those newbies will get frustrated and quit. For the ones that stay in the game, things probably work out in the long run. But for those that got discouraged and quit, in the course of their loss they caused a few points (not many, because they're likely way overmatched, but definitely more than 0) to be credited to their opponents. When they quit the sport, they're never going to reclaim any of the rating points that they lost initially. But those points are still in the system, having been added to their winning opponents.

It's hard to quantify because the Elo system is the only objective comparison we have, but over the course of the almost 30 years I've been watching my wife play, the Elo rating enjoyed by a player of a given hypothetical skill level has increased dramatically. Many are saying that for someone of the upper echelons, their rating is maybe 200 points higher than it would have been 30 years ago.

So back in 1991, my wife was in the top 30 women in the USA with a rating in the mid-1700s. Today, someone with that rating isn't even going to be in the top brackets of a serious tournament.

Despite all that, the usefulness of the rating system keeps it in use as a valuable tool. It seems that the ability to match players who have never seen each other before, ensuring interesting matches, is part of keeping the game competitive for those in it. And table tennis is also, because of this, one of what I believe are few sports where men and women often play head-to-head (even though men generally have much higher ratings, on account of the sport requiring far more strength than you might suspect).
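
A toy simulation of the quitting-newbie mechanism described above (all parameters are made up; the point is only that selective attrition of losers leaves their lost points behind in the pool):

```python
import random

def inflation_from_quitters(n_seasons=50, cohort=100, start=800.0, k=32):
    # Toy model: each season a cohort of newbies enters at `start`, plays a
    # few games against the established pool, and anyone who ends up below
    # their starting rating quits, never reclaiming the points they lost.
    pool = [start] * cohort  # founding players
    for _ in range(n_seasons):
        newbies = [start] * cohort
        for i in range(cohort):
            for _ in range(5):  # a few games vs random pool members
                j = random.randrange(len(pool))
                e = 1 / (1 + 10 ** ((pool[j] - newbies[i]) / 400))
                s = 1.0 if random.random() < e else 0.0  # model-consistent result
                newbies[i] += k * (s - e)
                pool[j] -= k * (s - e)
        pool += [r for r in newbies if r >= start]  # only non-losers stay
    return sum(pool) / len(pool)

print(inflation_from_quitters())  # mean rating of survivors drifts above 800
```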


I don't think there's an expectation that a skill rating is comparable throughout 20 years, because both individual players and how the game is played (the meta) changes continuously.

But if that's true, then why would rating inflation be a problem?


The game itself has not changed, so it still makes sense to compare players across time. It would be nice if we had a quantitative way of doing this, so we could make statements like 'the average professional player today is better than 20 years ago; a typical modern pro would win 60% of the time against one from 20 years ago'.

In some sense, it is not surprising that we do not have a system that accomplishes this. Since it is impossible to see the results of a game between players living in different time periods, we cannot get any data to prevent drift. You can still try to normalize the rankings. However, unless you have some independent way of measuring skill, you would need to make an assumption about the relative strength of players. Assuming the average skill of a professional is constant across time is probably not accurate, but it's closer to reality than what you get with unchecked inflation.


You can sort of solve the inflation problem by z-scoring the Elo. Now a person's score will tell you how much better or worse they are than the median player, assuming an underlying normal distribution (reasonable).

Of course, scores will only be comparable if the average skill of all players remains constant. I would imagine this isn't true, but the drift over several decades is probably small.

Unless you start introducing some purely objective criteria for skill, which can never work, this is the best you can do. It's still way, way better than a straight Elo system, though.
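
A minimal sketch of the z-scoring idea, assuming you have the full list of current ratings:

```python
import statistics

def zscore_ratings(ratings):
    # Report each rating as standard deviations from the current pool mean,
    # so the number stays meaningful even as raw points inflate.
    mu = statistics.mean(ratings)
    sigma = statistics.stdev(ratings)
    return [(r - mu) / sigma for r in ratings]
```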


Rating distributions are often not normal because some subset of players study the game and take it more seriously resulting in a bimodal distribution. See [0] for an example in Chess.

[0] https://chess.stackexchange.com/questions/2550/whats-the-ave...


Even without the bimodality, you wouldn't expect a normal distribution of ratings.

1. Assume that chess ability is normally distributed in the population.

2. Assume that people who are terrible at chess are more likely to stop playing chess than people who are successful.

Then you've sampled the underlying normal distribution mostly from the top end, and that new, highly skewed distribution is what you'll see when you measure everyone's rating.


That's fascinating, thanks! It looks like you can model it as a mixture distribution made up of two underlying normal distributions.


The idea that chess has not changed in a long time is simply not true. Two huge and relatively recent changes were the addition of chess clocks and premoves.


And aside from the mechanics of how the game is played, there have been massive changes in the popularity of chess (first massively upwards, recently possibly down slightly), as well as how analyses are done.

It would be very difficult to account for these factors in a way that keeps comparisons across 30-year+ time spans meaningful.


The game itself has changed quite a bit, and the number of people playing it, and the dominance that top players achieve, have also gone up quite a bit.


This might not be great for a sporty-sport, but I think that for a video game this would actually be an advantage. This kind of a rating inflation would mean that long-term players would see some numerical progress without really doing much better.


It would also inflate the ratings of people new to the game later in time.


> A newbie is started off with some nominal rating; I forget the number, but let's say it's 800. Most likely that newbie is going to lose his first matches, and some proportion of those newbies will get frustrated and quit

That seems like a simple problem to fix. When somebody quits, just subtract 800 points from the remaining ranked players, scaled accordingly such that their relative win probabilities remain the same.

Of course, the other issue is if the number of active players increases over time. In that case, it's not so easy to fix unless you start scaling down the number of starting points given to new players.

Perhaps a better thing to do would be to construct a model of the rating inflation over time and use that to correct for historical comparisons. It's still not particularly meaningful though, because you have no way to measure actual skill inflation.


You don't have to formally quit the game to stop playing. I played one ranked chess tournament in high school, quit for ten years, and then picked it back up. What would you do with my points?

If you choose to delete them, that means that everyone will have constantly eroding ratings unless they keep playing.


Your points could be added back in when you resume playing. There’s no reason to throw the data away.


> It doesn't suffer from the weaknesses that you cite, but even so, the problem of "rating inflation" is widely discussed.

Ah yes! Inflation is also a problem I've seen in competitive online games. Rating inflation was a serious issue with World of Warcraft PvP arenas circa 10 years ago (iirc Blizzard hard capped arena ratings at 3000 during WotLK). I don't follow chess much, and I'm not exactly sure how chess avoids it (or even if it does).


By the point you're playing ranked matches in chess, you're generally invested enough to keep playing. However, chess has a (statistically) significant inflation problem, to the point where you can only compare scores within the same decade or so meaningfully.


It seems there was a lot of rating inflation in chess, but at the top level, at least, it's stopped - the number of players over 2700 has been pretty constant for 5-10 years, a few dozen players. In 1990, only Kasparov and Karpov were rated over 2700.

https://2700chess.com/

https://en.wikipedia.org/wiki/1990_in_chess


There's also an inherent deflation effect. Players tend to get better over time. In the simplest case, if we start with a pool of players rated 800 and let them play for a year, at the end they'll be better players but still rated 800 on average.

Most chess Elo systems have an inflationary component where young or new players (who are overall faster improvers than the player pool at large) gain and lose points faster than established players (in detail, either using performance ratings or increased k-factors or both). In a balanced rating system, the sources of inflation and deflation are roughly equal. You can tweak the parameters to keep it this way, though it's not trivial to tell whether there is "real" inflation over the years or whether players are simply playing better - or indeed, what's the difference.
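
As a concrete example of the increased-K mechanism, FIDE's published schedule is roughly the following (simplified from memory; treat the exact cutoffs as approximate):

```python
def fide_k_factor(games_played, rating):
    # Approximate FIDE schedule: new players move fast, established
    # strong players move slowly.
    if games_played < 30:
        return 40
    if rating >= 2400:
        return 10
    return 20
```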


Why don't they raise the bar for newbies to get into such a system?

If they know that some people just play a few games and then quit, they could hand out a rating only once someone has played for a specific amount of time or won at least n games, etc.


There is a minimum of 10 games before people start being ranked. People who quit early don't get ranked. People who have played 10 games gain a new long-term goal.


all of this, plus an additional observation that i've had about games w/ tiers/divisions: player skill is assumed to be normally distributed when that is just so demonstrably not the case-- there is a fairly high skill floor to be able to play the game at all, and the right tail (high skill) of the distribution is WAY fatter than the left.

especially with well-established, popular games-- Chess, League of Legends, Overwatch, etc. (where there is even a financial interest in being a top player to boot), the skill levels of the people at the absolute top simply profoundly dwarf players that would even superficially seem "comparable" by the standard of being in adjacent, or even within the same tier.

in League of Legends, for example, it is often claimed that the differences between players in high/low Challenger, high/low Master's, and even high/low "high diamond" (low d2 vs high d1) all constitute distinct "tiers" of player quality that are as substantial as the full-tier jumps closer to the median (e.g. silver -> gold, gold -> platinum), but because of this shoehorned prior about skill distribution it leads to this compression at the very top.


> player skill is assumed to be normally distributed when that is just so demonstrably not the case

This is close but not exactly right, and the small difference matters. Elo does not assume that skill is normally distributed, but rather that "quality of play" in a single game is normally distributed around some average quality level for the player. Obviously this too is an approximation but it's a much smaller one.


hmm, interesting. i did mean to say that this is a problem more in the context of games that add tiers/divisions to their ranked ladders, but i hadn't really thought about elo making assumptions about the normal-distributed-ness of player deviation from their "true" skill level. does that not just fall out directly from the central limit theorem (given large numbers of samples (game W/L observations vs. predicted P(win | my elo, their elo)) of means, etc.)?


“player skill is assumed to be normally distributed”

I would think player skill level (at best; there easily can be cases where P typically beats Q, Q beats R, and R beats P) is an ordinal (https://en.wikipedia.org/wiki/Ordinal_data), so one can't say “player P is twice as good as player Q” or “player P is as much better than player Q as player R is better than player S”, and certainly can't prove or disprove whether skill is normally distributed. It is customary to assume that, though.

Also, if one assigns numbers to skill levels, those can be normally distributed. It probably is possible to design an Elo-like system that, given enough games, guarantees that the set of skill level numbers of all players approaches a normal distribution.


Another thing to consider with a lot of these games is that they're not static. The game changes, and this can boost one player's rating up when their preferred champions/heroes/whatever are strong at that time. Even if the game didn't change, there are so many different characters that play differently enough that a player's results with them could end up at rather different ratings.


Here's season 13 of rocket league. Free red delicious apple for the first person to correctly identify the shape of the curve:

https://2p1ipt36o1g23g1tt6ba5nou-wpengine.netdna-ssl.com/wp-...


The X axis appears to be an ordinal number, some kind of proprietary rank. How much sense does it make to talk about the shape of a distribution over ordinal numbers? If we converted those to cardinals, the shape of the new, reality-based distribution could be pretty much anything.


To expand on this point, I recently made histograms of gaokao scores for several Chinese provinces. You can see Beijing data here: http://www.gaokao.com/e/20180623/5b2e13c43951c.shtml

This report is nicer than many others I found in that scores are reported all the way down to 0, rather than cutting off at the threshold for university admission. It's much less nice than some in that, below the university admission threshold, scores are reported in brackets of 10 points rather than by individual score. (This, even though the document is called "一分一段表"...)

Anyway, I imported this data into python and plotted it with matplotlib. The histogram you get from this is obviously, wildly flawed -- the ten-points-wide bar from 410 to 419 is also 710 people tall, dwarfing the actual mode of the distribution. To correct this problem, you need to divide the count for bracketed scores by 10 (the width of the bracket) -- the 710 people scoring 410-419 are 71 per score in that range, very comparable to the 70 people scoring 420, but not to the 182 people scoring 548.

Without knowing the width of the rocket league rank brackets, that picture of the population of each rank doesn't tell us anything -- at all -- about the shape of the distribution.
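
A sketch of the bracket-width correction described above, in matplotlib, using the three counts quoted in this comment as stand-in data (the full table is at the link):

```python
import matplotlib.pyplot as plt

# (lowest score in bracket, people in bracket, bracket width): the three
# data points quoted above, as stand-ins for the full table.
brackets = [(548, 182, 1), (420, 70, 1), (410, 710, 10)]

scores = [s for s, _, _ in brackets]
densities = [c / w for _, c, w in brackets]  # divide count by bracket width
widths = [w for _, _, w in brackets]

plt.bar(scores, densities, width=widths, align="edge")
plt.xlabel("gaokao score")
plt.ylabel("people per single score")
plt.show()
```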


Maybe someone who plays the game could clear it up. I did some searching, and it appears as though the ordinal ranks on the x axis are just buckets of players in ranges of 25 points in each bucket. The underlying rating system for these rating points is apparently something like Elo or Glicko, but I couldn't find a source that explicitly says what.


looks gamma, but just like... slightly gamma. i guess my contention would be that it "probably" should be smushed/redistributed with even more mass on the right tail, but i couldn't tell you from personal experience whether that's true, as i'm pretty terrible at rocket league. i will say that the top Rocket League players (to my untrained eye, and jaw on the absolute floor) may have an even higher z-score than top players in any other game(s)... but rocket league is kind of unique in its being a remake of a game that i guess a lot of (the same) people used to play.


> Most games aren't chess -- where the only variance is picking who's black and who's white -- in fact, they might include dozens of RNG mechanics (from critical strikes to ability rolls, to spawn points). These mechanics (while fun and well-designed) might pollute your "idealized" model. There's also the problem of RPS (rock-paper-scissors) mechanics or pick-counter-pick mechanics which will also heavily skew win rates. For instance, given a slow combo Magic deck, you will most likely auto-concede to mono red aggro (regardless of skill level). If you're using Elo, this will pollute your model. (Hint: you shouldn't be using Elo.)

None of which matters? All that means is that the results of individual games are a bit higher variance. Elo handles that by design. If you lose a certain proportion of Magic games to less-skilled players, then this should be considered a reflection of your skill, because the only reasonable definition of skill at the game is the rate at which you actually win it; anything else can be gamed and so should be ignored.

> Most games also don't have chess' high skill ceiling. Chess has such a high skill ceiling for a number of reasons -- it's one of the oldest games still being actively played, for one. Suppose your "game" is simply the flip of a coin (everyone wins 50% of the time). Zero skill involved. Trying to model win-loss-ratios using a sigmoid curve is silly. Obviously, no game is going to be a coin flip, but there's a world of difference between chess and DOTA.

That's also something that Elo handles just fine? If every game is a coin flip then everyone will end up with the same Elo. If player A has x more Elo points than player B, then they win y% of their games. If your game has a skill ceiling where even a complete beginner always wins, say, 20% of their games, then that just means no-one will ever be able to rise above a corresponding Elo rating.


> That's also something that Elo handles just fine? If every game is a coin flip then everyone will end up with the same Elo. If player A has x more Elo points than player B, then they win y% of their games. If your game has a skill ceiling where even a complete beginner always wins, say, 20% of their games, then that just means no-one will ever be able to rise above a corresponding Elo rating.

That's not how it works. The distribution you end up with will not be uniform, it will look like this (just ran Elo with a coinflip; 11 players, 1000 matches): https://imgur.com/9O82pRj

In the long term, I think this will tend to a geometric distribution with a low p value.


Show your working?

If you're matchmaking players against equally-ranked players, then each match is just +/- 50 points, and you'll get a binomial distribution which tends to normal as n gets large (assuming a large player pool, so each player's results are independent). If players play players with different ratings, then that will tend to push their rating back towards neutral. You certainly don't get a geometric distribution, because the rating algorithm is completely symmetric.


> each match is just +/- 50 points

This only happens in the rare cases where you're matching players against (exactly) equally-ranked players. You can mitigate this by always trying to match as "close as possible," but it's only a mitigation. Try simulating random matchmaking with Elo, and you'll get something like this: https://i.imgur.com/1Y08jUB.png (1000 players, 100,000 games). In my simulation, I set k (the Elo constant) = 50.

I think it's going to tend to a geometric distribution for reasons discussed here (which is another interesting and non-intuitive result): http://www.decisionsciencenews.com/2017/06/19/counterintuiti...
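
For anyone who wants to check this dispute themselves, a self-contained version of the simulation under discussion might look like the following (my sketch with random pairing and k = 50, not the commenter's actual code); plot the result as an ordinary histogram and judge the shape:

```python
import random
import matplotlib.pyplot as plt

def coinflip_elo(n_players=1000, n_games=100_000, k=50, start=1000.0):
    # Every game is a fair coin flip between two randomly chosen players;
    # ratings move by the standard (zero-sum) Elo rule.
    ratings = [start] * n_players
    for _ in range(n_games):
        a, b = random.sample(range(n_players), 2)
        e_a = 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / 400))
        s_a = 1.0 if random.random() < 0.5 else 0.0  # coin flip
        delta = k * (s_a - e_a)
        ratings[a] += delta
        ratings[b] -= delta
    return ratings

plt.hist(coinflip_elo(), bins=50)
plt.xlabel("rating")
plt.ylabel("players")
plt.show()
```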


> Try simulating random matchmaking with Elo

I will. I was hoping you'd post the actual simulation details rather than more unlabelled graphs.



> with a custom k-value of 50

So you've patched this library somehow? Because when I run your code I get a result that's just full of 0 ratings.

But in any case I'm not at all convinced that your charts don't just show the normal distribution that we'd expect, just in some weird way. (Did you test your plotting methodology against some simpler rating system before using it to draw conclusions about Elo?). Plot a normal histogram, or a density plot if you're feeling fancy: https://towardsdatascience.com/histograms-and-density-plots-... . I'm betting the result is just the bell curve that we'd want and expect.


> So you've patched this library somehow?

Yes, as mentioned, I set the k-value to 50 on this line: https://github.com/HankSheehan/EloPy/blob/master/elopy.py#L8...

The author decided to do something fancy which will only work when the number of players is less than 1/2 * the starting Elo rating.

> But in any case I'm not at all convinced that your charts don't just show the normal distribution that we'd expect, just in some weird way.

As mentioned, you end up with a geometric distribution. I covered a similar phenomenon in a blog post I wrote last year[1]. See Theorem 3.3 in this paper: https://kconrad.math.uconn.edu/blurbs/analysis/entropypost.p... But in short, the geometric distribution has maximal entropy over (0,∞) given a known mean (in our case, the mean will always be 1000).

[1] https://dvt.name/2017/07/10/confusing-math-with-morality/


> As mentioned, you end up with a geometric distribution. I covered a similar phenomenon in a blog post I wrote last year[1]. See Theorem 3.3 in this paper: https://kconrad.math.uconn.edu/blurbs/analysis/entropypost.p.... But in short, the geometric distribution has maximal entropy over (0,∞) given a known mean (in our case, the mean will always be 1000).

Another reply already told you that's irrelevant to Elo, because Elo can go negative (and if it couldn't then the mean wouldn't always be 1000). It's probably going to be normal, and drawing an actual histogram of a simulation like yours comes out looking pretty much like a bell curve: https://imgur.com/YBDp4uI .

As far as I can see none of your claims about Elo stand up. Why do you think you've shown the things that you're claiming?


Elo can be negative, so this doesn’t apply.


>Why so many game devs are enamored with Elo when it comes to ranking is a bit bizarre.

Probably just Occam's razor: they don't know of, or don't care to make, something better, and can just pull Elo off the shelf.


Another example would be competitive overwatch where the developer's stated goal was an equal distribution of rated players throughout the various ranks (bronze/silver/gold/platinum/diamond/masters/grandmasters). They tweaked the variables until they got their desired distribution, lumping the majority of players in gold and plat. Ranking up became an exercise in either playing hundreds of hours or starting a brand new account with fresh MMR.

Predictably this led to an explosion in boosting and win-trading services.


> the developer's stated goal was an equal distribution of rated players throughout the various ranks

and

> They tweaked the variables until they got their desired distribution, lumping the majority of players in gold and plat

seem to be opposites. What am I missing here?


It also probably doesn't even matter, since the main problem is players intentionally losing to rank down and then smashing lower-ranked players.


I also suspect that Chess is exponential because of the "one mistake and you die" nature when playing good players.

Ben Finegold (a chess grandmaster) talks about this all the time: "The reason why I'm higher rated than you is that I can play 100 moves without a major mistake and at some point you will hang a piece. The reason why Magnus Carlsen is rated higher than me is that he will play 100 moves that are slightly better than mine and I will lose."


also, what system(s) do you prefer / know of that handle multiplayer matchmaking well? it seems to me that a good system might be necessarily game-specific to some extent, although i'm sure the state of the art is much better than what i've experienced gaming to date xD.


> also, what system(s) do you prefer / know of that handle multiplayer matchmaking well?

None, and actually I don't think it's particularly healthy for the game. For example, I had plenty of fun casually pubbing Counter Strike in the early 2000s. When I wanted to take the game more seriously, I made a team and joined a league which might include group play, single/double elimination, and exhibition games. Actual competitive play (scrims, matches, tournaments) is fundamentally different than what today we call "matchmaking."


yeah, that strikes me as a pretty fair prescription, unfortunately-- the skill gap from coordinated team play in any team game means that teams that play together often get matched against ad-hoc teams of individually more skilled players to make things "balanced", which (even were it possible in the "50% win probability for each team" sense) still leads mostly to unfun matches one way or the other. and, of course, queueing with friends you don't play with often, or with high skill variation amongst them, just completely screws you from a balance/rank perspective (but hey, at least you get to lose together with all your friends! :)


Do you believe multiplayer games would be better off without matchmaking at all?


Yeah, 100%.

I'm actually flabbergasted why you can't make a team in games like CSGO or Overwatch. And then play in tournaments or matches (against, you know, other teams). It makes no sense to have individual matchmaking in a team game. Game devs create this individual matchmaking system (which is paradoxically taken seriously by casual players, but totally ignored by actual competitive players), and the community and other organizers (enter FaceIT, ESEA, etc.) have to actually set up leagues, tournaments, and events.


In my experience the reason behind devs loving matchmaking is fairly straightforward: being able to solo queue raises engagement. It takes time and effort to make a team in the first place, more time and effort to coordinate games when you're now schedule wrangling n other people, and that extra effort is magnified across all the teams participating. In contrast, hopping into soloqueue is so brainless that the hours spent playing soloqueue end up dwarfing the hours spent playing as teams. Will some people who care enough still play team mode? Sure, but if solo matchmaking is an option it becomes the default simply through being the most-played mode. At the end of the day, devs seem rationally interested in juicing engagement numbers for the vast majority of the playerbase and letting those serious enough to care about not pugging figure it out for themselves.

I think there's a pretty limited space for games that don't compromise on various aspects of design (matchmaking, mtx, etc) with the explicit goal of making a better top-end competitive ecosystem. I'd personally love to see a competitive team-based game without any form of solo queue, but I'm skeptical it would do well in the market. It's almost like Facebook engagement-doomscrolling vs. a mailing list: the format of the latter means there'll probably be better content, but a whole lot more people are going to be hanging out on the former. At least mailing lists don't have to recoup development costs.


> being able to solo queue raises engagement.

This is (unfortunately) probably the answer.

> I'd personally love to see a competitive team-based game without any form of solo queue

I'm okay with solo queue, as long as I can also have a team queue where I could play in traditional seasons or tournaments with a team of friends. It just seems odd that one needs to go outside of the game itself (to ESEA or what-have-you) for this feature.


I see your standpoint. I don't see it happening from a practical/financial perspective, though. Being required to have the right number of similarly skilled friends ready is quite a high entry bar to playing a game.


> It makes no sense to have individual matchmaking in a team game.

I kinda feel broadsided by your rather extreme views here. Later on in this thread you say, okay, solo-queue is fine but you need a way to make teams and join tournaments, so it's also not really clear what you think.

Single queue exists because team games are still fun in pick-up groups. Go to any basketball court and you're going to find guys playing pick-up basketball. I don't hold a 5+ person basketball team in my pocket, and that's okay, because playing with strangers in a team game is still fun. And sometimes even more fun, because you're meeting new people and playing with new team dynamics -- solving new human team dynamics on the fly is an underrated fun part of team games. Single queue matching exists because rank gives people a stake in the game and they take it seriously, and it makes the ranking system accessible, and it's fun.

A game that only offers tournaments and requires you to come with a pre-built 5-man team is just a game that excludes most people. The people forming teams for tournaments are the 1% of the gaming population.

I want to come home from work and play a couple CS:GO games with others who will take the game seriously. I don't have time for a tournament. I don't have a team. I don't want to join a no-stakes casual game where people are putting the controller down to answer the front door or just disconnecting. Without ranked-solo queue, what system do you propose for this common use-case?


> Single queue exists because team games are still fun in pick-up groups.

Single queue is fine, I just don't think the "ranked" aspect of it is healthy for the game.

> I don't want to join a no-stakes casual game where people are putting the controller down to answer the front door or just disconnecting.

Maybe not disconnecting, but trolling and just generally being a pain actually ends up being what happens all the time, even at high solo queue tiers (last year I had two accounts at Global Elite). ESEA and FaceIT have much more robust pugging systems in place, which is why people take them more seriously.

But my point is that even though I'm a very competent Global Elite player, my Counter Strike heydays are behind me and if I were to seriously play against even a semi-pro ESEA-Main (or probably even Intermediate) team, I'd get absolutely destroyed. So solo MMR is a pointless metric to have, and just adds toxicity to your game.


this would probably work better if, in the case of Overwatch, the teams weren't six players (i personally have always felt like the game would be better @ like 4v4 anyway, because of how god damn frustrating dealing with 5 random players on your team every game is in a game that is balanced purely around teamwork and inability to solo carry w/o being much more skilled than everyone else in the game)


You can do this in virtually all games. But requiring you to do so, and failing to offer skill based matchmaking is not what most players want.


> You can do this in virtually all games

No you can't. Overwatch, CSGO, etc, etc. don't have a way to make a team and queue as a team (against other teams). You do this by playing on FaceIT, ESEA, CEVO, or in other leagues. Built-in matchmaking is only individual. This is, from a competitive standpoint, a meaningless data point and (from a casual standpoint) only creates toxicity.


>No you can't. Overwatch, CSGO, etc, etc. don't have a way to make a team and queue as a team (against other teams).

I don't think this is true. I play Overwatch and I sometimes play with anywhere between 1 and 5 other players as we have arranged to group up before looking for a game. With 5 other players, it's a 6-stack, and I believe that a 6-stack will always be matched against another 6-stack. As far as I know, it takes the average skill rating of your group and finds another group with a similar average skill rating to play against you.


Overwatch, CSGO, and virtually all other team-based FPSes allow you to queue solo, as a group, or as a full 5-person team, outside of any specific league. There are dedicated LFG sites for different games to help find groups ahead of time. Generally you will be matched against a similar team, and different games use some form of skill-based matchmaking, but depending on how many players there are, what modes, and what region you are in, as a solo player you could be matched against a premade or vice versa.


I am curious what you mean by matchmaking being only individual; it is common to party up and queue as a 5-stack, both in CSGO and Valorant. Now, when you have a bunch of solo queuers playing against a 5-stack, the actual team is going to win 9/10 times...

I do miss the old days of CS with "clans", where it wasn't so hard to join up and have a loose group of people you played with regularly; you got to know ~20-30 people, and whoever was on would join up to play together (maybe this still exists, but I haven't found it...)


> I am curious what you mean by matchmaking being only individual; it is common to party up and queue as a 5-stack, both in CSGO and Valorant. Now, when you have a bunch of solo queuers playing against a 5-stack, the actual team is going to win 9/10 times...

That's exactly the problem. The MMR system isn't based on team ratings, but on individual players. Otherwise, teams (e.g. 5 players) would always play against other teams (another 5 players). Now, even ignoring the model problems this generates (and the gymnastics that something like TrueSkill does to mitigate it), it's just a bad experience.

For example, if I go to the beach and join some random volleyball pick-up-game, I'm expecting that the purpose of the game is to "have fun." If I'm joining a team to play in a rec league, the expectation is to try and win. The idea of "matchmaking" mixes these two concepts, so you end up having different people with different expectations. Some are going to say "why are you trying so hard" while others will retort "why aren't you trying harder?" This misalignment of expectation is, imo, the chief cause of toxicity in (competitive) video games these days.


In your Magic example you seem to be arguing that which kind of deck you pick is not part of your skill, which is of course totally incorrect. Picking "fun" decks over "obvious/OP" decks means you're worse at winning games. Or at least that you generally play with a handicap, which is easy to account for in Elo.

To your coin-flip example, if you model a league in Excel you'll find that Elo actually results in a rank distribution very consistent with what your intuition would expect (given enough players and enough matches, of course).


> To your coin-flip example, if you model a league in Excel you'll find that Elo actually results in a rank distribution very consistent with what your intuition would expect (given enough players and enough matches, of course).

This is incorrect. If you simulate Elo with a coin flip, you'll get something that looks like this (11 players, 1000 matches): https://imgur.com/9O82pRj -- I think this will tend to a geometric distribution (not sure what the p is, though; it probably depends on the constants).


>> given enough players and enough matches, of course


Here's 1000 players with 100,000 matches: https://i.imgur.com/1Y08jUB.png

Feel free to try simulating it yourself, but even mathematically it makes no sense to end up with a uniform distribution as we tend to infinity.


Elo, not ELO, after Árpád Élő.


Correct, fixed :)


1. The sigmoid function is the closest thing to linear that makes sense on probabilities⁺. A purely linear function would cross 0% and 100%, while the sigmoid flattens exponentially as it approaches the extreme values.

2. The fit isn't as bad as the author claims. It looks like the biggest difference between the graphs is that the point differences are scaled differently (400 pts for 90% in Elo vs 800 pts in the second graph).

A quick and dirty overlay of the two graphs shows a reasonable fit: https://ibb.co/0YwYH9z

3. I like observations about player psychology. Satisfying the players is more important than having the mathematically best ranking system.

4. Personally I like Whole History Rating (https://www.remi-coulom.fr/WHR/), but it's unlikely to be popular with players (the psychological criticisms the article makes apply to it as well, with some additional problems, like rank drifting without playing). KGS, which uses a ranking system similar to WHR (but more primitive), certainly draws a lot of criticism for its ranking system.

If I had to design a mathematically optimal ranking system, I'd start with WHR and make parts of it trainable/fittable.

----

⁺ Bayes' theorem turns into addition when applied to logarithmic probabilities, and the sigmoid function converts from logarithmic probabilities back to normal probabilities. This property is why it (or its multi-category equivalent, softmax) is used when predicting probabilities with logistic regression or neural networks.
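
A numeric illustration of that footnote, using only the standard definitions of logit and sigmoid:

```python
import math

def logit(p):
    # probability -> log-odds
    return math.log(p / (1 - p))

def sigmoid(x):
    # log-odds -> probability
    return 1 / (1 + math.exp(-x))

# Bayes' rule in log-odds space is just addition:
# posterior log-odds = prior log-odds + log likelihood ratio.
prior = 0.5                # even prior odds
llr = math.log(0.9 / 0.2)  # evidence 4.5x likelier under the hypothesis
posterior = sigmoid(logit(prior) + llr)
print(posterior)           # 0.818..., the same as 0.9 / (0.9 + 0.2)
```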


Creating a custom system to suit your situation's needs sounds great, and the thought process was fun to read, but some of the claims lobbed here are pretty questionable.

Specifically, the claim that Dota's matchmaking system is "probably wrong" because the model chosen doesn't match your own findings feels like a reach. Sibling commenters have pointed out how skill variance is important to allow the Elo system to function in games like chess. Additionally, someone else pointed out that the sigmoid function is similar to a linear function close to zero.

It seems at least as likely that Acolytefight doesn't have a high enough level of skill expression for top players to "curve out" weaker players, rather than the exponential mapping of player skill being useless or wrong.

Does Elo suck? Maybe, but this hasn't convinced me.


Elo might or mightn't suck (imo it's a great ranking system). But the article sucks. Vanilla Elo is built around chess, and some adjustments to the scale and/or K-factor might be necessary to fit the circumstances. A quick change of scale to E_a = 1 / (1 + 10^((R_b - R_a) / 800)) and all of a sudden Elo very accurately reflects the game's actual results: https://imgur.com/a/rFP5U0g

Meaning just that skill is a weaker factor in this game than in chess...

Edit: The 'actual' curve includes a correction for the obvious anomaly of ~55% win expectation at 0 point delta.


I remember a while back, the Go server where I play most of my Go these days, [OGS](https://online-go.com), changed its ratings from Elo to Glicko-2.

You can read their rationale for it in this forum thread: https://forums.online-go.com/t/ogs-has-a-new-glicko-2-based-...

The key takeaway is this:

> Most of the shortcomings [of Elo] can be traced back to the fact that the system is too slow to find a player’s correct rank, and too slow to adapt when jumps in strength occur.

> The problem of slow moving ratings is a well-known problem with Elo implementations. In response to this, Prof. Mark Glickman developed the Glicko, and later Glicko-2, rating systems which address this problem very well and are fairly widely used

A few weeks ago they made an update to their implementation of Glicko-2, and in the announcement they summarized many interesting statistics on how the system has panned out for them: https://forums.online-go.com/t/2020-rating-and-rank-tweaks-a...


Wow, I wrote this article ages ago, didn't expect to see it posted here today.

I just want to clarify the point of the article:

Why would you fit a curve to the data when you can just use the actual data?

That's the point of the article.

We're in the age of big data, and we should use it to make better win-rate predictions. Elo's exponential curve is fine; it's approximately right. It's just that we can now have databases of millions of games, so we can do better. Elo was invented before the big-data age and is limited by that.
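
In code, the idea is basically a lookup table over historical games, something like this toy sketch (the bucket width and names are illustrative, not the actual implementation):

    from collections import defaultdict

    BUCKET = 50  # rating-difference bucket width; an arbitrary choice

    wins, games = defaultdict(int), defaultdict(int)

    def record(ra, rb, a_won):
        b = round((ra - rb) / BUCKET)
        games[b] += 1
        wins[b] += int(a_won)

    def win_prob(ra, rb, fallback=0.5):
        # Look up the observed win rate for this rating gap directly,
        # instead of reading it off a fitted curve.
        b = round((ra - rb) / BUCKET)
        return wins[b] / games[b] if games[b] else fallback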

That's all I'm saying.

I shouldn't have included all the other stuff in the article, it just distracts from the point.


Thanks for writing the article and sharing your work with the world, I really enjoyed it! I think the central point you make is very interesting.

I'd be interested to know what fit you used for the red "line of best fit", and why not a straight line? My main question here is: do you actually expect a player ~210 points above another to win _less_ often than if they were only ~190 points above? (the first dip in the red graph)


If you're interested in evaluating and rating/ranking agents, it might be worthwhile checking out DeepMind's multidimensional Elo rating system (https://arxiv.org/abs/1806.02643) which attempts to solve some of the issues with Elo and Glicko. Most notably, the ability to handle non-transitive interactions (like rock, paper, scissors) and the presence of redundant duplications of matches that might erroneously inflate ratings.

Shameless plug, I've created an R implementation of it here: https://dclaz.github.io/mELO/
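
For a flavor of the model, here is a minimal sketch of the mELO-2 prediction step as I read it from the paper (this is not the package's API; the names are mine):

    import numpy as np

    # Each agent gets a scalar rating r and a 2-vector c; the antisymmetric
    # term c_i' Omega c_j is what lets the model express cycles.
    OMEGA = np.array([[0.0, 1.0],
                      [-1.0, 0.0]])

    def melo2_win_prob(r_i, r_j, c_i, c_j):
        logit = (r_i - r_j) + c_i @ OMEGA @ c_j
        return 1.0 / (1.0 + np.exp(-logit))

    # Three agents in a perfect cycle: identical ratings, rotated c-vectors.
    c = [np.array([1.0, 0.0]), np.array([-0.5, 0.87]), np.array([-0.5, -0.87])]
    print(melo2_win_prob(0, 0, c[0], c[1]))  # ~0.70: each beats the next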


This is fantastic, thank you for bringing this up.


I'm curious about whether the author tried to optimize Elo's K factor. It's often left at 32, which is not reasonable for all contests. It's essentially related to the standard deviation of player skills: if there is a large range of skills, it should be large, and if there is a small range, it should be small. It's easy to tune by optimisation, and it has a huge effect on predictive ability.
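
Tuning it can be as simple as a grid search over the game history (a rough sketch; `history` here is a made-up list of (player_a, player_b, a_won) records):

    import math

    def log_loss_for_k(history, k):
        """Replay `history` with K factor `k`; return mean predictive log loss."""
        ratings, loss = {}, 0.0
        for a, b, a_won in history:
            ra = ratings.get(a, 1500.0)
            rb = ratings.get(b, 1500.0)
            e = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))   # P(a wins)
            loss -= math.log(e if a_won else 1.0 - e)
            s = 1.0 if a_won else 0.0
            ratings[a] = ra + k * (s - e)
            ratings[b] = rb - k * (s - e)
        return loss / len(history)

    # best_k = min(range(8, 80, 4), key=lambda k: log_loss_for_k(history, k))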


The more obvious solution is to bring back custom lobbies and private servers and forget about ranking players at all. Gets rid of a lot of bad behavior too because servers can police their own communities and players won't get frustrated when a crappy teammate is dragging their ranking down


idk, that makes it extremely hard to find matches in games with a smaller player base

see War Thunder: the simulation queue is a desert, high-tier ships a wasteland; unless all the available players get forcibly lumped together, matches will just not happen

compare with Stormworks too: most servers are empty in my timezone and the populated ones are password protected or spawn limited. it wouldn't take much to get known and participate in their community, but for a working gamer the time commitment is simply impossible.

same with Arma 3. I'd love to get into ShackTac, but timezone and commitments make it unavailable to me, and since most of the good players are sucked up into teams the public servers are a mess of "what's left" of the community


Matchmaking without custom servers/lobbies makes finding a match even harder, since a minimum number of users at a specific ranking/skill level/ship tier/whatever must all be online and searching for a match at the same time. Custom servers and lobbies allow just one or two players to start, and advertise to other players that a game is available. The initial players just need to wait until more people show up, and can play more casual game modes or with bots or whatever until more people arrive.


The purpose of ranking systems is to try to create fair matches.

I think ideally you wouldn't show the ranking to the player, just use it to create the match.

With a large population, everyone should end up winning about half their games. That would be the sign of a successful ranking system.


You're conflating ranking and matchmaking systems.

A ranking system measures relative skill/performance levels. You can have a ranking system without using it for matchmaking.[0]

I don't agree that a typical 50% win rate[1] indicates a successful matchmaking system. For one thing, creating fair matches is _a_ purpose of a matchmaking system, but not necessarily the _sole_ purpose. For another, that people win half their games on average says nothing about how fair the matches were.

I think that fairness often gets prioritized over fun. Playing sports should be fun at all levels, but it's particularly important at the lower skill levels that the participants enjoy themselves. That's how sports grow and become cultural institutions. Being a low skill player in a silo of other low skill players is a decidedly un-fun experience that drives a lot of new players away from e-sports. A ranked matchmaking system could be designed with the express purpose of helping low skill players have fun and naturally develop into average skill players.[2] I wonder what such a system would look like.

[0] See FiveThirtyEight's Elo ratings for NFL teams: https://projects.fivethirtyeight.com/2019-nfl-predictions/

[1] However that's measured.

[2] Under such a system fairness might be relegated to the seeding process for tournament play.


> I don't agree that a typical 50% win rate

Assuming there are no ties, and teams have an even number of players, a multiplayer competitive video game is going to be a zero sum game; for every winner, there is going to be a loser.

While I agree that it isn't the SOLE purpose of a matchmaking system, I do think a fair matchmaking system will end up with most players having a 50% win rate (with a few people at either end of the skill spectrum having lower or higher win rates). If you are winning more of your games, you should get matched against better players until you start losing again (and vice versa). You should eventually hit an equilibrium where you are playing people you have about a 50% chance of beating.


most ranking systems put players in an oscillating loop around their equilibrium. often that 50% win rate is not 100% fun matches but a sum of 50% unwinnable frustrating matches and 50% easy boring matches

measuring win rate does shit like that

and then most systems double-punish you for quitting the unwinnable matches

that metric is possibly the second most frustrating aspect of online play


In certain communities, players will choose ranked vs. unranked almost always. I agree that a ranked + custom lobby model should exist though.


> If we take a top-level player, and make them fight a high-level, mid-level and low-level player repeatedly until we can become statistically confident of their win rates against each, there is no reason why their win rates would fit an exponential curve.

When I first read this, I thought to myself "well we get to pick the scores, so it's exponential by definition". The problem becomes more clear when you express it without any reference to the scores.

If Player A wins 80% of the time against Player B, and Player B wins 80% of the time against Player C, how often does Player A win against Player C? This is a question purely in terms of observables. Elo makes a prediction here (94.1% of the time) and it can be either right or wrong. If it's wrong, then there is no valid assignment of scores.
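
You can check this arithmetic yourself (a quick sketch using the standard chess constants of base 10 and scale 400):

    import math

    def delta(p):                    # rating gap implied by win probability p
        return 400 * math.log10(p / (1 - p))

    def prob(d):                     # win probability implied by rating gap d
        return 1 / (1 + 10 ** (-d / 400))

    # Elo composes win probabilities by adding rating gaps (log-odds):
    print(prob(delta(0.8) + delta(0.8)))   # ~0.941, Elo's prediction for A vs C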


Isn't a qualitative system possible? It would be really complex to create for a game such as Dota 2 or CS:GO, but maybe not for a simpler game. I will give CS:GO as an example only because I know it very well. It would be possible, I believe, in theory, to measure player knowledge of specific in-game skills. New CS players, for instance, wouldn't know how to control recoil effectively, and 100% of Global Elite/pro players would be above a certain threshold of recoil control. On the other hand, you could say with a lot of confidence that a player who tries to reach high ground by pressing only +jump multiple times with no success, when he would need a crouch jump because of the height, is a noob. Elo or something similar could then be used to measure ranks within specific clusters only. And some form of mood/form on top of this, to allow for a better experience (even though I have played CS for 20 years now, it could happen that I abandon the game for a few months, or that I have really bad focus because of external events).

I'm not sure if this makes sense, but what I know for sure is that as an experienced player, I can watch a player play a single game (sometimes a few rounds) and assess his average rank/skill level with high confidence, with no need for information from his prior games whatsoever, or detailed statistics of his gameplay.

There's something else to remember for high skill-ceiling games: win rate is not what really matters. A lot of the time I will play a very good, balanced and fun game and lose. Sometimes it will even happen with very uneven scores like 16-5 or something...


I am pretty sure the author is describing a well-understood limitation of Elo; it just needs a little connecting to existing models.

Elo can be thought of as an approximation to item response theory (IRT) models [1]. These describe skill as normally distributed, and model whether one person will win using a logistic function (not an exponential).

I think what the author has keyed in on is that, afaik, simple Elo has no slope coefficient for the logistic, while general IRT models do (it's called item discrimination). So in Elo you can't learn the flatter curve they show.

[1]: http://hvandermaas.socsci.uva.nl/Homepage_Han_van_der_Maas/P...
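
Concretely, the extra slope looks like this (a rough sketch of a 2PL-style curve, not the author's system; plain Elo effectively hard-codes `a`):

    import math

    def p_win(skill_a, skill_b, a=1.0):
        # `a` is the discrimination (slope) parameter: a < 1 flattens the
        # curve (skill matters less per point), a > 1 steepens it.
        return 1 / (1 + math.exp(-a * (skill_a - skill_b)))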


The "newbie suppression" mechanic doesn't make much sense to me. If you play against someone substantially lower in rating than you and lose, shouldn't you lose a significant amount of points? After all, you lost to someone you should have easily beaten.


I agree, and the proposed solution, which is to limit point gains/losses to one point per game, feels like throwing the baby out with the bathwater. Specifically, convergence takes a long time, the result of which is that a very good player on e.g. a new account (a smurf) will be the cause of a lot of unbalanced games for an awfully long time.

Having played a lot of ranked LoL, I saw a few recurring but irrational gripes players had with the Elo based system:

- "I get matched with bad teammates and they drag me down". On average your teammates are the same Elo as you. All players get their fair share of games where they are/aren't the underdog side. On average, it averages out. Deal with it.

- "I've been stuck at the same Elo for ages but I should be higher". Nope, Elo only cares if you win or lose. It doesn't care about kill/death ratio, creep score or how many ganks you pull off. Focus on winning more. Incidentally, focusing on winning instead of secondary metrics like kills/CS was one of the biggest mindset differences between high/low Elo players.

"I should be higher Elo but I play support roles so can't climb". It may be true that you climb slower but here's the rub - think of your matchups as you being compared to the enemy team's support player. The other four roles on each team are actually a constant factor (by symmetry arguments you could not consistently find that your four teammates are any better/worse than the enemy support player's teammates). As a result, the only remaining factor in the statistical equation is you weighed up against the enemy support player. If you can provide even a slight statistical advantage towards winning vs them then you will climb the Elo ladder.


> As a result, the only remaining factor in the statistical equation is you weighed up against the enemy support player. If you can provide even a slight statistical advantage towards winning vs them then you will climb the Elo ladder.

An alternative explanation is that the skill ceiling is lower for support players.


Being a low skill player playing exclusively with and against other low skill players sucks. Imagine playing doubles tennis where all four players hit the ball directly into the net 80% of the time. Win or lose it would be an unpleasant experience. I think that's the root of many people's frustration with ranked matchmaking in e-sports games.

I mentioned this in another comment, but I think if e-sports wants to become truly culturally significant the games will need to figure out how to more elegantly bridge the gap between ranked and unranked play, and how to make it fun to play at low skill levels. I don't think it's a coincidence that Fortnite has done both.


It doesn't make sense as part of a rating system, but it may be good for community. Chess is pretty toxic with people refusing to play lower rated people, sometimes, because of danger to rating. Also, people with very low ratings are more likely to have their rating not represent their true ability (my son lost a game against a girl rated 300 that played a very accurate 1400-ish looking game against him... turns out she hadn't been in a rated game for 3 years despite being very active with her local chess club for that time; meanwhile it takes two tournaments or more to get those rating points back).


This is why you have two rating systems; one you show to people and is geared towards making players happy, and one that is used internally to create fair matches.


It seems reasonable to cap the number of points you can lose, but it strikes me as odd that you'd lose fewer points in an upset than in an even match-up.


Agreed on that point.

A better mechanism, IMO, even though it doesn't catch all the cases: provisional ratings, where you don't affect other people's ratings much for the first n games (and where your own rating moves faster, too).

Effectively it acknowledges you have less of a prior when you've played few games.


Lichess uses Glicko-2, which does pretty much that.


There are 2 issues being solved by "newbie suppression".

1) Cases where the ranking of the opponent is not well known, either because they are on a new account, or because they recently had a massive change in skill level (say, from practicing on a different account, or from not playing and losing ranking points due to the decay mechanic).

2) Inconsistent play. Ranking systems generally assume skill level is mostly static, with gradual changes over time. In practice, people have bad days. Limiting the influence of any single game reduces the noise introduced by inconsistent play, at the expense of making convergence slower (see the sketch below).
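
The capping part might look roughly like this (an illustrative sketch, not the article's exact rule):

    def capped_update(rating, expected, score, k=32, cap=1.0):
        # Clamp the per-game change: robust to noisy single games,
        # at the cost of slower convergence for mis-rated players.
        delta = k * (score - expected)
        return rating + max(-cap, min(cap, delta))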


I agree that those are real problems that need to be dealt with. I don't think suppressing the effects of a loss against a lower-rated player solves them very well, though. The suppression could just be applied in those cases instead of in every case where there's a large rating disparity.

There are plenty of cases where a highly-rated player loses to a low-rated player fair and square. In those cases there should be significant rating adjustments -- at least more significant than a normal game between similarly-rated opponents -- but this system removes those to combat edge cases. I think it would be more effective to deal with those edge cases directly.


Newbies have a low rating because they’re new, not because they definitely suck.


Sure, but the suppression described doesn't apply only to new players, it applies to "someone substantially lower in rating than you."


Repeated Bernoulli trials give rise to Gaussian distributions, which is where the exponential in these formulas comes from.

This is an assumption and an approximation, and it is not necessarily a good fit. Pulling from actual probabilities would generally perform better.

The rest is massaging to better fit the different objectives.
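
A quick simulation of the Bernoulli-to-Gaussian point, for the curious (illustrative only):

    import random

    # Performance modeled as the sum of many small Bernoulli outcomes
    # tends to a Gaussian, per the central limit theorem.
    samples = [sum(random.random() < 0.5 for _ in range(100))
               for _ in range(10_000)]
    # a histogram of `samples` is approximately Normal(mean=50, sd=5)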


If your curve is linear, it's because your game isn't that hard (or more formally, winning and skill are less strongly correlated). This is tough for people to hear if their game is "designed to be a high-skill game".

A linear curve essentially means that skill in the game confers less of a relative advantage. Chess is a good counterexample here, as is Rocket League. Both are games where difference in MMR is very strongly correlated with outcome, and both are games where skill is easily measured and highly correlated with ranking.


Take a look at TrueSkill, a much better mathematically grounded system, created at Microsoft Research and used at scale on Xbox: https://en.m.wikipedia.org/wiki/TrueSkill


TrueSkill definitely has a time decay term and I'm fairly sure it lets you fit the model to previous games. I wonder if the author actually tried it. (Though to be fair I'm not sure if there are open source versions of the latest version of TrueSkill.)


Yes, tried Glicko then TrueSkill, both generated huge amounts of complaints. New system produced few complaints. If the community had liked it, would've stuck with TrueSkill.


TrueSkill 1 presumably?



The parent is talking about TrueSkill 2 [0], while the trueskill Python library you linked currently only supports the original TrueSkill algorithm [1][2]. TrueSkill 2 takes into account individual scores of players in order to weigh the contribution of each player to each team. The idea is that this allows TrueSkill 2 to converge faster for new players.

[0] https://www.microsoft.com/en-us/research/publication/trueski...

[1] https://github.com/sublee/trueskill/issues/27

[2] https://www.microsoft.com/en-us/research/project/trueskill-r...


How about coop games — what would you use to rate players where the goal is to win together?


Wait, why don't we use a deep learning thingy on this dataset and just back out a formula that predicts the wins based on the players' relative ratings?
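
The simplest version of that is a one-feature logistic regression on rating difference, which recovers a rescaled Elo curve; deeper models can learn shapes Elo can't. A toy sketch assuming scikit-learn, with made-up data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy data: rating difference -> did the first player win
    diffs = np.array([[120], [-40], [300], [-220], [60], [-150]])
    outcomes = np.array([1, 0, 1, 0, 1, 0])

    model = LogisticRegression().fit(diffs, outcomes)
    print(model.predict_proba([[200]])[0, 1])   # learned P(win | +200)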


Nonsense - they're in the Rock and Roll Hall of Fame after all! Jeff Lynne is a musical genius.


Elo was in Age of Empires back when zone.com was a thing.

It worked, and worked well. Points were calculated for each person. However, Dota 2 and LoL don't implement Elo the same way; points are calculated for the team. So if you're a low-rated player and you win against higher-rated people, in Dota and LoL you won't gain many points.

I believe this is done to avoid players being carried, but it doesn't work because it just results in you being stuck in a low tier for ages.

TLDR: Elo works and it's great. No one implements it right.

Edit: In Age of Empires / Zone, if you had a 4v4, it used all 8 players to calculate the Elo change for each individual player. So if your team ranged from 1750 Elo down to 1550 Elo, the 1750 might gain only 1 point while the 1550 might gain 16 points (the maximum gain got smaller the more people were in the game). On the losing side, the lowest-Elo player would lose the fewest points and the highest-Elo player would lose the most.

Dota / LoL don't do this; everyone on the winning/losing team gains/loses the same amount of points. This is wrong.

This means a high-Elo player has the potential to farm points from low-Elo players with little risk, while low-Elo players get stuck not playing people in their own range.
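
A rough sketch of the per-player scheme described above (illustrative; certainly not Zone's actual formula):

    def team_elo_updates(team_a, team_b, a_won, k=32):
        # Each player's update is computed individually against the
        # opposing team's average rating, so a 1750 on a winning team
        # gains less than a 1550 teammate.
        avg_b = sum(team_b) / len(team_b)
        s = 1.0 if a_won else 0.0
        return [k * (s - 1 / (1 + 10 ** ((avg_b - r) / 400)))
                for r in team_a]

    print(team_elo_updates([1750, 1550], [1600, 1600], True))
    # -> [~9.5, ~18.3]: the stronger player gains fewer points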


I recall at least one large previous thread about Elo but can't find it. Anyone?



This is useful to increase plays by reducing “ladder anxiety”


Isn't that a logarithmic curve?


It's a sigmoid: approximately linear near 0, and approaching its asymptotes exponentially far from 0.
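
A quick numerical check, using the natural-log sigmoid:

    import math

    s = lambda x: 1 / (1 + math.exp(-x))
    print(s(0.1) - 0.5, 0.1 / 4)     # near 0: slope ~1/4, i.e. roughly linear
    print(1 - s(6), math.exp(-6))    # far from 0: the tail shrinks exponentially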



