There is a fight that happens almost every week before a game at Atmos Football, and it goes like this.
The app generates two balanced teams. It considers every possible way to split the players (however many turn up that night), scores each combination by how close the two sides are in average Elo rating, and picks the fairest option. It shows you the teams, the average Elo of each side, the difference between them, and the estimated win probability. On a good night the gap is under 30 points and the probability sits close to 50/50.
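To make the mechanics concrete, here is a minimal sketch of the balancing step, assuming nothing more than a dictionary mapping player names to ratings. The names and numbers are invented, and this is an illustration of the idea rather than the app's actual code.

```python
from itertools import combinations

def fairest_split(ratings: dict[str, float]) -> tuple[set[str], set[str], float]:
    """Try every way to split the players into two sides and return the
    split with the smallest gap between the two teams' average Elo."""
    names = sorted(ratings)
    half = len(names) // 2  # with an odd turnout, one side gets the extra player
    best = None
    for team_a in combinations(names, half):
        team_b = [n for n in names if n not in team_a]
        avg_a = sum(ratings[n] for n in team_a) / len(team_a)
        avg_b = sum(ratings[n] for n in team_b) / len(team_b)
        gap = abs(avg_a - avg_b)
        if best is None or gap < best[2]:
            best = (set(team_a), set(team_b), gap)
    return best

# Hypothetical ten-player night
ratings = {"Dave": 1640, "Chris": 1605, "Sam": 1560, "Alex": 1530, "Ben": 1500,
           "Tom": 1480, "Rob": 1455, "Joe": 1430, "Max": 1400, "Lee": 1370}
team_a, team_b, gap = fairest_split(ratings)
print(sorted(team_a), sorted(team_b), round(gap, 1))
```

With ten players that is only 252 combinations (each split shows up twice, once per labelling), so recomputing the whole thing is instant, which is why the app can update the numbers the moment someone is swapped.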
Then someone looks at the teams and says: "There's no way that's fair."
What They See vs What the App Sees
The objection is always the same shape. One team has two players who "feel" strong together. Or one side looks physically bigger and faster. Or last week that combination won 8-2, and now the app wants to put the same players together again. The eye test says one side is clearly better.
So the organiser (usually me) adjusts. Swaps two players. The app instantly recalculates. The Elo difference jumps from 25 to 90 points. The win probability shifts from 51/49 to 62/38. The app is now telling you, plainly, that the adjusted teams are not as fair.
The group plays anyway, because the eye test won the argument.
Then the game finishes, and the team the app said was favoured — the one with the 62% win probability after the swap — wins. Not always by a landslide. Sometimes it's close. But often enough that you start to wonder whether the numbers knew something the eye test didn't.
What Elo Actually Is
To understand why the app is usually right, you need to understand what Elo is measuring — and what it isn't.
Elo is a rating system originally designed for chess by a Hungarian-American physicist named Arpad Elo in the 1960s. The basic idea is simple: every player has a number. When you win, your number goes up. When you lose, it goes down. The clever part is how much it moves.
If you beat someone rated much higher than you, your rating jumps. You've proven something. If you beat someone rated much lower, it barely moves — you were supposed to win. The system automatically adjusts for the quality of every opponent in every game.
In Atmos Football, every player starts at 1500. After each game the app recalculates everyone's rating. In our group the typical range sits between about 1350 and 1650. That 200–300 point spread between the top and bottom player doesn't sound huge, but it represents a consistent, measurable edge over many games.
The win probability between two sides is calculated from the difference in their average ratings. A 200-point gap roughly translates to a 76% chance of winning for the higher-rated side — similar to how it works in chess or in FIFA's professional rankings.
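That mapping is the standard Elo expectation curve. A quick check of the numbers quoted in this post, assuming the app uses the textbook formula:

```python
def expected_score(rating_gap: float) -> float:
    """Probability of winning for the higher-rated side, given its
    advantage in rating points (standard Elo expectation)."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))

print(round(expected_score(30), 2))   # 0.54 -- a good night, close to 50/50
print(round(expected_score(90), 2))   # 0.63 -- roughly the post-swap example above
print(round(expected_score(200), 2))  # 0.76 -- the 200-point figure
```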
Why 5-a-Side Elo Is Harder Than Chess
Chess is one versus one. Football is five versus five (or whatever number we have that night). That creates real problems.
In chess, if you beat someone, the credit is entirely yours. In 5-a-side, you won as part of a team. If your side won 7-2, how much of that was you? How much was the other four players carrying the game while you stood in the wrong position for half the match?
The app handles this in several ways:
Team averages. Each team's "rating" is the average Elo of its players. The team generator's main job is to make those two averages as close as possible by checking every possible combination.
Margin of victory. A 7-2 win shifts ratings more than a tight 4-3 win. But it's capped — a 10-0 blowout doesn't move the needle much more than a 5-0, because at some point the losing side has usually stopped trying. This is a separate multiplier applied on top of the base update — it handles score sensitivity, not the size of the rating change itself.
K-factor. The K-factor controls how much a single result can move your rating. It has nothing to do with how close the game was — that's the margin-of-victory multiplier above. K is based purely on experience: new players start with a high K (fast-moving ratings) because the system is still finding their level, and it decreases as they accumulate games. After around 20 games the swings settle down. This is deliberate: you want the system to find the right rating quickly for a new player, then hold it steady once confidence is established. (A sketch putting these pieces together follows this list.)
Inactivity decay. Players who miss a long stretch of games drift slowly back toward 1500. This stops someone from sitting at the top of the leaderboard permanently just because they haven't played recently.
Multi-pass convergence. Because each player's rating depends on every other player's rating (your wins are worth more if the people you beat later turn out to be strong), the app runs the entire historical calculation ten times. Each pass refines the previous one — the early passes do the heavy lifting and by the final passes the corrections are negligible.
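Put together, a single player's update after one game looks something like the sketch below. The shape follows the description above, but the specific constants (the 40/20 K schedule, the margin cap, the 2% decay rate) are illustrative stand-ins, not the app's actual values.

```python
import math

def k_factor(games_played: int) -> float:
    """Experience-based K: fast-moving ratings for newcomers, settling down
    after roughly 20 games. The 40/20 schedule is an illustrative stand-in."""
    return 40.0 if games_played < 20 else 20.0

def margin_multiplier(goal_difference: int) -> float:
    """Bigger wins count for more, but the effect is capped so a 10-0
    blowout moves ratings only slightly more than a 5-0."""
    return min(1.0 + math.log1p(abs(goal_difference)) / 2.0, 2.0)

def update_rating(rating: float, opponent_avg: float, result: float,
                  goal_difference: int, games_played: int) -> float:
    """One player's post-game rating. `result` is 1.0 for a win, 0.5 for a
    draw, 0.0 for a loss; the opposing team's average Elo stands in as the
    opponent in the standard Elo update."""
    expected = 1.0 / (1.0 + 10 ** ((opponent_avg - rating) / 400.0))
    k = k_factor(games_played)
    return rating + k * margin_multiplier(goal_difference) * (result - expected)

def decay_toward_baseline(rating: float, weeks_inactive: int) -> float:
    """Inactivity decay: drift a small fraction of the way back to 1500 for
    each week of games missed. The 2% weekly rate is a stand-in."""
    return 1500.0 + (rating - 1500.0) * 0.98 ** weeks_inactive

# Example: a 1480-rated player on the winning side of a 7-2 result
print(round(update_rating(1480, 1520, 1.0, 5, games_played=30), 1))  # gains roughly 21 points with these stand-in constants
```

The multi-pass convergence is then just a loop around updates like these: replay the full game history ten times, each pass refining the ratings produced by the pass before.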
These adaptations make Elo workable for casual 5-a-side, even if it's never going to be as precise as in individual sports.
So Why Do Fair Teams Feel Wrong?
The answer is that the eye test uses different data than the Elo system, and much of that data is unreliable.
When you look at the two teams and think "that's not fair," you're usually processing something like: "Dave and Chris were dominant together last week" or "that side looks bigger and faster on paper." Those observations are real. But they're drawn from a tiny sample — one or two recent games — weighted heavily by how dramatic or memorable the moment was.
Elo is drawn from every game. All of them. It has already accounted for the fact that Dave and Chris were dominant last week — their ratings went up accordingly. It has already accounted for the 8-2 blowout — the ratings shifted. The information you're trying to add by swapping players is information the system already has, just averaged over a much larger and more consistent dataset.
The eye test is also terrible at separating individual skill from team composition and temporary form. You remember Dave being brilliant, but Dave was playing alongside three other strong players that night. His rating reflects his performance across dozens of different team configurations — not just one cherry-picked memory from last Thursday.
Chemistry — the mysterious "these two just play well together" factor — is real, and the app does attempt to measure it. But even there, the data shows it's less powerful than raw individual skill over the long run. Most of what looks like chemistry turns out to be two good players being good at the same time.
The Adjustment Tax
Here's what the data shows over time. When the organiser accepts the app's top suggestion, the predicted win probability is typically close to 50/50. When the organiser adjusts — swaps players based on the eye test or vibes — the probability shifts, sometimes dramatically.
The app displays this in real time. As players are swapped between teams, the Elo gap and win probability update instantly. It is, in effect, showing you the cost of the adjustment.
After the game, the post-match summary compares what was predicted to what actually happened: the predicted winner, the predicted score, and the actual result.
The prediction isn't always right. It can't be — this is casual football with variable attendance, tired legs, and people who sometimes can't be bothered tracking back. But over dozens of games the pattern is consistent: when teams are adjusted away from the optimal balance, the favoured side tends to win. Not every time. Often enough.
The app doesn't have an opinion about whether you should override it. It just shows you the numbers before and after, and then shows you what happened. You can draw your own conclusions.
The Real Question
The interesting thing isn't whether the Elo system is perfectly accurate — it isn't, and it never will be in a small casual group with limited data.
The interesting question is whether the group will trust it.
A number on a screen telling you that your instinct is wrong is an uncomfortable experience. The players who accept it tend to be the ones who have watched the predictions come true often enough to override their own eye test. The players who resist it tend to be the ones who remember the time the app was wrong far more vividly than the many times it was quietly right.
Both reactions are human. The data doesn't care about either one.
Every Thursday the same argument happens, and every Thursday the app quietly shows its working. Whether the group listens is a different question — and honestly, the answer changes week to week. Some nights we trust the numbers and the games feel surprisingly even. Other nights we ignore them, someone complains about the teams, and the side the app warned us about wins anyway.
The argument repeats not because people are irrational but because the stakes are asymmetric. If you insisted the teams were unfair and then lost, you were right about something. If you accepted the app's suggestion and lost, you just lost. The cost of being wrong about the teams is low. The cost of admitting the app knows better than your instincts is social. That asymmetry keeps the argument alive regardless of what the data shows.
— James