Most predictions in football betting are sold to you with a number. "Our AI is 73% accurate." "We hit 4 of 5 picks last weekend." Sometimes a screenshot of last week's winners.
These metrics are mostly meaningless. They confuse hit rate with accuracy, ignore the probability the forecaster assigned, and almost never tell you whether the forecaster is overconfident, underconfident, or biased in a specific direction.
The right way to judge a probabilistic forecaster is a calibration chart. We publish ours here. This post is about how to read one: yours, ours, or that of anyone else brave enough to show one.
The question calibration answers
A forecaster says "Manchester City have a 70% chance of beating Burnley." What should that mean?
It should mean: across all the matches where the forecaster says the favourite has a 70% chance, the favourite wins about 70% of the time.
Not 70% of one match (you can't win 70% of one match). 70% of all the matches assigned that probability across a long enough sample.
If a forecaster says 70% and the team only wins 55% of the time, the forecaster is overconfident at the 70% level. If the team wins 85% of the time, the forecaster is underconfident: they should have said 85%.
Both errors are real failures, even though one looks like "your model was too cautious" and the other looks like "your model bragged too much." Both mean the number you were given doesn't match reality.
What a calibration chart actually shows
A calibration chart has two axes:
- X-axis: the probability the forecaster assigned (0% to 100%)
- Y-axis: the actual outcome rate (0% to 100%)
The dataset is grouped into bins. All predictions where the forecaster said "between 50% and 60%" go in one bin; all predictions of "between 60% and 70%" go in another; and so on.
For each bin, you plot:
- The average predicted probability (X)
- The actual hit rate of that bin (Y)
If the forecaster is well-calibrated, the dots line up along the diagonal. A 70% prediction lands roughly 70% of the time; an 85% prediction lands roughly 85% of the time.
If the dots fall systematically below the diagonal, the forecaster is overconfident: they predict more than they deliver.
If the dots fall systematically above the diagonal, they're underconfident.
If the dots wobble around the diagonal randomly, it's noise from a small sample.
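The binning described above can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the function name `calibration_bins`, the choice of ten equal-width bins, and the sample numbers are all assumptions made for the example.

```python
from collections import defaultdict


def calibration_bins(probs, outcomes, n_bins=10):
    """Group predictions into equal-width bins and summarise each bin.

    probs:    predicted probabilities in [0, 1]
    outcomes: 1 if the predicted event happened, 0 if it did not
    Returns (mean_predicted, hit_rate, count) for every non-empty bin,
    ordered from the lowest bin to the highest.
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")

    grouped = defaultdict(list)
    for p, y in zip(probs, outcomes):
        # A prediction of exactly 1.0 is placed in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        grouped[idx].append((p, y))

    rows = []
    for idx in sorted(grouped):
        pairs = grouped[idx]
        mean_pred = sum(p for p, _ in pairs) / len(pairs)  # X value of the dot
        hit_rate = sum(y for _, y in pairs) / len(pairs)   # Y value of the dot
        rows.append((mean_pred, hit_rate, len(pairs)))
    return rows


# Made-up predictions and results, purely for illustration.
probs = [0.72, 0.75, 0.78, 0.71, 0.55, 0.52, 0.58, 0.51]
outcomes = [1, 1, 0, 1, 1, 0, 0, 1]

for mean_pred, hit_rate, n in calibration_bins(probs, outcomes):
    gap = hit_rate - mean_pred  # negative = below the diagonal
    print(f"predicted {mean_pred:.2f}  actual {hit_rate:.2f}  gap {gap:+.2f}  n={n}")
```

Each returned row is one dot on the chart. With eight predictions the gaps mean nothing, which is exactly the sample-size problem: the same code only becomes informative with hundreds of rows per run.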
What good calibration looks like
A well-calibrated forecaster's chart has:
- Most bins close to the diagonal (within 5 percentage points)
- The deviations look random, not systematic
- The bins with the most data (usually the middle ranges, 40-60%) are the most accurate
- The extremes (under 10%, over 90%) might be off because there's not much data there
Importantly: a well-calibrated forecaster is not the same as an accurate one.
A forecaster who predicts 50% on every single match would be perfectly calibrated if the predicted outcome occurs in exactly 50% of matches. But they'd also be useless: they're not telling you anything you didn't already know.
The best forecaster has predictions that are both calibrated (probabilities mean what they say) and discriminating (they confidently say 80% on the right matches and confidently say 20% on the right ones, instead of hedging at 50% on everything).
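One common way to put a number on "calibrated and discriminating" is the Brier score, the mean squared gap between the probability given and the 0/1 result (lower is better). The sketch below is an illustration with made-up data, and the Brier score is our choice for the example rather than the only possible measure. It compares two forecasters who are both perfectly calibrated on the same 20 matches.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Made-up results: 20 matches, 10 of which the predicted event happens.
outcomes = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8

# Forecaster A hedges at 50% on everything. 10 of 20 hit, so A is calibrated.
hedger = [0.5] * 20

# Forecaster B says 80% on the first ten (8 hit) and 20% on the last ten
# (2 hit), so B is calibrated too, but B separates the matches.
sharp = [0.8] * 10 + [0.2] * 10

print(f"hedger: {brier_score(hedger, outcomes):.2f}")
print(f"sharp:  {brier_score(sharp, outcomes):.2f}")
```

Both forecasters would sit exactly on the diagonal of a calibration chart, yet the second one scores better because its probabilities actually distinguish one match from another.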
What bad calibration looks like
Three failure modes show up on real charts.
Overconfidence near the extremes
Most common. The forecaster's high-confidence picks (80%+) win less often than predicted. The outcomes they dismissed (under 20%) happen more often than predicted: the underdogs they wrote off come through.
This shape is visible as a flattening near the extremes. The dots at 80% and 90% drop below the diagonal; the dots at 10% and 20% rise above it.
Causes: model overfitting to favourites, ignoring tail risk, not factoring in variance properly. Common in models built on point estimates rather than full distributions.
Systematic bias toward favourites
The forecaster consistently overrates favourites and underrates underdogs. The dots fall below the diagonal across the upper half and above the diagonal across the lower half.
Causes: training the model on a dataset where favourites win disproportionately (they do in football, but the model has overlearned the pattern), or favouring big-name teams in feature engineering.
Bin-specific weirdness
The 30-40% bin is consistently 10 points off, but every other bin is fine. There's something specific about that probability range that the model gets wrong.
Causes: usually a model boundary. The way our model handles "close-to-toss-up" matches is different from how it handles "clearly favoured" or "clearly underdog" matches, because Dixon-Coles and Poisson behave slightly differently in those regimes.
This is the kind of thing you find from publishing the chart. We've found bin-specific issues in our own model and fixed them. We could not have found them without the chart.
How to interpret sample sizes
Calibration data has a sample-size problem. A bin with 5 predictions in it tells you almost nothing: random variance dominates.
When reading our chart (or anyone else's), check:
- How many predictions per bin? Most published charts annotate this. We do.
- What's the total sample? Under a few hundred predictions, your "calibration" is mostly luck.
- What time period? A model that was calibrated 12 months ago may have drifted; a model that was calibrated only on the last 30 matches doesn't have enough data.
A genuinely calibrated forecaster's chart looks tighter as the sample grows. Yours might be drifting. Ours has been, in specific bins, and we publish how we're correcting it.
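To see how much a bin's hit rate can move through luck alone, you can put an interval around it. The sketch below uses the Wilson score interval for a binomial proportion at roughly 95% confidence (z = 1.96). That interval is our choice for the illustration, the function name `wilson_interval` is hypothetical, and the counts are made up.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a hit rate of `hits` out of `n` predictions."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p_hat = hits / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# The same observed 60% hit rate, at two very different bin sizes.
for hits, n in [(3, 5), (300, 500)]:
    low, high = wilson_interval(hits, n)
    print(f"{hits}/{n}: hit rate {hits / n:.0%}, plausible range {low:.0%} to {high:.0%}")
```

A bin with 5 predictions gives a range dozens of points wide, so a dot that far from the diagonal proves nothing. At 500 predictions the range shrinks to a few points either side, and a 10-point miss becomes a real finding.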
Why almost no tipster will show you their calibration chart
Two reasons.
The first is that most tipster sites don't actually have one. They don't track outcomes systematically against the probabilities they assigned. They track wins and losses against bookmaker odds. That's a different metric, and it doesn't tell you whether their probabilities mean anything.
The second is that for those who do track it, the chart is usually bad. Severe overconfidence at the extremes, systematic bias toward favourites, or both. Showing it would make their predictions look less impressive than the marketing.
The forecasters who do publish calibration charts are signalling something. Either they're confident the chart looks decent, or they're philosophically committed to transparency even when it doesn't.
We try to be the second. Sometimes the chart looks fine. Sometimes a specific bin shows we're 8 percentage points off, and we have to write a follow-up explaining why.
Common questions
"What if the chart is just close to the diagonal because the model predicts the obvious?"
This is a good question. A model that predicts 90% on Manchester City vs Crystal Palace and 50% on a derby will look calibrated even if it's mostly stating the obvious. The check for this is discrimination: does the model produce a wide spread of predictions, or is it bunched up near 50%? Spread + calibration = useful. Bunched + calibration = uninformative.
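A rough way to check for bunching is to look at the predictions alone, before any results come in. The sketch below is an assumption-laden illustration: the helper `spread_summary`, the 5-point band around 50%, and both sets of probabilities are invented for the example.

```python
from statistics import pstdev


def spread_summary(probs, band=0.05):
    """Describe how widely a set of predicted probabilities is spread.

    std_dev:       standard deviation of the predictions
    share_near_50: fraction of predictions within `band` of 0.5
    """
    if not probs:
        raise ValueError("probs must not be empty")
    near_even = sum(1 for p in probs if abs(p - 0.5) <= band) / len(probs)
    return {"std_dev": pstdev(probs), "share_near_50": near_even}


# Made-up prediction sets for two hypothetical models.
bunched = [0.48, 0.52, 0.50, 0.47, 0.53, 0.51]
spread = [0.15, 0.85, 0.30, 0.70, 0.90, 0.10]

print("bunched:", spread_summary(bunched))
print("spread: ", spread_summary(spread))
```

A high share near 50% together with a low standard deviation points to the bunched case. Neither number says anything about calibration on its own, so read them alongside the chart, not instead of it.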
"What sample size do I need before I trust calibration data?"
Several hundred predictions, minimum. Several thousand for confident statements about specific bins. Our calibration page shows our current sample and the per-bin counts.
"Can a model be calibrated for one league and not another?"
Absolutely. Premier League calibration tells you nothing about World Cup calibration. International tournaments have different dynamics, smaller samples, more variance per match. We track these separately.
"What's the worst calibration miss you've published?"
Without giving away the headline numbers, our 70-80% probability bin has been our weakest historically. We've over-predicted favourites in that range. The fix is in our Phase 2 model rebuild, which is shipping shortly. The before/after chart will be on the calibration page.
Where to look next
If you take one thing from this post: when a forecaster shows you their accuracy, ask for the calibration chart. If they don't have one or won't show it, that's the answer.
OddsIQ provides AI analysis, not financial or betting advice. Past performance does not guarantee future results. Gamble responsibly: BeGambleAware, GamCare, GamStop.