What We Got Wrong This Season (Calibration Honesty, Q1 2026)
We've been live for a few months. Here are the matches our model got most badly wrong, what they tell us, and where the real biases are.
Most football tipster sites publish their wins. We're going to publish our losses.
Specifically, we're going to walk through the matches our model got most badly wrong this season — where we assigned high probability to one outcome and the opposite happened. Then we'll talk about what those misses tell us about where the model has structural blind spots, and what we're doing about them.
This is the first in what will be a quarterly series. Calibration is the gold standard for evaluating a probabilistic forecaster, and the most reliable way we know to maintain calibration over time is to publicly examine where you're miscalibrated.
A note on this post: the specific matches and probability values below will be filled in from our actual track record once we've finalised the Q1 numbers. The structure of the analysis — what categories of error we look for, how we frame "wrong" — is what we want to commit to publicly.
A few clarifications upfront.
Wrong is not "the favourite lost." A 60% prediction will be "wrong" 40% of the time. That's how probability works. We don't beat ourselves up for matches where our 60% pick lost — those matches are part of how 60% means 60%.
Wrong is "predicted probability didn't match observed frequency." If we say 80% on a hundred matches and the favourite wins 60 of them, we were systematically overconfident on those matches. That's a real failure that needs explaining.
Wrong is also "individual matches where our number was clearly off." A specific match where we said 85% and the favourite lost is a normal variance event. But a specific match where we said 85%, the favourite lost convincingly, and post-match analysis shows we missed an obvious factor (heavy injuries, manager ill, weather chaos) — that's the kind of "wrong" we should learn from individually.
We'll cover both kinds.
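The first kind of "wrong" (predicted probability versus observed frequency) can be checked mechanically. Below is a minimal sketch, assuming predictions are available as (probability, outcome) pairs. The function name `calibration_table` and the data layout are hypothetical, chosen for illustration only.

```python
from collections import defaultdict


def calibration_table(predictions, n_bins=10):
    """Group predictions into equal-width probability bins.

    predictions: iterable of (predicted_probability, outcome) pairs, where
    outcome is 1 if the predicted event happened and 0 otherwise.
    (Hypothetical data layout, for illustration only.)
    """
    bins = defaultdict(list)
    for p, outcome in predictions:
        # Clamp the index so that p == 1.0 lands in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, outcome))

    rows = []
    for idx in sorted(bins):
        items = bins[idx]
        n = len(items)
        mean_predicted = sum(p for p, _ in items) / n
        observed = sum(o for _, o in items) / n
        rows.append({
            "bin": (idx / n_bins, (idx + 1) / n_bins),
            "n": n,
            "mean_predicted": mean_predicted,
            "observed": observed,
            # Negative gap: overconfident. Positive gap: underconfident.
            "gap": observed - mean_predicted,
        })
    return rows


# The example from the text: 80% on a hundred matches, favourite wins 60.
example = [(0.80, 1)] * 60 + [(0.80, 0)] * 40
for row in calibration_table(example):
    print(row["bin"], row["n"], round(row["gap"], 2))
```

For that example the gap comes out at roughly -0.20, meaning twenty points of overconfidence in that bin. A single lost match contributes only one row of input and says nothing on its own.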
Across this quarter, our predictions in the 70-80% probability range have been the most poorly calibrated.
We've assigned a 70-80% probability to roughly [N] matches this quarter. Of those, [M] resulted in the predicted outcome. The calibration target for this bin is its midpoint, 75%; the observed rate is [X%].
If [X] is meaningfully below 75%, we've been overconfident on favourites in this range. If [X] is meaningfully above, we've been underconfident.
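What "meaningfully" means depends on sample size. One way to make it concrete, sketched here as an assumption rather than as our published method, is to put a Wilson score interval around the observed rate and ask whether the 75% target falls inside it. The function names and the counts used below are made up for illustration; they are not our Q1 numbers.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)
    ) / denom
    return centre - half_width, centre + half_width


def bin_verdict(successes, n, target=0.75):
    """Compare a bin's observed hit rate with its calibration target."""
    low, high = wilson_interval(successes, n)
    if high < target:
        return "overconfident"
    if low > target:
        return "underconfident"
    return "consistent with target"


# Made-up counts, NOT our track record.
print(bin_verdict(60, 100))  # observed 60% against a 75% target
print(bin_verdict(72, 100))  # observed 72% against a 75% target
```

With a hundred matches in the bin, the interval is roughly nine points wide on either side, so a gap of a few percentage points cannot be distinguished from noise in a single quarter. That is one reason the shape of the whole chart over time matters more than any one bin.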
We'll publish the actual numbers when we finalise the quarter. Our internal expectation, based on backtesting and on similar models in the literature, is that we're slightly overconfident on favourites in this range, typically by 2-4 percentage points. This is a known issue with Dixon-Coles + Elo models fitted to a single season of training data: the model learns that "favourites win" and over-applies it.
Our V2 model adds Platt scaling, a post-hoc recalibration layer that fits a logistic curve mapping raw model probabilities to observed frequencies, specifically to fix this kind of issue. It ships shortly. The Q2 calibration analysis will be the first to reflect the V2 numbers.
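For readers unfamiliar with Platt scaling, here is a minimal sketch of the idea in plain Python. It is not the V2 implementation. It assumes a binary outcome (predicted result happened or not) and applies the logistic fit to the log-odds of the raw probability, which is one common choice; the function names, learning rate, and step count are all illustrative, and the data is made up.

```python
import math


def logit(p, eps=1e-6):
    # Clip to avoid infinities at exactly 0 or 1.
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def fit_platt(raw_probs, outcomes, lr=0.1, steps=5000):
    """Fit calibrated = sigmoid(a * logit(raw) + b) by gradient descent
    on log loss. Starts from a=1, b=0, i.e. the identity mapping."""
    xs = [logit(p) for p in raw_probs]
    n = len(xs)
    a, b = 1.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, outcomes):
            err = sigmoid(a * x + b) - y  # derivative of log loss wrt the logit
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(p, a, b):
    return sigmoid(a * logit(p) + b)


# Made-up history: 80% picks that won 14 of 20, 60% picks that won 11 of 20.
raw = [0.8] * 20 + [0.6] * 20
won = [1] * 14 + [0] * 6 + [1] * 11 + [0] * 9
a, b = fit_platt(raw, won)
print(round(apply_platt(0.8, a, b), 2), round(apply_platt(0.6, a, b), 2))
```

On this toy history the fitted mapping pulls the 80% picks down to about 70% and the 60% picks down to about 55%, matching the observed frequencies. A real fit would use held-out matches rather than the data the model was trained on, and a three-way home/draw/away model needs an extension of this two-outcome version.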
Three matches this quarter where our probability was furthest from the observed outcome:
[Match 1]: We said [X%], result was [outcome opposite to favourite].
Reconstructed factors: [list of factors that the model didn't capture — injury news, tactical change, weather, etc.]. This is the kind of miss that's correctable in principle (better team-news ingestion, weather data) but expensive to capture in practice. We'll keep accepting these for now.
[Match 2]: We said [Y%], result was [unexpected].
Reconstructed factors: [...]. This was a smaller miss but interesting because it pattern-matched to a recurring blind spot. Our model treats home advantage as a constant; this match suggests we should differentiate stadium-specific home advantage. Adding to the V2 roadmap.
[Match 3]: We said [Z%], result was [...].
Reconstructed factors: [...]. The miss here was just probability variance — the result was within the range our model considered plausible, just on the unfavourable side. Less of a structural lesson, more of a reminder that probability has tails.
The full calibration chart, covering all matches this quarter, is on our calibration page and is updated weekly.
The shape of the chart matters more than any individual number. A chart that's close to the diagonal across all bins is a calibrated model. A chart with systematic deviation in specific bins points to specific biases.
Three concrete changes to the model based on this quarter's data:
1. Recency-weighted form (V2 feature, shipping shortly). The current model uses long-run Elo without aggressive recency weighting. We've identified specific matches this quarter where teams in clearly better recent form lost to teams with stronger long-run ratings. The V2 weighting addresses this.
2. Automatic recalibration via Platt scaling (V2 feature). Rather than manually retuning the model when calibration drifts, we'll apply a learned post-hoc adjustment that maps raw model probabilities to observed frequencies. This is standard practice in machine learning and should keep our calibration tighter without requiring full model rebuilds.
3. Confederation-adjusted strength (V2 feature). This currently matters more for international matches than for the Premier League, but the principle (different competitive contexts have different baseline patterns) applies to domestic football too. We'll be exploring how to extend this idea to specific subsets of league matches.
These features are part of the V2 model rebuild that ships this quarter.
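To illustrate what recency weighting means in the first item, here is one simple scheme: an exponentially decayed average of points per match. This is a sketch of the general idea, not the V2 weighting itself; the function name, the half-life of five matches, and the result sequences are all hypothetical.

```python
def recency_weighted_form(points, half_life=5.0):
    """Exponentially weighted average of points per match.

    points: league points per match (3, 1 or 0), oldest first.
    half_life: number of matches after which a result's weight halves
    (an illustrative value, not a tuned parameter).
    """
    if not points:
        raise ValueError("need at least one result")
    n = len(points)
    # The most recent match has age 0 and therefore weight 1.
    weights = [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]
    return sum(w * p for w, p in zip(weights, points)) / sum(weights)


# Two made-up teams with the same plain average (1.5 points per match).
improving = [0, 0, 0, 3, 3, 3]  # lost early, winning lately
fading = [3, 3, 3, 0, 0, 0]     # won early, losing lately
print(recency_weighted_form(improving) > recency_weighted_form(fading))
```

An unweighted average rates these two teams identically. The decayed version rates the improving team higher, which is the pattern described in the first item.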
Two known biases we've documented and chosen not to fix.
Injury data. Covered in detail in this post. The data is too poor to use reliably. We accept the cost.
Referee variance. Specific referees produce measurable differences in cards, penalties, and added time. Adding this would require ingesting refereeing data and tracking specific officials, which is more infrastructure than the marginal predictive gain justifies. We accept the cost.
These are deliberate trade-offs. Telling you about them publicly is the only way the trade-offs stay honest.
A few reasons we publish this.
External calibration. Anyone can claim their model is good. Showing the misses lets the reader judge for themselves. If you spot patterns we missed, we want to know.
Internal discipline. Knowing this post is coming forces us to actually run the calibration analysis on a regular cadence. Without it, calibration drift is invisible until something dramatic happens.
Trust building. A model that publishes losses is more credible than one that only publishes wins. Most readers know this intuitively but rarely demand it. We're trying to set the standard.
Differentiation. Almost every other "AI football" tool you can subscribe to won't show you their misses. They'll show you a hand-picked record of recent winners. The fact that we publish this honestly is itself a signal about how the rest of the operation works.
We'll publish the Q2 update in three months. By then, the V2 model will have been live for two months and we'll have data on whether the recalibration features actually fix the issues we've identified above.
If the V2 calibration is worse than V1 in any specific bin, we'll publish that too.
OddsIQ provides AI analysis, not financial or betting advice. Past performance does not guarantee future results. Gamble responsibly: BeGambleAware, GamCare, GamStop.