Methodology
Hierarchical Bayesian skill model and per-PA Monte Carlo simulator, end to end
This project predicts a full distribution of runs scored per team per MLB game, then derives win, run-line, and totals probabilities by simulating each matchup roughly 10,000 times. Predictions are graded daily against sportsbook lines. Sports betting markets are used as a calibration benchmark, not as a gambling application. Sharp market participants push lines toward true probabilities quickly, which makes them a higher-quality probability signal than most independently constructed models.
The framework is borrowed from quantitative finance and decision theory: expected value, the Kelly criterion (Kelly 1956, originally an information-theory result), and probability calibration. The domain is baseball; the methods are standard in applied statistics and quant analysis.
The pipeline covers data ingestion from multiple APIs, feature derivation from pitch-level Statcast data, hierarchical Bayesian inference via NUTS, daily evaluation against actual outcomes, and a production frontend that refreshes each morning of the season.
The current model (v2, live since May 12, 2026) is a two-layer system: a hierarchical Bayesian skill model (Dirichlet-Multinomial over the eight plate-appearance outcomes, fit with NUTS via numpyro/JAX) followed by a per-PA Monte Carlo simulator that plays out each inning with rest-aware bullpen rules, an empirical baserunner-advancement table, and per-game posterior draws to propagate parameter uncertainty. It replaced a prior XGBoost regressor (v1, archived) after a 542-game head-to-head backtest in which v2 won on every metric.
Prediction target
Per-team run distribution per game
Skill model
Hierarchical Dirichlet-Multinomial (8 outcomes)
Sampler
NUTS via numpyro / JAX
Training data
401,826 PAs across 2024 + 2025 + 2026-YTD
Simulator
K=30 posterior draws × ~333 inning sims each (default ~10,000 sims total)
Probability output
Empirical quantiles with p10/p50/p90 bands
Sizing rule
Quarter-Kelly on flagged plays
Cutover from v1
May 12, 2026
SDP @ PHI
Thu, Jun 4, 2026 · 10:05 AM PT
PHI expected runs
4.75
p10–p90: 1.0 – 9.0
SDP expected runs
4.25
p10–p90: 1.0 – 8.0
PHI win probability
57.2%
p10–p90: 50.7% – 62.2%
Total runs: μ 9.00, p10–p90 4.0–15.0. Book O/U line: 8.5. Bars are the empirical PMF from the raw 10,000-sim run array, binned 0..20.
May 12, 2026
v2 cutover: hierarchical Bayesian skill model + per-PA Monte Carlo simulator
April 21, 2026 (v1)
Dynamic starter/bullpen inning split in batting-split features
See full changelog (2 older entries)
April 20, 2026 (v1)
Raised +EV thresholds from 3% to 4.5% (ML/RL) and 6.5% (totals)
April 2026 (v1)
Win probability switched from Poisson to negative binomial
The system is split into a skill layer (Bayesian inference, refit nightly) and a simulation layer (Monte Carlo, run per game per scoring pass). The skill layer learns time-stable parameters from years of Statcast data; the simulator consumes them to produce per-game run distributions that the markets layer turns into win, total, and run-line probabilities with edge and sizing.
Inputs
Statcast pitch data
~400k PAs, 2024+25+26-YTD
MLB Stats API
Schedule, lineups, rosters, boxscores
The Odds API
ML, RL, totals
Active bullpen workload
pitcher_workload table (rest)
Bayesian Skill Layer
Batter D-M
812 batters × 2 platoon cells
Pitcher D-M
1074 pitchers × 2 roles (SP/RP)
Park log-PF
30 venues, residual wOBA
Sampler
NUTS via numpyro/JAX, 4 chains × 2000 draws
Monte Carlo Simulator
Per-PA outcome sampler
Vectorized over 8 categories
Empirical advancement
P(state', runs | state, outs, outcome)
Rest-aware bullpen
Active roster + 1d/2d workload caps
K=30 posterior draws × ~333 inning sims
~10,000 simulated games per matchup
Markets & EV
Empirical win probability
p10/p50/p90 bands from posterior draws
Totals & RL distributions
Quantiles over simulated runs
Edge vs sportsbook
4.5% ML/RL, 6.5% totals
Quarter-Kelly sizing
0.25 × f*, capped
Nightly train-v2 runs a full NUTS refit on a GitHub Actions runner (~4 AM PT). On success it triggers daily-pipeline-v2 via workflow_run, guaranteeing scoring runs against fresh posteriors. A separate refresh-lineups-v2 cron checks every 30 minutes between 7 AM and 4 PM PT and re-scores any game whose posted lineup has changed (detected via SHA-1 hash of sorted batter IDs).
Predictions reflect the best information available up to first pitch: lineups, scratches, and late odds all feed back into the score. Once a game starts, its row is frozen and the evaluation ledger uses that final pre-game value. A flag visible on /games in the morning can disappear later if the posted lineup re-score drops the edge below threshold, and in that case the bet is also removed from /history and from model evaluation. Posterior refits happen on their own nightly schedule, independent of the intraday lineup refresh.
Schedule & Bullpen
Refresh games, scores, pitcher_workload per pitcher per day
Lineups & Odds
Posted lineups via Stats API; ML/RL/totals from The Odds API
Posterior Refit
Nightly NUTS run: batter, pitcher, park (12 min on M-series)
Score Games
K=30 posterior draws × ~333 inning sims each, ~10,000 total per matchup
Derive Markets
Empirical win prob, totals, RL from simulated run distributions
Verify & Publish
Anti-correlation, calibration, posterior-age checks; write to Supabase
Data sources
Each plate appearance is modeled as a categorical draw over eight outcomes: strikeout, walk, hit-by-pitch, single, double, triple, home run, in-play out. Three separate hierarchical models learn the additive log-odds offsets that each actor (batter, pitcher, venue) contributes to those eight logits, with OUT as the reference category.
Generative model (batter, schematic)
σb ~ HalfNormal(1)7 // batter-level scale per outcome
zi ~ Normal(0, 1)7 // non-centered batter offsets
βi = μ + σb ⊙ zi //batter i's log-odds vector
πi = softmax([0, βi])
yi ~ Multinomial(ni, πi) // aggregated per-batter PA counts
Four implementation details that matter at this scale:
- Non-centered parameterization. Sampling zi from a unit normal and forming βi = μ + σ ⊙ ziavoids Neal's funnel pathology in hierarchical models, where the geometry of the posterior becomes too distorted for HMC to traverse efficiently.
- Multinomial likelihood, not per-PA Categorical. Aggregating each batter's PAs into outcome counts and using pm.Multinomial is mathematically identical to a per-PA pm.Categorical but ~400× faster on 400k PAs (~3 min vs ~19 h projected). Each leapfrog step evaluates a per-actor log-likelihood rather than a per-row one.
- Platoon and role splits. Batters are fit per (batter, vs_LHP) cell; pitchers per (pitcher, role ∈ {SP, RP}). Position-player pitchers are dropped from the pitcher pool.
- Park as residual log-PF. Park is fit last, on the residual wOBA after batter and pitcher effects are accounted for. A per-venue scalar park_log[v]shifts the seven non-OUT logits additively, weighted by each outcome's wOBA coefficient, so HR and 3B move most under park and K is unaffected.
Sampler diagnostics (4 chains, 2000 draws, 2500 tune)
| Model | R-hat | min ESS | Divergences | Wall |
|---|---|---|---|---|
| Batter (812 × platoon) | 1.00 | 755 | 0 | 7.1 min |
| Pitcher (1074 × role) | 1.00 | 1707 | 0 | 4.5 min |
| Park (30 venues) | 1.00 | 15.9k | 0 | 6 s |
R-hat is the Gelman-Rubin convergence statistic; values near 1.00 indicate the chains have mixed and are sampling from the same posterior. Effective sample size (ESS) measures how many independent draws the autocorrelated chains are equivalent to; minimum ESS > 400 is the gate.
For each scheduled game we draw K=30 random posterior samples and run N inning-level simulations per draw. N defaults to ~333 in production (10,000 total sims) and 33 in the acceptance-gate test (990 total). Every draw is a coherent realization of all batter, pitcher, and park parameters together, so spread across draws becomes a posterior band on the win rate; the inner-loop variance is the aleatoric run-scoring noise within a fixed parameter setting.
Per-PA outcome sampler (vectorized)
+ park_log[v] · wOBA_weightk
+ form_noisek // sigma = 0.13, zero-sum across game
π = softmax([0, ℓ1..7])
outcome ~ Categorical(π)
Indices: batter b, pitcher p, role ∈ {SP, RP}, venue v, outcome k ∈ 1..7 (K, BB, HBP, 1B, 2B, 3B, HR; OUT is the reference 0 logit).
On contact, baserunner state evolves via an empirical advancement table built from 365k Statcast PAs (2024+25). The table is a flat lookup of P(new_state, runs_scored, outs_added | state, outs, outcome, out_subtype). Cells with fewer than 100 observations are linearly shrunk toward the outcome-conditional marginal P(· | outcome, subtype), with weight n_cell / 100 on the cell and the rest on the marginal. HR, BB, and HBP advances are deterministic forced moves; the empirical estimates were unreliable there because bases-loaded variants were sparse.
Bullpen management is rest-aware in live mode. Starters are pulled at pa_count ≥ 24 (a pitch-count proxy, roughly 95 pitches at 3.95 P/PA). Relievers are drawn from a queue built from the active 26-man roster, ordered by current rest, and any reliever with ≥ 6 outs in the last 1 day or ≥ 9 outs in the last 2 days is skipped. The per-pitcher workload table is refreshed in the daily pipeline from MLB boxscores.
Variance decomposition (per-game total runs)
+ Varform(per-game form noise, σ=0.13)
+ Varaleatoric(inning-level sim noise)
The form-noise scalar is calibrated against 2025 actuals to narrow the residual underdispersion in the empirical advancement table. It is not a tuning knob to be re-fit on market backtests; doing that mixes concerns and tends to mask modeling errors as noise.
Acceptance gate (200 stratified 2025 games × 990 sims × 2 sides = 396k team-game samples): simulated mean runs/team-game 4.59 vs actual 4.38 (+4.86%, clears the 5% mean gate); simulated variance 9.72 vs actual 10.32 (−5.86%, clears the relaxed 7% variance gate, misses the original 5%). The model is mildly underdispersed in the tails, so blowouts and large totals get slightly less probability mass than they should. Closing that gap is a v2.1 item.
Market probabilities are estimated directly from the simulated run distributions; there is no parametric assumption like Poisson or negative binomial on top of the sim output. Let hi, ai be the home and away runs in sim i for i = 1..N:
P(total > L) = (1/N) · Σi 𝟙[hi + ai> L]
P(home covers sh) = (1/N) · Σi 𝟙[hi − ai > −sh]
Sign convention: shis the book's home spread (negative for a home favorite, so sh= −1.5 means home must win by > 1.5). Pushes split 50/50 at integer lines.
The p10/p90 win-probability bandis computed per posterior draw, not from the inning-level samples. For each of the K=30 draws we compute that draw's win probability across its inner sims, then take p10 and p90 over the K draw-level probabilities. The band is a measure of posterior parameter uncertainty about the win rate, separated from the aleatoric run variance. Anti-correlation is enforced by construction: Paway(p10) = 1 − Phome(p90).
A play is flagged when the modeled probability exceeds the sportsbook's de-vigged implied probability by more than 4.5% (ML, RL) or 6.5% (totals). The totals bar is higher because totals markets are noisier and respond more slowly to fresh information. Sizing uses the Kelly criterion:
stake = clamp(0.25 · f*, 0, max_stake) // quarter-Kelly with cap
Full Kelly is theoretically optimal for long-run log-wealth growth (Thorp, Shannon, Kelly 1956) but draws down aggressively on losing streaks. Quarter-Kelly trades some growth rate for a much tighter drawdown distribution. A high-variance flag fires when a team's simulated runs stdev exceeds 4.0, surfacing games where the model itself is unusually uncertain about scoring.
Before cutover, v2 was benchmarked against the frozen v1 XGBoost baseline on every completed 2026 game where both models had a prediction and the v2 sim used a clean live-bullpen queue. v1's code is preserved verbatim at SHA a84b4dd in v2/evaluation/baseline_v1/ so the comparison is stable.
| Metric | v1 | v2 | Δ | Gate |
|---|---|---|---|---|
| Brier score (ML) | 0.2570 | 0.2393 | −6.88% | PASS |
| Log-loss (ML) | 0.7244 | 0.6713 | −7.33% | PASS |
| Max calibration gap | 41.92% | 3.20% | −38.7pp | PASS |
| ROI moneyline (flagged) | +15.82% | +26.25% | +10.4pp | — |
| ROI run line (flagged) | +5.04% | +14.96% | +9.9pp | — |
| ROI totals (flagged) | −12.83% | −7.58% | +5.3pp | — |
Brier and log-loss are both proper scoring rules, so the comparison is on calibration and sharpness together, not just hit rate on winners. v1's 41.9% max calibration gap is partly thin-bin variance (the worst decile had few games in it), but the directional read still holds: v2 reports a tighter and better-calibrated probability. v2 is also more selective on flagged plays (~20-25% fewer flags) yet posts higher ROI on all three markets, which is the expected pattern when a model gets sharper.
Backtest infrastructure is in v2/evaluation/ (replay populates the v2 prediction table for a date range; backtester joins v1, v2, and actuals and emits the head-to-head report). Live performance, including ongoing Brier and calibration, is on the Performance page.
Probabilistic Modeling
Simulation & Data
Database
Frontend
Orchestration
Expand v1 methodology
v1 was a gradient-boosted regressor (XGBoost, reg:squarederror) on 14 hand-crafted features per team per game, trained with TimeSeriesSplit cross-validation and GridSearchCV, retrained daily on all completed games of the season. It targeted expected runs per team directly; win, total, and run-line probabilities were derived afterward from a negative binomial joint score distribution with dispersion r=6, with isotonic regression on out-of-fold predictions for calibration.
Features (14)
| Feature | Description |
|---|---|
| xfip | Starter expected FIP from Statcast pitch data |
| xfip_bullpen | IP-weighted bullpen xFIP |
| starter_whip | Starter walks + hits per inning pitched |
| bullpen_k_9 | IP-weighted bullpen K/9 |
| batting_ops | Team OPS, dynamic starter/bullpen handedness blend |
| batting_iso | Team ISO (SLG − AVG), dynamic blend |
| batting_k_pct | Team K%, dynamic blend |
| avg_last5 | 5-game rolling avg runs scored |
| avg_last10 | 10-game rolling avg runs scored |
| std_last5 | 5-game rolling std dev (volatility) |
| park_factor | Venue run-scoring factor (Baseball Savant) |
| is_home | Home-field indicator (0/1) |
| own_bp_outs_2d | Own bullpen reliever outs in prior 2 days (rest) |
| opp_bp_outs_2d | Opp bullpen reliever outs in prior 2 days (fatigue) |
Why v2 replaced it
- v1 had no concept of individual batter or pitcher skill; everything was team-aggregate. Lineup changes, platoon advantages, and bullpen identity were invisible to the model.
- Win-probability calibration degraded in the tails (max gap 41.9% across deciles), driven by sparse decile bins on a point-estimate output.
- Parameter uncertainty was not represented. The model produced a point prediction; the negative binomial added run-scoring variance on top of it, but the model itself reported no confidence band.
- Feature engineering was a maintenance burden. Twelve to fourteen features required ongoing tuning of split blends, IP weights, and fallback rules. v2 replaces all of that with per-actor posteriors learned from raw PAs.
v1 predictions before May 12, 2026 are still served from the archive tables (model_outputs_v1_archive, model_outputs_season_v1_archive) so the historical record is unchanged. A green vertical line on every time-series chart marks the cutover.