Methodology

Hierarchical Bayesian skill model and per-PA Monte Carlo simulator, end to end

Project Overview

Hierarchical Bayesian skill model + per-PA Monte Carlo simulator for probabilistic run prediction

This project predicts a full distribution of runs scored per team per MLB game, then derives win, run-line, and totals probabilities by simulating each matchup roughly 10,000 times. Predictions are graded daily against sportsbook lines. Sports betting markets are used as a calibration benchmark, not as a gambling application. Sharp market participants push lines toward true probabilities quickly, which makes them a higher-quality probability signal than most independently constructed models.

The framework is borrowed from quantitative finance and decision theory: expected value, the Kelly criterion (Kelly 1956, originally an information-theory result), and probability calibration. The domain is baseball; the methods are standard in applied statistics and quant analysis.

The pipeline covers data ingestion from multiple APIs, feature derivation from pitch-level Statcast data, hierarchical Bayesian inference via NUTS, daily evaluation against actual outcomes, and a production frontend that refreshes each morning of the season.

The current model (v2, live since May 12, 2026) is a two-layer system: a hierarchical Bayesian skill model (Dirichlet-Multinomial over the eight plate-appearance outcomes, fit with NUTS via numpyro/JAX) followed by a per-PA Monte Carlo simulator that plays out each inning with rest-aware bullpen rules, an empirical baserunner-advancement table, and per-game posterior draws to propagate parameter uncertainty. It replaced a prior XGBoost regressor (v1, archived) after a 542-game head-to-head backtest in which v2 won on every metric.

Prediction target

Per-team run distribution per game

Skill model

Hierarchical Dirichlet-Multinomial (8 outcomes)

Sampler

NUTS via numpyro / JAX

Training data

401,826 PAs across 2024 + 2025 + 2026-YTD

Simulator

K=30 posterior draws × ~333 inning sims each (default ~10,000 sims total)

Probability output

Empirical quantiles with p10/p50/p90 bands

Sizing rule

Quarter-Kelly on flagged plays

Cutover from v1

May 12, 2026

Example Output: Today's Featured Game

Per-team simulated run distributions, win probability band, and total runs percentiles

MIN @ CLE

Tue, Jul 21, 2026 · 3:40 PM PT

CLE expected runs

4.53

p10–p90: 1.0 – 9.0

MIN expected runs

3.81

p10–p90: 1.0 – 8.0

CLE win probability

59.7%

p10–p90: 53.6% – 69.1%

Total runs: μ 8.00, p10–p90 3.0–14.0. Book O/U line: 8.5. Bars are the empirical PMF from the raw 10,000-sim run array, binned 0..20.

Model Changelog

Significant updates to model behavior, inference, and data handling, most recent first

May 12, 2026

v2 cutover: hierarchical Bayesian skill model + per-PA Monte Carlo simulator

Replaced the v1 XGBoost regressor with a two-layer probabilistic system. The skill layer fits three hierarchical Dirichlet-Multinomial models (batter, pitcher, park) via NUTS with full convergence diagnostics. The simulator runs ~10,000 vectorized games per matchup (K=30 posterior draws × ~333 inning sims each), propagating parameter uncertainty through the draws and feeding an empirical baserunner-advancement table built from 365k PAs of Statcast data. Backtest over 542 games (Mar 26 to May 9, 2026): Brier −6.9%, log-loss −7.3%, max calibration gap from 41.9% down to 3.2%, ROI improved on every market. The v1 model is preserved in archive tables and described in the legacy section below.

April 21, 2026 (v1)

Dynamic starter/bullpen inning split in batting-split features

v1 batting-split features started weighting starter vs bullpen innings by the opposing starter's trailing IP per start, replacing the fixed 60/40 assumption. Carried forward conceptually into v2 via per-PA pitcher identity.

See full changelog (2 older entries)

April 20, 2026 (v1)

Raised +EV thresholds from 3% to 4.5% (ML/RL) and 6.5% (totals)

Below the typical vig on a −110 line there is no cushion for model miscalibration. v2 inherits these thresholds for apples-to-apples comparison; v2.1 will retune them from edge-bucket ROI on live v2 data.

April 2026 (v1)

Win probability switched from Poisson to negative binomial

MLB run-scoring is overdispersed relative to Poisson. Negative binomial with r=6 brought v1 tail probabilities into line. In v2, win probability is derived empirically from the simulated run distributions and no parametric assumption about the run-scoring distribution is required.

System Architecture

From pitch-level data to calibrated market probabilities

The system is split into a skill layer (Bayesian inference, refit nightly) and a simulation layer (Monte Carlo, run per game per scoring pass). The skill layer learns time-stable parameters from years of Statcast data; the simulator consumes them to produce per-game run distributions that the markets layer turns into win, total, and run-line probabilities with edge and sizing.

Inputs

Statcast pitch data
~400k PAs, 2024+25+26-YTD
MLB Stats API
Schedule, lineups, rosters, boxscores
The Odds API
ML, RL, totals
Active bullpen workload
pitcher_workload table (rest)

↓

→

Bayesian Skill Layer

Batter D-M
812 batters × 2 platoon cells
Pitcher D-M
1074 pitchers × 2 roles (SP/RP)
Park log-PF
30 venues, residual wOBA
Sampler
NUTS via numpyro/JAX, 4 chains × 2000 draws

↓

→

Monte Carlo Simulator

Per-PA outcome sampler
Vectorized over 8 categories
Empirical advancement
P(state', runs | state, outs, outcome)
Rest-aware bullpen
Active roster + 1d/2d workload caps
K=30 posterior draws × ~333 inning sims
~10,000 simulated games per matchup

↓

→

Markets & EV

Empirical win probability
p10/p50/p90 bands from posterior draws
Totals & RL distributions
Quantiles over simulated runs
Edge vs sportsbook
4.5% ML/RL, 6.5% totals
Quarter-Kelly sizing
0.25 × f*, capped

Daily Pipeline

Six-step orchestration. Nightly refit then morning scoring, plus intraday lineup re-scoring

Nightly train-v2 runs a full NUTS refit on a GitHub Actions runner (~4 AM PT). On success it triggers daily-pipeline-v2 via workflow_run, guaranteeing scoring runs against fresh posteriors. A separate refresh-lineups-v2 cron checks every 30 minutes between 7 AM and 4 PM PT and re-scores any game whose posted lineup has changed (detected via SHA-1 hash of sorted batter IDs).

Predictions reflect the best information available up to first pitch: lineups, scratches, and late odds all feed back into the score. Once a game starts, its row is frozen and the evaluation ledger uses that final pre-game value. A flag visible on /games in the morning can disappear later if the posted lineup re-score drops the edge below threshold, and in that case the bet is also removed from /history and from model evaluation. Posterior refits happen on their own nightly schedule, independent of the intraday lineup refresh.

Schedule & Bullpen

Refresh games, scores, pitcher_workload per pitcher per day

→

Lineups & Odds

Posted lineups via Stats API; ML/RL/totals from The Odds API

→

Posterior Refit

Nightly NUTS run: batter, pitcher, park (12 min on M-series)

→

Score Games

K=30 posterior draws × ~333 inning sims each, ~10,000 total per matchup

→

Derive Markets

Empirical win prob, totals, RL from simulated run distributions

→

Verify & Publish

Anti-correlation, calibration, posterior-age checks; write to Supabase

Data sources

MLB Stats APIStatcast / pybaseballThe Odds APISupabase posteriors cache

Bayesian Skill Layer

Hierarchical Dirichlet-Multinomial per actor, fit with NUTS in numpyro/JAX

Each plate appearance is modeled as a categorical draw over eight outcomes: strikeout, walk, hit-by-pitch, single, double, triple, home run, in-play out. Three separate hierarchical models learn the additive log-odds offsets that each actor (batter, pitcher, venue) contributes to those eight logits, with OUT as the reference category.

Generative model (batter, schematic)

μ ~ Normal(0, 1)₇ // per-outcome league intercept
σ_b ~ HalfNormal(1)₇ // batter-level scale per outcome
z_i ~ Normal(0, 1)₇ // non-centered batter offsets
β_i = μ + σ_b ⊙ z_i //batter i's log-odds vector
π_i = softmax([0, β_i])
y_i ~ Multinomial(n_i, π_i) // aggregated per-batter PA counts

Four implementation details that matter at this scale:

Non-centered parameterization. Sampling z_i from a unit normal and forming β_i = μ + σ ⊙ z_iavoids Neal's funnel pathology in hierarchical models, where the geometry of the posterior becomes too distorted for HMC to traverse efficiently.
Multinomial likelihood, not per-PA Categorical. Aggregating each batter's PAs into outcome counts and using pm.Multinomial is mathematically identical to a per-PA pm.Categorical but ~400× faster on 400k PAs (~3 min vs ~19 h projected). Each leapfrog step evaluates a per-actor log-likelihood rather than a per-row one.
Platoon and role splits. Batters are fit per (batter, vs_LHP) cell; pitchers per (pitcher, role ∈ {SP, RP}). Position-player pitchers are dropped from the pitcher pool.
Park as residual log-PF. Park is fit last, on the residual wOBA after batter and pitcher effects are accounted for. A per-venue scalar park_log[v]shifts the seven non-OUT logits additively, weighted by each outcome's wOBA coefficient, so HR and 3B move most under park and K is unaffected.

Sampler diagnostics (4 chains, 2000 draws, 2500 tune)

Model	R-hat	min ESS	Wall
Batter (812 × platoon)	1.00	755	7.1 min
Pitcher (1074 × role)	1.00	1707	4.5 min
Park (30 venues)	1.00	15.9k	6 s

R-hat is the Gelman-Rubin convergence statistic; values near 1.00 indicate the chains have mixed and are sampling from the same posterior. Effective sample size (ESS) measures how many independent draws the autocorrelated chains are equivalent to; minimum ESS > 400 is the gate.

Per-PA Monte Carlo Simulator

K=30 posterior draws × inning-level sims per draw, propagating parameter uncertainty

For each scheduled game we draw K=30 random posterior samples and run N inning-level simulations per draw. N defaults to ~333 in production (10,000 total sims) and 33 in the acceptance-gate test (990 total). Every draw is a coherent realization of all batter, pitcher, and park parameters together, so spread across draws becomes a posterior band on the win rate; the inner-loop variance is the aleatoric run-scoring noise within a fixed parameter setting.

Per-PA outcome sampler (vectorized)

ℓ_k = batter_offset[b, hand_k] + pitcher_offset[p, role, hand_k]
+ park_log[v] · wOBA_weight_k
+ form_noise_k // sigma = 0.13, zero-sum across game
π = softmax([0, ℓ_1..7])
outcome ~ Categorical(π)

Indices: batter b, pitcher p, role ∈ {SP, RP}, venue v, outcome k ∈ 1..7 (K, BB, HBP, 1B, 2B, 3B, HR; OUT is the reference 0 logit).

On contact, baserunner state evolves via an empirical advancement table built from 365k Statcast PAs (2024+25). The table is a flat lookup of P(new_state, runs_scored, outs_added | state, outs, outcome, out_subtype). Cells with fewer than 100 observations are linearly shrunk toward the outcome-conditional marginal P(· | outcome, subtype), with weight n_cell / 100 on the cell and the rest on the marginal. HR, BB, and HBP advances are deterministic forced moves; the empirical estimates were unreliable there because bases-loaded variants were sparse.

Bullpen management is rest-aware in live mode. Starters are pulled at pa_count ≥ 24 (a pitch-count proxy, roughly 95 pitches at 3.95 P/PA). Relievers are drawn from a queue built from the active 26-man roster, ordered by current rest, and any reliever with ≥ 6 outs in the last 1 day or ≥ 9 outs in the last 2 days is skipped. The per-pitcher workload table is refreshed in the daily pipeline from MLB boxscores.

Variance decomposition (per-game total runs)

Var(total_runs) = Var_posterior(parameter draws)
+ Var_form(per-game form noise, σ=0.13)
+ Var_aleatoric(inning-level sim noise)

The form-noise scalar is calibrated against 2025 actuals to narrow the residual underdispersion in the empirical advancement table. It is not a tuning knob to be re-fit on market backtests; doing that mixes concerns and tends to mask modeling errors as noise.

Acceptance gate (200 stratified 2025 games × 990 sims × 2 sides = 396k team-game samples): simulated mean runs/team-game 4.59 vs actual 4.38 (+4.86%, clears the 5% mean gate); simulated variance 9.72 vs actual 10.32 (−5.86%, clears the relaxed 7% variance gate, misses the original 5%). The model is mildly underdispersed in the tails, so blowouts and large totals get slightly less probability mass than they should. Closing that gap is a v2.1 item.

From Simulated Runs to Market Probabilities

Empirical quantiles over the simulated run distributions, with per-draw posterior bands

Market probabilities are estimated directly from the simulated run distributions; there is no parametric assumption like Poisson or negative binomial on top of the sim output. Let h_i, a_i be the home and away runs in sim i for i = 1..N:

P(home wins) = (1/N) · Σ_i 𝟙[h_i > a_i]
P(total > L) = (1/N) · Σ_i 𝟙[h_i + a_i> L]
P(home covers s_h) = (1/N) · Σ_i 𝟙[h_i − a_i > −s_h]
Sign convention: s_his the book's home spread (negative for a home favorite, so s_h= −1.5 means home must win by > 1.5). Pushes split 50/50 at integer lines.

The p10/p90 win-probability bandis computed per posterior draw, not from the inning-level samples. For each of the K=30 draws we compute that draw's win probability across its inner sims, then take p10 and p90 over the K draw-level probabilities. The band is a measure of posterior parameter uncertainty about the win rate, separated from the aleatoric run variance. Anti-correlation is enforced by construction: P_away(p10) = 1 − P_home(p90).

A play is flagged when the modeled probability exceeds the sportsbook's de-vigged implied probability by more than 4.5% (ML, RL) or 6.5% (totals). The totals bar is higher because totals markets are noisier and respond more slowly to fresh information. Sizing uses the Kelly criterion:

f* = (p · b − q) / b // p = model prob, q = 1−p, b = decimal odds − 1
stake = clamp(0.25 · f*, 0, max_stake) // quarter-Kelly with cap

Full Kelly is theoretically optimal for long-run log-wealth growth (Thorp, Shannon, Kelly 1956) but draws down aggressively on losing streaks. Quarter-Kelly trades some growth rate for a much tighter drawdown distribution. A high-variance flag fires when a team's simulated runs stdev exceeds 4.0, surfacing games where the model itself is unusually uncertain about scoring.

Head-to-Head Backtest vs Frozen v1

542 games, March 26 to May 9, 2026. v2 wins on every metric.

Before cutover, v2 was benchmarked against the frozen v1 XGBoost baseline on every completed 2026 game where both models had a prediction and the v2 sim used a clean live-bullpen queue. v1's code is preserved verbatim at SHA a84b4dd in v2/evaluation/baseline_v1/ so the comparison is stable.

Metric	v1	v2	Δ	Gate
Brier score (ML)	0.2570	0.2393	−6.88%	PASS
Log-loss (ML)	0.7244	0.6713	−7.33%	PASS
Max calibration gap	41.92%	3.20%	−38.7pp	PASS
ROI moneyline (flagged)	+15.82%	+26.25%	+10.4pp	—
ROI run line (flagged)	+5.04%	+14.96%	+9.9pp	—
ROI totals (flagged)	−12.83%	−7.58%	+5.3pp	—

Brier and log-loss are both proper scoring rules, so the comparison is on calibration and sharpness together, not just hit rate on winners. v1's 41.9% max calibration gap is partly thin-bin variance (the worst decile had few games in it), but the directional read still holds: v2 reports a tighter and better-calibrated probability. v2 is also more selective on flagged plays (~20-25% fewer flags) yet posts higher ROI on all three markets, which is the expected pattern when a model gets sharper.

Backtest infrastructure is in v2/evaluation/ (replay populates the v2 prediction table for a date range; backtester joins v1, v2, and actuals and emits the head-to-head report). Live performance, including ongoing Brier and calibration, is on the Performance page.

Tech Stack

Production tools across probabilistic modeling, simulation, and frontend

Probabilistic Modeling

PyMCnumpyroJAXarvizNUTS sampler

Simulation & Data

NumPy (vectorized)pandaspybaseballStatcast pitch dataparquet cache

Database

Supabase (PostgreSQL)SQLAlchemyRLS public_read

Frontend

Next.js 16TypeScriptTailwind CSSshadcn/uiRecharts

Orchestration

GitHub Actions (cron + workflow_run)MLB Stats APIThe Odds API

A note on the sampler stack: PyMC describes the model; numpyro provides the JAX-backed NUTS implementation that actually samples. Pinning matters: numpyro 0.20.1 + jax 0.7.2 + jaxlib 0.7.2. Newer JAX removed an internal primitive (xla_pmap_p) that numpyro depends on, which breaks sampling silently.

Legacy: v1 XGBoost Model (pre-2026-05-12)

Predictions on the History and Performance tabs before the green cutover line come from this model

Expand v1 methodology

v1 was a gradient-boosted regressor (XGBoost, reg:squarederror) on 14 hand-crafted features per team per game, trained with TimeSeriesSplit cross-validation and GridSearchCV, retrained daily on all completed games of the season. It targeted expected runs per team directly; win, total, and run-line probabilities were derived afterward from a negative binomial joint score distribution with dispersion r=6, with isotonic regression on out-of-fold predictions for calibration.

Features (14)

Feature	Description
xfip	Starter expected FIP from Statcast pitch data
xfip_bullpen	IP-weighted bullpen xFIP
starter_whip	Starter walks + hits per inning pitched
bullpen_k_9	IP-weighted bullpen K/9
batting_ops	Team OPS, dynamic starter/bullpen handedness blend
batting_iso	Team ISO (SLG − AVG), dynamic blend
batting_k_pct	Team K%, dynamic blend
avg_last5	5-game rolling avg runs scored
avg_last10	10-game rolling avg runs scored
std_last5	5-game rolling std dev (volatility)
park_factor	Venue run-scoring factor (Baseball Savant)
is_home	Home-field indicator (0/1)
own_bp_outs_2d	Own bullpen reliever outs in prior 2 days (rest)
opp_bp_outs_2d	Opp bullpen reliever outs in prior 2 days (fatigue)

Why v2 replaced it

v1 had no concept of individual batter or pitcher skill; everything was team-aggregate. Lineup changes, platoon advantages, and bullpen identity were invisible to the model.
Win-probability calibration degraded in the tails (max gap 41.9% across deciles), driven by sparse decile bins on a point-estimate output.
Parameter uncertainty was not represented. The model produced a point prediction; the negative binomial added run-scoring variance on top of it, but the model itself reported no confidence band.
Feature engineering was a maintenance burden. Twelve to fourteen features required ongoing tuning of split blends, IP weights, and fallback rules. v2 replaces all of that with per-actor posteriors learned from raw PAs.

v1 predictions before May 12, 2026 are still served from the archive tables (model_outputs_v1_archive, model_outputs_season_v1_archive) so the historical record is unchanged. A green vertical line on every time-series chart marks the cutover.