rendered 2026-07-03T12:35:48Z

Calibration methodology — deriving and tracking BOTH numbers

"How close to reality are we?" MacroGuru publishes two numbers for every scenario: the probability it happens at all, and the % impact on each asset if it does. This doc is the contract for how each one is derived from history, how the uncertainty (including unknown-unknowns) is carried, and how we score ourselves once reality settles it.

Engine: macroguru/calibration/derive.py. Wired in bin/build_history.py (derivation) and bin/build_scorecard.py (scoring → /reality-check). Learn loop: bin/recalibrate.py. See also SCENARIO_LIFECYCLE_SOP.md §15C.

Nothing here is asserted. Each number is built, the build is recorded so it can be audited, and after the event it is scored — wins and losses, on the record.


1. The probability (will it happen?)

Reference-class forecasting (Kahneman's outside view, Tetlock/GJP), made transparent.

Decomposition: p = p_class × variant_share. The class rate — does an event of this mechanism class happen within the window at all? — is empirical, measured from the dated, sourced event library itself (macroguru/calibration/base_rates.py, published at /data/base_rates.json). The variant's share within its class — which oil disruption, which credit event — is the analyst's editorial call, pooled toward the class mean and labeled as editorial in every waterfall. The seam between measured and editorial is drawn exactly where it exists.

Inputs. The assigned prior p₀ (authored per the SOP), the scenario's measured class (its rarest required mechanism tag: min-λ over its tags, excluding the near-universal risk modifiers), the count of matched analogues n, and the crowd where a pinned market exists (Polymarket/Kalshi).
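A minimal sketch of the class selection described above. It assumes each tag carries a measured annual arrival rate λ and that a class rate is turned into a window probability under a Poisson assumption; the conversion formula is not stated in this doc, so treat it as one plausible reading. The names measured_class, window_probability and RISK_MODIFIERS, and all tag rates except the credit figure, are illustrative and not taken from base_rates.py.

```python
import math

# Hypothetical names for the near-universal risk modifiers that are
# excluded from class selection; the real list is not given in this doc.
RISK_MODIFIERS = {"risk_off", "risk_on"}


def measured_class(tags, rates_per_year):
    """Return (tag, lam) for the rarest required mechanism tag (min-lambda)."""
    candidates = {
        t: rates_per_year[t]
        for t in tags
        if t not in RISK_MODIFIERS and t in rates_per_year
    }
    if not candidates:
        raise ValueError("scenario has no measurable mechanism tag")
    tag = min(candidates, key=candidates.get)
    return tag, candidates[tag]


def window_probability(lam_per_year, window_years):
    """P(at least one arrival within the window), assuming Poisson arrivals."""
    return 1.0 - math.exp(-lam_per_year * window_years)


# Illustrative rates only; 4.84/yr is the credit figure quoted in this doc,
# the other two values are made up for the example.
rates = {"credit": 4.84, "oil_disruption": 0.5, "risk_off": 12.0}
tag, lam = measured_class(["credit", "oil_disruption", "risk_off"], rates)
p_class = window_probability(lam, window_years=0.5)
```

Here the scenario lands in the oil_disruption class because it is the rarest required mechanism, and the risk modifier is ignored even though it is tagged.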

The class rate (each field published in base_rates.json):

Build (each step is stored in prob_derivation.breakdown):

  1. Share pooling. variant_share = p₀ / p_class, shrunk toward the class's mean authored share with weight w = n/(n+6). Thin precedent ⇒ pulled toward the class mean; rich precedent ⇒ the analyst's share stands. after_pooling = p_class × share_pooled.
  2. Signal pool. Combine with the crowd in log-odds (geometric mean of odds — preserves a confident minority signal). Extremizing is retired (×1.0): single-signal extremizing is unjustified; it returns only if our own resolved ledger shows it adds skill.
  3. Unknown-unknowns + the class ceiling. Reserve 3% for the unmodelled, clamp to [2%, 97%] (Cromwell's rule: never 0 or 1) — and cap at p_class: a variant can never outrun its own class.
  4. Credible interval. Beta-style width from the effective precedent count n+6, widened when signals disagree and by the class-rate interval when the class itself is thin. Under thin evidence a single precise number is false precision, so we show a range.
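
Steps 1 to 3 above can be sketched as follows. This is an illustration under stated assumptions, not the code in derive.py: the crowd is pooled with equal weight, the 3% reserve is read as the 97% upper clamp, and derive_probability with its parameter names is hypothetical. Step 4 is left out because the interval width formula is not specified here.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def derive_probability(p0, p_class, class_mean_share, n, crowd=None,
                       k=6, floor=0.02, ceiling=0.97):
    """Return (final probability, breakdown of intermediate steps)."""
    breakdown = {}

    # Step 1: share pooling with weight w = n / (n + 6).
    share = p0 / p_class
    w = n / (n + k)
    share_pooled = w * share + (1.0 - w) * class_mean_share
    after_pooling = p_class * share_pooled
    breakdown["after_pooling"] = after_pooling

    # Step 2: pool with the crowd in log-odds (geometric mean of odds).
    # Equal weights are an assumption; extremizing is x1.0, so omitted.
    if crowd is not None:
        safe = min(max(after_pooling, 1e-6), 1.0 - 1e-6)  # keep logit finite
        pooled = inv_logit(0.5 * (logit(safe) + logit(crowd)))
    else:
        pooled = after_pooling
    breakdown["after_signal_pool"] = pooled

    # Step 3: clamp to [2%, 97%], then cap at the class rate.
    clamped = min(max(pooled, floor), ceiling)
    final = min(clamped, p_class)
    breakdown["final"] = final
    return final, breakdown


final, steps = derive_probability(
    p0=0.10, p_class=0.25, class_mean_share=0.30, n=6, crowd=0.15
)
```

With n = 6 the analyst's share (0.40) and the class mean share (0.30) get equal weight, giving after_pooling = 0.0875; the crowd at 15% then pulls the result up to roughly 11.5%, still well under the 25% class ceiling.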

Surfaced on every scenario page / mindmap as "Empirically anchored X% · 90% range A–B%" with a one-click waterfall of the steps above.

Methodology change — 2026-07-02. Until this date the anchor was the mean assigned probability of scenarios sharing a (category, timeline) — our own priors averaged, i.e. circular. It is replaced by the measured class rate above. Typical scenarios moved ±1–5pp; the ceiling binds nowhere in the current book (the book never claimed more than its class allows). Extremizing (was ×1.12) is retired the same day. The old behaviour is preserved in git history; this note is the public record of the change.

Same-day refinement (2026-07-02, founder review). Three class-assignment defects fixed: (1) the credit root-shock mapping was unsigned — bullish spread-tightening scenarios landed in the credit-stress class; now sign-aware (also trade-tension/recession/pandemic). (2) Tag-vocabulary equivalence: scenario tags and event tags drifted (events say banking_crisis where scenarios say credit), making some classes look far rarer than they are — class rates for drifted tags are now measured over the union of equivalent event tags (members listed in base_rates.json; credit λ 0.87→4.84/yr). (3) Scheduled catalysts (index go-lives, halvings, expiries, effective dates) are not Poisson arrivals — occurrence is calendar-certain, so the class shows as "scheduled" and the probability is the market-impact call, which stays editorial and labeled.

Cadence — when the numbers actually update, and how news moves them

Stated plainly, because "live" claims rot:


2. The impact % (how far does each asset move?)

Event-study abnormal returns (MacKinlay; Kothari–Warner), shrunk toward what we actually measured.

Inputs. The cascade reaction-function prior expected_pct (the propagation-graph number), and — from build_history.py — the measured abnormal return AR = r − (α + β·r_SPX) averaged (recency- and similarity-weighted) over this scenario's historical analogues, with its sample size n, directional hit_rate, and event-study confidence.
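
The abnormal-return input can be illustrated with a small market-model sketch. It assumes α and β are fitted by ordinary least squares on a pre-event estimation window, which is the standard event-study setup but is not confirmed for build_history.py; the function names and the sample returns are hypothetical, and the analogue weights stand in for whatever recency and similarity weighting the engine actually uses.

```python
def fit_market_model(asset_returns, spx_returns):
    """OLS fit of asset = alpha + beta * spx over an estimation window."""
    n = len(asset_returns)
    mean_x = sum(spx_returns) / n
    mean_y = sum(asset_returns) / n
    sxx = sum((x - mean_x) ** 2 for x in spx_returns)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(spx_returns, asset_returns))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def abnormal_return(r, r_spx, alpha, beta):
    """AR = r - (alpha + beta * r_SPX)."""
    return r - (alpha + beta * r_spx)


def weighted_mean_ar(ars, weights):
    """Weighted average of analogue ARs (weights are illustrative)."""
    return sum(a * w for a, w in zip(ars, weights)) / sum(weights)


# Made-up estimation window in which the asset is exactly 0.001 + 1.5 * SPX.
spx = [0.01, -0.02, 0.005, 0.015]
asset = [0.001 + 1.5 * x for x in spx]
alpha, beta = fit_market_model(asset, spx)

# Event day: the asset falls 5% while SPX falls 2%.
ar = abnormal_return(-0.05, -0.02, alpha, beta)
mean_ar = weighted_mean_ar([ar, -0.010], weights=[2.0, 1.0])
```

On the event day the market model explains a 2.9% drop, so only the remaining 2.1% counts as the abnormal move attributed to the event.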

Build (stored per market as impact):

  1. Reliability. 0.5·min(n,20)/20 + 0.3·max(0, 2·hit_rate−1) + 0.2·confidence: a weighted sum of sample size, directional consistency, and significance, in [0,1].
  2. Shrink the measured mean toward zero by the fraction 1 − reliability, i.e. scale it by reliability, so that few or noisy analogues collapse to "no reliable move" (James–Stein / empirical-Bayes intuition).
  3. Blend the shrunk measured value with the cascade prior, weighting measured by its reliability. The published central number is this blend mu.
  4. Band. A fat-tail-aware spread from cross-analogue dispersion (unreliable estimates get a wider band) plus a tail figure (realized macro moves run 2–4× the median).
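
Steps 1 to 3 above can be sketched as follows. This is a reading of the text, not the engine's code: shrinking "by 1 − reliability" is taken to mean multiplying the measured mean by reliability, confidence is assumed to lie in [0,1], and the function and key names are hypothetical. The band in step 4 is omitted because its formula is not given here.

```python
def reliability(n, hit_rate, confidence):
    """0.5*min(n,20)/20 + 0.3*max(0, 2*hit_rate-1) + 0.2*confidence."""
    size = min(n, 20) / 20
    consistency = max(0.0, 2.0 * hit_rate - 1.0)
    return 0.5 * size + 0.3 * consistency + 0.2 * confidence


def blended_impact(measured_ar, expected_pct, n, hit_rate, confidence):
    """Shrink the measured AR, then blend it with the cascade prior."""
    rel = reliability(n, hit_rate, confidence)
    # Step 2: shrink toward zero (assumed multiplicative).
    shrunk = rel * measured_ar
    # Step 3: weight the shrunk measurement by its reliability.
    mu = rel * shrunk + (1.0 - rel) * expected_pct
    return {"reliability": rel, "shrunk": shrunk, "mu": mu}


# Ten analogues, 75% directional hit rate, mid confidence -> reliability 0.5.
cell = blended_impact(measured_ar=-4.0, expected_pct=-2.0,
                      n=10, hit_rate=0.75, confidence=0.5)

# No analogues at all: the published number falls back to the cascade prior.
no_history = blended_impact(measured_ar=0.0, expected_pct=-2.0,
                            n=0, hit_rate=0.5, confidence=0.0)
```

A hit rate of 50% or lower contributes nothing to reliability, so a coin-flip direction record cannot prop up a large measured move.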

Surfaced on the mindmap reasoning panel as "hist A–B%" next to each projected move.


3. Tracking — the variance, after the fact

Two ledgers, because there are two numbers. We hold ourselves to both.

Probability → proper scoring on the resolved-forecast ledger (predictions_log.json, scored by build_scorecard.py): one-sided Brier (0 perfect, 0.25 no-skill, 1 confident-wrong), Murphy reliability/resolution decomposition, Brier skill score vs base rate and vs the crowd, calibration-by-bucket with Wilson bands.
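
A minimal sketch of three of the probability scores named above: the one-sided Brier score, the Brier skill score against a reference forecast, and a Wilson band for one calibration bucket. The function names are hypothetical, the forecasts are made-up examples, and z = 1.96 (a 95% band) is an assumption since the band level is not stated here. The Murphy decomposition is not shown.

```python
import math


def brier(probs, outcomes):
    """Mean squared error between forecast probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes, reference_probs):
    """1 - BS / BS_ref; positive means better than the reference."""
    return 1.0 - brier(probs, outcomes) / brier(reference_probs, outcomes)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a bucket's observed hit frequency."""
    if n == 0:
        return (0.0, 1.0)
    phat = successes / n
    denom = 1.0 + z * z / n
    centre = (phat + z * z / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)


# Made-up resolved forecasts, scored against the sample base rate.
probs = [0.8, 0.3, 0.6, 0.1]
outcomes = [1, 0, 1, 0]
base_rate = sum(outcomes) / len(outcomes)
bs = brier(probs, outcomes)
bss = brier_skill_score(probs, outcomes, [base_rate] * len(outcomes))
lo, hi = wilson_interval(successes=5, n=10)
```

Scoring against the crowd uses the same brier_skill_score call with the crowd's probabilities as reference_probs.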

Impact % → the magnitude backtest (impact_accuracy.json): across every scenario×asset cell that has both a published move and a measured analogue move, we score the published % against history with MAE / RMSE / bias / directional hit-rate / interval coverage and a skill score vs the naive "no-move" baseline (CRPS for a point forecast reduces to MAE — Gneiting–Raftery). Broken out by confidence band and by direction agreement. This is in-sample consistency — does what we publish match what history shows — not a forward test; the forward, out-of-sample magnitude record accrues in the predictions ledger as real events resolve.
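
The magnitude scores can be sketched as below, taking one list of published moves and one of measured analogue moves, both in percentage points. The function name, the sign convention for bias (published minus measured) and the sample cells are assumptions for illustration; interval coverage is omitted because it needs the published band, which is not modelled here.

```python
import math


def magnitude_scores(published, measured):
    """MAE, RMSE, bias, directional hit rate and skill vs 'no move'."""
    n = len(published)
    errors = [p - m for p, m in zip(published, measured)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    bias = sum(errors) / n  # > 0 means published sits above measured
    hits = sum(1 for p, m in zip(published, measured) if p * m > 0)
    # The naive baseline predicts 0 for every cell, so its MAE is mean |measured|.
    naive_mae = sum(abs(m) for m in measured) / n
    skill = 1.0 - mae / naive_mae if naive_mae > 0 else float("nan")
    return {"mae": mae, "rmse": rmse, "bias": bias,
            "hit_rate": hits / n, "skill_vs_no_move": skill}


# Made-up scenario x asset cells.
published = [2.0, -3.0, 1.0, -1.0]
measured = [1.0, -4.0, -1.0, -2.0]
scores = magnitude_scores(published, measured)
```

A skill of zero means the published numbers do no better than always publishing "no move"; a negative value means they do worse.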

Both are public at /reality-check.

Founding read (Jun 2026). Across 209k %-cells the published impact scored MAE ≈ 3.6pp, directional hit-rate ≈ 50% (a coin flip), and skill ≈ 0 vs "no move"; direction did not improve with our stated confidence. That is the honest bar we are now measured against, and it is exactly why the impact engine now shrinks each published number toward the measured record and why the learn loop exists.


4. The learning loop

bin/recalibrate.py reads both ledgers and proposes corrections (non-destructive — a human reviews, then applies via scenario_overrides.yaml / propagation.py, mirroring the existing apply_probs flow):

Each cycle: derive → publish with its basis + band → score on resolution → propose corrections → review & apply → re-derive. The gap to reality is the objective; this loop closes it.


Sources

Reference-class / outside view (Kahneman; Tetlock GJP); Beta-Binomial conjugate base rate (Jeffreys pseudo-counts); James–Stein / empirical-Bayes shrinkage; logarithmic opinion pool + extremizing (Satopää et al.; Neyman–Roughgarden on when to (anti-)extremize); Cromwell's rule & imprecise probabilities (Knightian uncertainty); event-study abnormal returns (MacKinlay 1997; Kothari–Warner 2007); Kish effective sample size; CRPS / pinball / interval score / coverage (Gneiting–Raftery 2007; Hyndman & Athanasopoulos); MAPE pitfalls → MAE/MASE (Hyndman & Koehler 2006); skill scores (Murphy 1988). Full URLs in the build-research notes.