"How close to reality are we?" MacroGuru publishes two numbers for every scenario: the probability it happens at all, and the % impact on each asset if it does. This doc is the contract for how each one is derived from history, how the uncertainty (including unknown-unknowns) is carried, and how we score ourselves once reality settles it.
Engine: macroguru/calibration/derive.py. Wired in bin/build_history.py (derivation) and bin/build_scorecard.py (scoring → /reality-check). Learn loop: bin/recalibrate.py. See also SCENARIO_LIFECYCLE_SOP.md §15C.
Nothing here is asserted. Each number is built, the build is recorded so it can be audited, and after the event it is scored — wins and losses, on the record.
Reference-class forecasting (Kahneman's outside view, Tetlock/GJP), made transparent.
Decomposition: p = p_class × variant_share. The class rate — does an event of this
mechanism class happen within the window at all? — is empirical, measured from the dated,
sourced event library itself (macroguru/calibration/base_rates.py,
published at /data/base_rates.json).
The variant's share within its class — which oil disruption, which credit event — is the
analyst's editorial call, pooled toward the class mean and labeled as editorial in every
waterfall. The seam between measured and editorial is drawn exactly where it exists.
Inputs. The assigned prior p₀ (authored per the SOP), the scenario's measured class
(its rarest required mechanism tag: min-λ over its tags, excluding the near-universal risk
modifiers), the count of matched analogues n, and the crowd where a pinned market
exists (Polymarket/Kalshi).
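As a minimal sketch of the min-λ class pick described above: the function name, the modifier set, and every rate except the credit figure are illustrative assumptions, not the contents of derive.py or base_rates.json.

```python
from typing import Dict, Iterable, Optional

# Hypothetical modifier tags; the real exclusion list is not given in this doc.
RISK_MODIFIERS = {"risk_off", "risk_on"}


def measured_class(tags: Iterable[str], lam_by_tag: Dict[str, float]) -> Optional[str]:
    """Return the rarest (lowest-lambda) mechanism tag, skipping risk modifiers."""
    candidates = [t for t in tags if t not in RISK_MODIFIERS and t in lam_by_tag]
    if not candidates:
        return None  # nothing measurable to anchor on
    return min(candidates, key=lambda t: lam_by_tag[t])


# Events per year. Only the credit rate appears in this doc; the rest are made up.
rates = {"oil_disruption": 0.6, "credit": 4.84, "risk_off": 12.0}
print(measured_class(["risk_off", "credit", "oil_disruption"], rates))  # oil_disruption
```

Taking the minimum means a scenario that needs both a credit event and an oil disruption is anchored on the rarer of the two, which is the binding constraint.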
The class rate (each field published in base_rates.json):
- λ_tag = share × λ_all. Jeffreys +0.5 on thin tags; a ~90% Gamma interval (Wilson–Hilferty) that the final credible interval inherits when the class is thin.
- p_class = 1 − exp(−λ·W): Poisson arrival within the scenario's stated horizon W.

Build (each step is stored in prob_derivation.breakdown):
- variant_share = p₀ / p_class, shrunk toward the class's mean authored share with weight w = n/(n+6). Thin precedent ⇒ pulled toward the class mean; rich precedent ⇒ the analyst's share stands. after_pooling = p_class × share_pooled.
- Ceiling at p_class: a variant can never outrun its own class.
- Range: … n+6, widened when signals disagree and by the class-rate interval when the class itself is thin. Under thin evidence a single precise number is false precision, so we show a range.

Surfaced on every scenario page / mindmap as "Empirically anchored X% · 90% range A–B%" with a one-click waterfall of the steps above.
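The point-estimate part of the build can be sketched as below. This is an illustration under stated assumptions, not the code in derive.py: the function and key names are hypothetical, the shrinkage is assumed to be a linear blend with weight w on the analyst's share, and the range step is left out because its exact construction is not spelled out here.

```python
import math

POOL_K = 6  # pseudo-count in w = n / (n + 6), taken from the text


def p_class(lam_per_year: float, window_years: float) -> float:
    """Poisson arrival: chance of at least one class event inside the window."""
    return 1.0 - math.exp(-lam_per_year * window_years)


def derive_probability(p0, lam_per_year, window_years, n_analogues, class_mean_share):
    """Return each step of the waterfall as a dict (hypothetical key names)."""
    pc = p_class(lam_per_year, window_years)
    if pc <= 0.0:
        raise ValueError("class rate and window must be positive")
    share = p0 / pc                                # analyst's implied share of the class
    w = n_analogues / (n_analogues + POOL_K)       # thin precedent -> small w
    share_pooled = w * share + (1.0 - w) * class_mean_share  # assumed linear blend
    after_pooling = pc * share_pooled
    final = min(after_pooling, pc)                 # ceiling: never above the class
    return {"p_class": pc, "variant_share": share, "w": w,
            "share_pooled": share_pooled, "after_pooling": after_pooling,
            "final": final}


steps = derive_probability(p0=0.20, lam_per_year=1.0, window_years=1.0,
                           n_analogues=6, class_mean_share=0.5)
for name, value in steps.items():
    print(f"{name:>14}: {value:.4f}")
```

With n = 6 the weight is exactly one half, so the published share sits midway between the analyst's call and the class mean.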
Methodology change (2026-07-02). Until this date the anchor was the mean assigned probability of scenarios sharing a (category, timeline): our own priors averaged, i.e. circular. It is replaced by the measured class rate above. Typical scenarios moved ±1–5pp; the ceiling binds nowhere in the current book (the book never claimed more than its class allows). Extremizing (was ×1.12) is retired the same day. The old behaviour is preserved in git history; this note is the public record of the change.

Same-day refinement (2026-07-02, founder review). Three class-assignment defects fixed: (1) the credit root-shock mapping was unsigned, so bullish spread-tightening scenarios landed in the credit-stress class; it is now sign-aware (likewise for trade-tension/recession/pandemic). (2) Tag-vocabulary equivalence: scenario tags and event tags drifted (events say banking_crisis where scenarios say credit), making some classes look far rarer than they are; class rates for drifted tags are now measured over the union of equivalent event tags (members listed in base_rates.json; credit λ 0.87→4.84/yr). (3) Scheduled catalysts (index go-lives, halvings, expiries, effective dates) are not Poisson arrivals: occurrence is calendar-certain, so the class shows as "scheduled" and the probability is the market-impact call, which stays editorial and labeled.
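To make the tag-union fix and the class-rate interval concrete, here is a rough sketch. It assumes events are plain lists of tags, counts a class directly as k events over T years (the same as share × λ_all before smoothing), applies the Jeffreys +0.5 to every tag rather than only thin ones, and uses the textbook Wilson–Hilferty approximation for the Gamma quantiles. The equivalence map and function names are hypothetical.

```python
import math
from statistics import NormalDist

# Hypothetical equivalence map; the real members are listed in base_rates.json.
EQUIVALENT = {"credit": {"credit", "banking_crisis"}}


def count_class_events(events, scenario_tag):
    """Count events carrying any tag equivalent to the scenario tag."""
    members = EQUIVALENT.get(scenario_tag, {scenario_tag})
    return sum(1 for tags in events if members & set(tags))


def gamma_quantile_wh(shape, rate, p):
    """Wilson-Hilferty approximation to a Gamma(shape, rate) quantile."""
    nu = 2.0 * shape                   # Gamma(a, rate) equals chi2(2a) / (2 * rate)
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * nu)
    chi2 = nu * max(0.0, 1.0 - c + z * math.sqrt(c)) ** 3
    return chi2 / (2.0 * rate)


def class_rate(k_events, years, jeffreys=0.5):
    """Point rate per year plus an approximate 90% interval."""
    shape = k_events + jeffreys
    point = shape / years
    lo = gamma_quantile_wh(shape, years, 0.05)
    hi = gamma_quantile_wh(shape, years, 0.95)
    return point, lo, hi


events = [["banking_crisis"], ["credit", "risk_off"], ["oil_disruption"]]
k = count_class_events(events, "credit")      # both credit-equivalent events count
print(k, class_rate(k, years=2.0))
```

Without the union, the banking_crisis event would be missed and the credit class would look rarer than the library says it is.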
Stated plainly, because "live" claims rot:
- … p_class = 1 − exp(−λ·window) shifts → every scenario in that class re-derives at the next daily build. News moves the measured base rate, never a sentiment dial.
- … synthetic), anchored to today's estimate, not a live re-estimation history. Real daily probability snapshots are on the roadmap (walk-forward phase).

Event-study abnormal returns (MacKinlay; Kothari–Warner), shrunk toward what we actually measured.
Inputs. The cascade reaction-function prior expected_pct (the propagation-graph
number), and — from build_history.py — the measured abnormal
return AR = r − (α + β·r_SPX) averaged (recency- and similarity-weighted) over this
scenario's historical analogues, with its sample size n, directional hit_rate, and
event-study confidence.
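A small sketch of the abnormal-return input, for orientation only. The market-model formula is the one stated above; the weighting scheme (recency multiplied by similarity) and all numbers are assumptions made for the example, not what build_history.py necessarily does.

```python
def abnormal_return(r, r_spx, alpha, beta):
    """AR = r - (alpha + beta * r_SPX): the move not explained by the market."""
    return r - (alpha + beta * r_spx)


def weighted_mean_ar(ars, weights):
    """Weighted average of per-analogue abnormal returns."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(a * w for a, w in zip(ars, weights)) / total


# Each analogue: asset return, SPX return (both in %), plus hypothetical
# recency and similarity scores in [0, 1].
analogues = [
    {"r": -5.0, "r_spx": -2.0, "recency": 1.0, "similarity": 0.5},
    {"r": -7.0, "r_spx": -2.0, "recency": 0.5, "similarity": 1.0},
]
alpha, beta = 0.0, 1.5  # illustrative market-model coefficients

ars = [abnormal_return(a["r"], a["r_spx"], alpha, beta) for a in analogues]
weights = [a["recency"] * a["similarity"] for a in analogues]
print(ars, weighted_mean_ar(ars, weights))  # [-2.0, -4.0] -3.0
```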
Build (stored per market as impact):
- reliability = 0.5·min(n,20)/20 + 0.3·max(0, 2·hit_rate−1) + 0.2·confidence: a weighted sum of sample size, consistency, and significance, in [0,1].
- … 1 − reliability (few/noisy analogues collapse to "no reliable move"; James–Stein / empirical-Bayes intuition).
- … mu.
- … tail figure (realized macro moves run 2–4× the median).

Surfaced on the mindmap reasoning panel as "hist A–B%" next to each projected move.
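The 0.5 / 0.3 / 0.2 score in the build above, which the following step refers to as reliability, can be written out directly. The function name is an assumption, and the shrinkage step that consumes the score is not reproduced because its exact form is not recoverable from this text.

```python
def reliability(n: int, hit_rate: float, confidence: float) -> float:
    """Score in [0, 1] from sample size, directional consistency, and significance."""
    size = min(n, 20) / 20                        # saturates at 20 analogues
    consistency = max(0.0, 2.0 * hit_rate - 1.0)  # a 50% hit rate earns nothing
    return 0.5 * size + 0.3 * consistency + 0.2 * confidence


print(reliability(n=10, hit_rate=0.75, confidence=0.5))
print(reliability(n=3, hit_rate=0.50, confidence=0.1))
```

The consistency term is floored at zero, so a hit rate at or below a coin flip contributes nothing, however many analogues there are.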
Two ledgers, because there are two numbers. We hold ourselves to both.
Probability → proper scoring on the resolved-forecast ledger (predictions_log.json →
build_scorecard.py): one-sided Brier (0 perfect, 0.25
no-skill, 1 confident-wrong), Murphy reliability/resolution decomposition, Brier skill
score vs base rate and vs the crowd, calibration-by-bucket with Wilson bands.
Impact % → the magnitude backtest (impact_accuracy.json): across every
scenario×asset cell that has both a published move and a measured analogue move, we score the
published % against history with MAE / RMSE / bias / directional hit-rate / interval
coverage and a skill score vs the naive "no-move" baseline (CRPS for a point forecast
reduces to MAE — Gneiting–Raftery). Broken out by confidence band and by direction
agreement. This is in-sample consistency — does what we publish match what history shows —
not a forward test; the forward, out-of-sample magnitude record accrues in the predictions
ledger as real events resolve.
Both are public at /reality-check.
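For readers who want the probability ledger's headline numbers in code, a minimal sketch of the Brier score and the Brier skill score against a reference forecast follows. The function names are illustrative; the Murphy decomposition and Wilson bands are not shown.

```python
def brier(probs, outcomes):
    """Mean squared gap between forecast probability and the 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill(probs, outcomes, ref_probs):
    """1 - BS / BS_ref: positive means the forecast beat the reference."""
    ref = brier(ref_probs, outcomes)
    if ref == 0:
        raise ValueError("reference forecast is perfect; skill is undefined")
    return 1.0 - brier(probs, outcomes) / ref


outcomes = [1, 0, 0, 0]
forecast = [0.8, 0.1, 0.1, 0.1]
base_rate = [sum(outcomes) / len(outcomes)] * len(outcomes)  # climatology reference

print(brier([0.5, 0.5], [1, 0]))                 # 0.25, the no-skill mark
print(brier_skill(forecast, outcomes, base_rate))
```

Swapping base_rate for the crowd's probabilities gives the skill score vs the crowd with the same function.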
Founding read (Jun 2026). Over 209k %-cells the published impact was MAE ≈ 3.6pp, direction ≈ 50% (a coin flip), skill ≈ 0 vs "no move", and direction did not improve with our stated confidence. That is the honest bar we are now measured against — and exactly why the impact engine now shrinks each published number toward the measured record, and why the learn loop exists.
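The quantities quoted in that read (MAE, direction, skill vs "no move") can be sketched as follows. This is an illustration with made-up cells, not the code behind impact_accuracy.json; it assumes the "no move" baseline forecasts 0% for every cell and that a measured move of exactly zero counts as a direction miss.

```python
import math


def impact_metrics(published, measured):
    """Score published % moves against measured analogue moves, cell by cell."""
    n = len(published)
    errs = [p - m for p, m in zip(published, measured)]
    mae = sum(abs(e) for e in errs) / n
    rmse = math.sqrt(sum(e * e for e in errs) / n)
    bias = sum(errs) / n                               # > 0 means we overstate
    hits = sum(1 for p, m in zip(published, measured) if p * m > 0)
    mae_naive = sum(abs(m) for m in measured) / n      # "no move" forecasts 0
    skill = 1.0 - mae / mae_naive if mae_naive > 0 else float("nan")
    return {"mae": mae, "rmse": rmse, "bias": bias,
            "hit_rate": hits / n, "skill_vs_no_move": skill}


print(impact_metrics(published=[2.0, -1.0, 3.0], measured=[1.0, 1.0, 4.0]))
```

A skill near zero, as in the founding read, means the published numbers were no closer to history than forecasting no move at all.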
bin/recalibrate.py reads both ledgers and proposes corrections
(non-destructive — a human reviews, then applies via scenario_overrides.yaml /
propagation.py, mirroring the existing apply_probs flow):
- … logit(p) ← a + b·logit(p) fit on the resolved set; and where the history-derived probability diverges from the assigned prior on thin precedent, flag the scenario for review.

Each cycle: derive → publish with its basis + band → score on resolution → propose corrections → review & apply → re-derive. The gap to reality is the objective; this loop closes it.
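One way to fit the logit recalibration logit(p) ← a + b·logit(p) is a two-parameter logistic regression of outcomes on logit(p). The sketch below uses plain Newton steps and assumes the resolved set is not perfectly separable; the function names and the toy ledger are hypothetical, and recalibrate.py may fit it differently.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_recalibration(probs, outcomes, iters=50):
    """Maximum-likelihood (a, b) for outcome ~ sigmoid(a + b * logit(p))."""
    xs = [logit(min(max(p, 1e-6), 1.0 - 1e-6)) for p in probs]  # clamp 0 and 1
    a, b = 0.0, 1.0                                # start at "no correction"
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, outcomes):
            q = sigmoid(a + b * x)
            w = q * (1.0 - q)
            g_a += y - q                           # gradient of the log-likelihood
            g_b += (y - q) * x
            h_aa += w                              # negative Hessian entries
            h_ab += w * x
            h_bb += w * x * x
        h_aa += 1e-9                               # tiny guard against a singular system
        h_bb += 1e-9
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det       # Newton step via the 2x2 inverse
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def recalibrate(p, a, b):
    return sigmoid(a + b * logit(p))


# Toy resolved ledger: 90% calls came true 7 times in 10, 10% calls 3 times in 10.
probs = [0.9] * 10 + [0.1] * 10
outcomes = [1] * 7 + [0] * 3 + [1] * 3 + [0] * 7
a, b = fit_recalibration(probs, outcomes)
print(round(a, 3), round(b, 3), round(recalibrate(0.9, a, b), 3))
```

A fitted b below 1 says the book was overconfident and pulls forecasts toward 50%; b above 1 would push them outward.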
Reference-class / outside view (Kahneman; Tetlock GJP); Beta-Binomial conjugate base rate (Jeffreys pseudo-counts); James–Stein / empirical-Bayes shrinkage; logarithmic opinion pool + extremizing (Satopää et al.; Neyman–Roughgarden on when to (anti-)extremize); Cromwell's rule & imprecise probabilities (Knightian uncertainty); event-study abnormal returns (MacKinlay 1997; Kothari–Warner 2007); Kish effective sample size; CRPS / pinball / interval score / coverage (Gneiting–Raftery 2007; Hyndman & Athanasopoulos); MAPE pitfalls → MAE/MASE (Hyndman & Koehler 2006); skill scores (Murphy 1988). Full URLs in the build-research notes.