◎ Reality Check

← Front page · Command center · Live Monitor · Scenario Lab

The only score that matters: how close to reality are we? Two numbers per scenario — will it happen, and how far each market moves — both checked against what actually happened. Wins and losses, in public. Nothing cherry-picked.

Loading the scorecard…

Calibration — when we say 70%, does it happen ~70% of the time?

This is the chart that will convict or acquit us. Every resolved call adds a dot; over time the bars either sit on the line or they don't. We can't edit it — resolutions are hash-chained.

Bars = the share that actually came true (accent) vs the average probability we assigned (gold tick) in each band. On the line = perfectly calibrated. Small bands are noisy — counts shown at right. Recipe: methodology.

Cascade backtest · our starting baseline — not a track record

…

Across every (scenario × asset) cell with both a forward cascade direction and a measured historical abnormal return: how often they point the same way. ~50% is a coin flip — we publish it anyway. It's the baseline the learning loop has to beat, in public. Divergence is where the model earns or loses its keep; the forecast log above is the live, forward record.

The fix-log · what our own scoring caught, and what we're doing about it

We grade ourselves before anyone else can. Every defect our scoring battery finds is listed here with its fix — newest first.

2026-07-03 · Daily refresh wasn't reaching prod — FIXED. The 07:11 rebuild re-derived every probability daily but its deploy step still used a retired CLI path and failed silently — the site only updated on manual pushes. Deploy switched to the sanctioned git-push path; the methodology now states every cadence plainly, including the one that matters: news moves probabilities only via verified events → measured class rates, never a sentiment dial.
2026-07-02 · Class over-cutting — FIXED same day (founder review). The new measured classes over-cut three kinds of scenarios: bullish credit-easing calls mis-tagged into the credit-stress class (unsigned root mapping), classes that looked rare only because scenario and event vocabularies drifted (credit λ 0.87→4.84/yr once unioned with banking_crisis), and calendar-certain catalysts (index go-lives, halvings) treated as random arrivals. All three fixed and re-derived; the methodology note records the details.
2026-07-02 · Circular base rate — FIXED. The probability anchor used to be the mean of our own authored priors (opinions averaged, dressed as a base rate). Now the class rate is measured: decade-normalized event frequency from the dated, sourced library (the full table is public), with a hard cap — no scenario can claim more than its class. The editorial part (the variant's share within its class) is labeled as editorial in every waterfall. Extremizing retired (×1.0) until our own resolved ledger justifies it.
2026-07-02 · Interval bands too narrow. Our "90%" impact bands caught reality only ~45% of the time. Fix in build: bands re-derived from analogue dispersion, tuned to real 90% coverage.
2026-07-02 · Confidence score is anti-predictive. Highest-confidence impact calls were the least accurate band (42% direction vs 55% mid). Fix in build: confidence is being replaced by an empirically-fitted reliability table; until then, don't weight by it.
2026-07-02 · Impact bias −0.75pp. We systematically under-stated moves by ~0.75 points. The signed-mechanism conditioning fix (aligned-only headline + published counter-case) moved the battery to +0.12pp, direction 49.9%→58.8%, skill −1.1%→+2.2%.
2026-07-02 · Probability badge over-claimed. The old label implied the number was a measured frequency when its anchor was still our library's own prior. Relabeled honestly the same day; superseded by the measured base-rate engine above.

The ledger · every call, on the record

…

Method: one-sided Brier score in [0,1] (0 = perfect, 0.25 = no-skill coin-flip, 1.0 = confident-and-wrong), with the Murphy reliability/resolution decomposition and a Brier-skill-score vs the base rate and vs the crowd. Scheduled-certain calls (a report lands, a market is closed) are shown but excluded from the skill score. Grounded in Tetlock's Good Judgment Project, Murphy (1973), and Metaculus track-record practice.

Calibration — when we say 70%, does it happen ~70% of the time?

Cascade backtest · our starting baseline — not a track record

Impact accuracy · our own grade, published first

The fix-log · what our own scoring caught, and what we're doing about it

The ledger · every call, on the record