This document is the operator's truth-table: what's real, what's fragile, what's a calibrated guess, what would break first under production load. Written after Phase 5 ship — every numbered item below is a known limitation, not aspirational. No marketing here.
| Capability | Status | Backing |
|---|---|---|
| Real macro inputs (DFII10, VIXCLS, BAMLH0A0HYM2, T10Y2Y, DTWEXBGS, FEDFUNDS, DGS30) | ✓ Real | 2,781 obs from FRED, 2024-05 → 2026-05 |
| Real daily bars for SPX, NDX, CL, BRENT, EURUSD, USDJPY, DXY, BTC, ETH | ✓ Real | FRED daily series, close-only (synthesized OHLC = close) |
| HMM regime classifier on real 730-day FRED feature matrix | ✓ Real | Converged, persisted, Reflation 84% posterior |
| Heuristic regime fallback | ✓ Real | Rule-based on 7 macro inputs incl. DGS30 |
| Butterfly cascade engine (15 canonical events) | ✓ Real code | Magnitudes are human priors, not calibrated — see §3 |
| Hypothesis registry with hash-chained tamper-evident log | ✓ Real | Verified across 4 ticks + close + tamper-detection test |
| Auto-event tick + monitor (5-min cron-able) | ✓ Real | Demonstrated: war → ceasefire → invalidation chain |
| Cross-asset MCPT — cvd_trend PASSES on SPX/NDX/BTC/ETH/USDJPY | ✓ Real result | 100-perm MCPT, p ≤ 0.05 on 5 of 9 assets |
| Portfolio optimizer combining hypothesis views | ✓ Real | BL-lite (see §5 for what's missing vs full BL) |
| 189 tests passing | ✓ Real | Includes 19 cascade/hypothesis + 13 strategy/MCPT/portfolio |
| File locking via fcntl for write contention | ⚠ Partial | Engine state has locks; hypothesis log doesn't (see §4.1) |
These work end-to-end the moment bin/bootstrap_data.py --all runs on a host with normal internet.
- _gen_bars_beta beta-anchored synthesis on real NDX/BTC/DXY. Beta values are reasonable midpoints but uncalibrated.
- bin/refresh_fees_from_hl.py runs on the Mac.
- yields.llama.fi blocked from sandbox. Pulls cleanly on Mac.

This is the biggest honest gap. None of the following have been validated against empirical event-window regression. They are human-chosen midpoints.
The YAML config/event_ontology.yaml declares:

    war_outbreak_major:
      primary:
        - { asset: "XAU", expected_pct_move: +4.0, half_life_days: 21, confidence: 0.80 }

Where does +4.0% come from? It's "what gold typically did after Russia/Ukraine, Hamas/Israel, etc." But:
- Russia/Ukraine (Feb 2022): XAU spiked 5% in 2 weeks, gave it all back in 6 weeks
- Hamas/Israel (Oct 2023): XAU flat for first 3 days, then +6% in next 2 weeks
- Past 9 major war outbreaks 1990-2023: median XAU return at +21d = +3.2%, but standard deviation = ±5.5%

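We declare the median and ignore the variance. With a standard deviation larger than the median, a realized move of the opposite sign sits well within one standard deviation, and by definition half of all realized moves fall short of the declared magnitude. If this engine sized positions to the magnitude alone, it would be oversized in at least half of the cases. Real fix: store (median, sd) per impact and size to (median − 0.5×sd) for adversarial protection.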
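A minimal sketch of that haircut, using the XAU figures quoted above (median +3.2%, sd 5.5%). The function name `haircut_magnitude`, the `k` parameter, and the floor at zero are assumptions for illustration; nothing like this exists in the codebase yet.

```python
def haircut_magnitude(median_pct: float, sd_pct: float, k: float = 0.5) -> float:
    """Shrink a declared move by k standard deviations before sizing.

    The sign of the median is kept as the trade direction. The haircut only
    shrinks the magnitude toward zero; it never flips the direction
    (the zero floor is an assumption, not something the ontology specifies).
    """
    direction = 1.0 if median_pct >= 0 else -1.0
    shrunk = abs(median_pct) - k * sd_pct
    return direction * max(shrunk, 0.0)


# XAU after a major war outbreak: median +3.2%, sd 5.5%
xau_sizing_move = haircut_magnitude(3.2, 5.5)
print(round(xau_sizing_move, 2))  # 3.2 - 0.5 * 5.5 = 0.45
```

On these numbers the sizing magnitude drops from +4.0% (the YAML value) to +0.45%, which shows how much of the declared move is unsupported once variance is taken into account.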
confidence: 0.80 is also a guess. We have no historical data backing "80% confidence the cascade fires as described." This influences position sizing through the View score. Currently un-validated.
The decay parameter half_life_days controls position TTL. Different events have wildly different actual half-lives (Fed pivots: weeks; Trump tweets: hours unless reinforced). The YAML assigns reasonable defaults but no event has been backtested for half-life calibration.
NVDA β=1.6 × NDX is a typical historical value, but it is hard-coded as a constant, while NVDA's realized beta to NDX has ranged from 1.2 to 2.4 over 2022-2025. Production calibration needs a rolling 60-day beta refresh.
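One way the rolling refresh could look, as a pure-Python sketch. It assumes two aligned lists of daily returns; `rolling_beta` is a hypothetical helper, not a function in the current code.

```python
from statistics import fmean


def rolling_beta(asset_rets, index_rets, window=60):
    """Trailing OLS beta of asset returns on index returns.

    Returns a list aligned with the inputs. Entries are None until a full
    window is available, or when the index has zero variance in the window.
    """
    if len(asset_rets) != len(index_rets):
        raise ValueError("return series must have equal length")
    betas = [None] * len(asset_rets)
    for end in range(window, len(asset_rets) + 1):
        a = asset_rets[end - window:end]
        m = index_rets[end - window:end]
        mean_a, mean_m = fmean(a), fmean(m)
        cov = sum((x - mean_a) * (y - mean_m) for x, y in zip(a, m))
        var = sum((y - mean_m) ** 2 for y in m)
        betas[end - 1] = cov / var if var > 0 else None
    return betas


# Toy check: an asset that is exactly 1.6x the index has beta 1.6 in every window.
index = [0.01, -0.02, 0.015, 0.0, -0.005, 0.02]
asset = [1.6 * r for r in index]
print(rolling_beta(asset, index, window=3))
```

The latest non-None entry would replace the constant 1.6 at each refresh. This version recomputes each window from scratch, which is fine for a daily job but not for an inner backtest loop.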
HL_TAKER_FEE_BPS = 4.5 is from the original design phase. HL changed fees in May 2025, and the current real taker fee is 2.5 bps. bin/refresh_fees_from_hl.py is shipped but hasn't run yet because HL is blocked from sandbox. Until it runs, the taker-fee input to every cost-of-trade estimate in MacroGuru is 80% too high (4.5 / 2.5 = 1.8×).
HypothesisRegistry._append opens the log in append mode but doesn't take an fcntl lock. If bin/event_tick.py is running while operator runs bin/whatif.py --persist, the appends interleave and the chain breaks. Single-writer assumption is implicit. Fix is straightforward (add fcntl.flock(LOCK_EX)) but not done.
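A sketch of the locked append. One detail matters for a hash chain: the lock has to cover reading the previous hash as well as the write, otherwise two writers can both chain onto the same predecessor. The record layout (`prev_hash`, `payload`, `own_hash`), the all-zero genesis hash, and the function name `append_chained` are assumptions for illustration; only `own_hash` is a name taken from the existing log. `fcntl` is Unix-only, which matches the Mac/Linux hosts described here.

```python
import fcntl
import hashlib
import json
import os


def append_chained(path, payload):
    """Append one hash-chained JSON record under an exclusive advisory lock."""
    with open(path, "a+", encoding="utf-8") as fh:
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX)  # blocks until other writers finish
        try:
            # Read the tail INSIDE the lock so prev_hash cannot go stale.
            fh.seek(0)
            lines = fh.read().splitlines()
            prev_hash = json.loads(lines[-1])["own_hash"] if lines else "0" * 64
            body = json.dumps(payload, sort_keys=True)
            own_hash = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
            record = {"prev_hash": prev_hash, "payload": payload, "own_hash": own_hash}
            fh.write(json.dumps(record, sort_keys=True) + "\n")  # "a+" always writes at EOF
            fh.flush()
            os.fsync(fh.fileno())  # make the record durable before releasing the lock
        finally:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
    return own_hash
```

Both bin/event_tick.py and bin/whatif.py --persist would have to go through the same function for the lock to help, since `flock` is advisory and does nothing against a writer that skips it.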
Every list_active() reads the entire log. At 1,000 hypotheses (~ a year of 5-min ticks at 1 hypothesis per few hours) the file is ~10 MB and replay takes ~1 sec. At 100k it's a problem. Need snapshot mechanism: periodically dump the current active set to a snapshot file, replay only the suffix since the snapshot.
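The snapshot idea in miniature. Events are modelled as dicts with hypothetical `id` and `status` keys and the log as an in-memory list; the real log format and the real `list_active()` signature are not reproduced here.

```python
def replay(events, active=None):
    """Fold status events into the active set; later events win."""
    active = dict(active or {})
    for ev in events:
        if ev["status"] == "OPEN":
            active[ev["id"]] = ev
        else:
            active.pop(ev["id"], None)  # any closed status removes the hypothesis
    return active


def take_snapshot(events):
    """Snapshot = (number of log entries covered, active set at that point)."""
    return len(events), replay(events)


def list_active(events, snapshot=None):
    """Replay only the suffix written after the snapshot, if one exists."""
    if snapshot is None:
        return replay(events)
    offset, active = snapshot
    return replay(events[offset:], active)


log = [
    {"id": "h1", "status": "OPEN"},
    {"id": "h2", "status": "OPEN"},
]
snap = take_snapshot(log)
log += [{"id": "h1", "status": "EXPIRED"}, {"id": "h3", "status": "OPEN"}]
print(sorted(list_active(log, snap)))  # same result as a full replay
```

On disk the offset would be a byte position rather than a list index, and the snapshot should also record the own_hash at that position so chain verification can still start from the snapshot.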
If an attacker (or a buggy admin) rewrites the entire log with a new self-consistent chain, our verifier can't tell. The chain proves internal consistency, not origin. Production fix: periodically anchor the latest own_hash to an external timestamp service, or sign it with an HMAC key held on the operator's hardware token.
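The HMAC variant could be sketched as below. The key is passed in as bytes purely for illustration; with a hardware token the key never leaves the device and the signing call goes through the token's own interface, which is not modelled here. `sign_head` and `verify_head` are hypothetical names.

```python
import hashlib
import hmac


def sign_head(own_hash: str, key: bytes) -> str:
    """HMAC-SHA256 over the latest chain hash."""
    return hmac.new(key, own_hash.encode("ascii"), hashlib.sha256).hexdigest()


def verify_head(own_hash: str, signature: str, key: bytes) -> bool:
    """Constant-time check that the signature matches this chain head."""
    return hmac.compare_digest(sign_head(own_hash, key), signature)


key = b"demo-key-not-for-production"
head = "ab" * 32                      # stand-in for the latest own_hash
sig = sign_head(head, key)
print(verify_head(head, sig, key))        # True
print(verify_head("cd" * 32, sig, key))   # False: a rewritten chain has a different head
```

A rewritten log would be internally consistent but end in a different own_hash, so a signature stored outside the log stops verifying.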
HypothesisRegistry.update_status accepts any status → any other. Should be a state machine: OPEN can only go to {INVALIDATED, EXPIRED, TAKEN_PROFIT, STOPPED_OUT, SUPERSEDED}, and closed states can't reopen. Today there's nothing stopping a buggy caller from "reopening" an invalidated hypothesis.
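A sketch of the transition guard, using the status names listed above. The `Status` enum and `check_transition` are hypothetical; the real `update_status` would call something like this before appending to the log.

```python
from enum import Enum


class Status(Enum):
    OPEN = "OPEN"
    INVALIDATED = "INVALIDATED"
    EXPIRED = "EXPIRED"
    TAKEN_PROFIT = "TAKEN_PROFIT"
    STOPPED_OUT = "STOPPED_OUT"
    SUPERSEDED = "SUPERSEDED"


# OPEN may move to any closed state; closed states are terminal (no entry here).
ALLOWED = {Status.OPEN: frozenset(Status) - {Status.OPEN}}


def check_transition(current: Status, new: Status) -> None:
    """Raise ValueError unless current -> new is a legal transition."""
    if new not in ALLOWED.get(current, frozenset()):
        raise ValueError(f"illegal transition {current.name} -> {new.name}")


check_transition(Status.OPEN, Status.INVALIDATED)      # fine
try:
    check_transition(Status.INVALIDATED, Status.OPEN)  # the "reopen" bug
except ValueError as exc:
    print(exc)
```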
price_threshold=None

In _build_hypothesis_from_cascade, we set direction but leave price_threshold=None because the executor is supposed to fill it in at position-open. The executor doesn't do this yet. So in the current code, PRICE_STOP_LOSS / PRICE_TAKE_PROFIT checks never fire: they short-circuit on the price_threshold is None check in HypothesisMonitor.evaluate_one. Time-decay and contradicting-event invalidation work; price-based stops are skeletal.
If regime moves from Goldilocks → Reflation (related-but-different), it counts as "flip" and invalidates. Should be a similarity matrix: Goldilocks↔Reflation = mild, Goldilocks↔Bust = sharp invalidation.
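A sketch of the similarity lookup. Only the two pairs named above are filled in, and the numeric scores, the 0.3 threshold, and the action labels are all placeholder assumptions. Unknown pairs default to 0.0, which reproduces today's invalidate-on-any-flip behaviour.

```python
# Symmetric similarity scores in [0, 1]; frozenset keys make the order irrelevant.
SIMILARITY = {
    frozenset({"Goldilocks", "Reflation"}): 0.7,  # related regimes: mild response
    frozenset({"Goldilocks", "Bust"}): 0.1,       # opposite regimes: sharp response
}


def flip_action(old: str, new: str, invalidate_below: float = 0.3) -> str:
    """Map a regime change to 'hold', 'decay_confidence', or 'invalidate'."""
    if old == new:
        return "hold"
    similarity = SIMILARITY.get(frozenset({old, new}), 0.0)
    return "invalidate" if similarity < invalidate_below else "decay_confidence"


print(flip_action("Goldilocks", "Reflation"))  # decay_confidence
print(flip_action("Goldilocks", "Bust"))       # invalidate
```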
"Trump starts war on inflation" would trigger war_outbreak_major because "starts war" stems to "start war" and the matcher finds it. Real production needs entity extraction (Trump-vs-Trump-rhetoric, war-on-thing vs war-with-country). No NER in current pipeline.
GDELT/RSS pull will return the same AP-wire story across 30 outlets. Our cascade currently counts each as a separate source and sqrt-boosts confidence. Need source-canonicalization (cluster by content hash or AP wire ID).
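A sketch of content-hash canonicalization ahead of the boost. The exact boost formula is not given above, so `base × sqrt(n)` with a cap is an assumption, as are all function names. Exact hashing after normalization only catches verbatim or near-verbatim copies; outlets that edit the wire text would need fuzzy clustering or the AP wire ID.

```python
import hashlib
import math
import re


def content_key(text: str) -> str:
    """Hash of the text with case, punctuation, and spacing normalized away."""
    normalized = re.sub(r"\W+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def independent_source_count(articles) -> int:
    """Number of distinct stories after collapsing identical wire copies."""
    return len({content_key(a) for a in articles})


def boosted_confidence(base: float, articles, cap: float = 0.95) -> float:
    """sqrt-boost on independent sources only (assumed form of the boost)."""
    n = independent_source_count(articles)
    return min(cap, base * math.sqrt(n)) if n else 0.0


articles = ["Ceasefire announced.", "CEASEFIRE   announced", "A different story"]
print(independent_source_count(articles))  # 2, not 3
```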
Full BL needs:
- Prior expected returns (market equilibrium implied — we'd derive from CAPM using a multi-asset market index)
- View matrix P (which assets each view is on)
- View confidence matrix Ω (uncertainty per view)
- Asset covariance matrix Σ (rolling 60-day from real returns)
- Solving the BL posterior: E[R] = ((τΣ)⁻¹ + PᵀΩ⁻¹P)⁻¹ ((τΣ)⁻¹π + PᵀΩ⁻¹q), where π is the prior expected-return vector and q is the vector of view returns
We do none of that math. Instead we score each view by |expected| × confidence × time-decay, allocate proportional to score, cap per-asset and gross. This works for the small-N case (<20 views) but doesn't honor cross-asset correlations.
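The scoring rule, roughly. This is a sketch under assumptions, not the shipped optimizer: time-decay is modelled as 0.5^(age / half_life_days), opposing views on the same asset are netted by sign, the per-asset cap of 10% is a placeholder, and the view dict keys are hypothetical. Only max_gross_pct = 0.30 comes from the actual config.

```python
def bl_lite_weights(views, max_gross=0.30, max_per_asset=0.10):
    """Allocate gross exposure proportional to expected x confidence x decay.

    Each view is a dict with keys: asset, expected_pct_move, confidence,
    half_life_days, age_days. Returns signed weights as fractions of NAV.
    """
    scores = {}
    for v in views:
        decay = 0.5 ** (v["age_days"] / v["half_life_days"])
        signed = v["expected_pct_move"] * v["confidence"] * decay
        scores[v["asset"]] = scores.get(v["asset"], 0.0) + signed
    total = sum(abs(s) for s in scores.values())
    if total == 0:
        return {asset: 0.0 for asset in scores}
    weights = {}
    for asset, s in scores.items():
        w = max_gross * s / total                      # proportional share of gross
        weights[asset] = max(-max_per_asset, min(max_per_asset, w))  # per-asset cap
    return weights


views = [
    {"asset": "XAU", "expected_pct_move": 4.0, "confidence": 0.8,
     "half_life_days": 21, "age_days": 0},
    {"asset": "BTC", "expected_pct_move": -2.0, "confidence": 0.5,
     "half_life_days": 10, "age_days": 0},
]
print(bl_lite_weights(views))
```

Capping only ever shrinks a weight, so gross exposure stays at or below max_gross. Nothing in this function looks at how the assets co-move.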
Practical consequence: if 5 views all say "long crypto" (BTC, ETH, SOL, HYPE, AVAX), the optimizer sizes each individually. They're all 90% correlated. The true risk is one big crypto-beta bet at 50% gross, not five 10% bets. Production needs Σ.
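The arithmetic behind that claim, as a sketch. It assumes equal unit volatility for all five assets and a uniform 0.9 pairwise correlation, which are simplifications; `portfolio_vol` is a hypothetical helper.

```python
import math


def portfolio_vol(weights, vols, pairwise_corr):
    """Portfolio volatility under a single uniform pairwise correlation."""
    variance = 0.0
    for i, (wi, si) in enumerate(zip(weights, vols)):
        for j, (wj, sj) in enumerate(zip(weights, vols)):
            rho = 1.0 if i == j else pairwise_corr
            variance += wi * wj * si * sj * rho
    return math.sqrt(variance)


five_bets = [0.10] * 5
unit_vol = [1.0] * 5
print(round(portfolio_vol(five_bets, unit_vol, 0.9), 3))  # 0.48
print(round(portfolio_vol(five_bets, unit_vol, 0.0), 3))  # 0.224
print(round(portfolio_vol([0.50], [1.0], 0.0), 3))        # 0.5
```

Five 10% positions at 0.9 correlation carry about 96% of the risk of a single 50% position, and more than twice the risk the optimizer would implicitly assume if it treated them as independent.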
The full plan said CVaR_95 ≤ X% of NAV as a constraint. Implementing that needs:
- Monte Carlo simulation of joint asset returns
- 95th percentile loss computation
- QP solver respecting that as a constraint

We have: max_gross_pct = 0.30. Crude but bounded.
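For scale, the tail-loss step on its own is small. CVaR_95 is the average of the losses beyond the 95th percentile, so given simulated portfolio losses it is a sort and a mean. The sketch below covers only that step; the joint-return simulation and the QP solver are not shown, and both function names are hypothetical.

```python
def cvar(losses, alpha=0.95):
    """Mean of the worst (1 - alpha) share of losses (losses as positive numbers)."""
    if not losses:
        raise ValueError("need at least one simulated loss")
    tail_n = max(1, round((1 - alpha) * len(losses)))
    tail = sorted(losses)[-tail_n:]
    return sum(tail) / tail_n


def within_cvar_limit(losses_pct_nav, limit_pct_nav, alpha=0.95):
    """True when the simulated CVaR respects the limit, both in % of NAV."""
    return cvar(losses_pct_nav, alpha) <= limit_pct_nav


# 100 simulated losses of 1%..100% of NAV: the worst 5 are 96..100, mean 98.
simulated = list(range(1, 101))
print(cvar(simulated))  # 98.0
```

The hard part is the other two bullets: a believable joint distribution needs Σ, which is the same missing piece as in the BL-lite section.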
factor_exposure_decomposition uses a Python dict mapping {"crypto": {"BTC", "ETH", ...}}. A real factor model would use a regression beta matrix where each asset has loadings on multiple factors (a tech equity has equity_beta + size + momentum factor exposures simultaneously). Today every asset is one bucket only.
flow_trend_ls_slow exists with hand-picked z_entry=0.7 (vs the original 1.5). Real production needs:
- Grid search across (fast_period, slow_period, z_entry) per asset
- Walk-forward validation (not in-sample fit)
- MCPT on the BEST parameter set per asset
We didn't do this. The slow variant exists only to demonstrate the framework; its parameters are picked by intuition.
The perpsim standard is 500 permutations. With 100, the minimum reportable p-value is 1/101 = 0.0099. All our "PASS" cells bottomed out at exactly 0.0099. We don't know whether real 500-perm tests would still pass. Production: rerun at 500+ on the Mac. Also: backtest loops re-compute SMA200 / CVD from scratch on every bar, O(n²) per backtest. Production needs rolling caches.
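The 0.0099 floor follows from the usual permutation p-value, (1 + number of permutations at least as good as the real run) / (1 + number of permutations). The sketch assumes that is the formula in use, which the 1/101 figure implies but the code has not been checked against.

```python
def mcpt_p_value(real_stat, perm_stats):
    """One-sided permutation p-value with the +1 correction."""
    at_least_as_good = sum(1 for s in perm_stats if s >= real_stat)
    return (at_least_as_good + 1) / (len(perm_stats) + 1)


# A real result that beats every permutation hits the floor for that run size.
print(round(mcpt_p_value(10.0, [0.0] * 100), 4))  # 0.0099
print(round(mcpt_p_value(10.0, [0.0] * 500), 4))  # 0.002
```

A cell sitting exactly at 0.0099 only says "beat all 100 permutations". At 500 permutations the same strategy could land anywhere from 0.002 upward, depending on how many of the extra 400 paths beat it.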
7-day blocks preserve within-week microstructure but break month-end / quarter-end / fed-meeting clustering. A strategy that exploits the first-of-month flow effect would fail MCPT spuriously (because the permuted paths break that clustering). Production fix: use multiple block lengths (7d, 21d, 90d) and average the p-values.
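A sketch of the multi-block-length variant. It permutes a daily return series in contiguous blocks and scores each permuted path with a caller-supplied `stat_fn`; in the real pipeline that step would be a full strategy backtest on permuted bars. All names are hypothetical, and the block lengths are the three proposed above.

```python
import random


def block_permute(returns, block_len, rng):
    """Shuffle contiguous blocks; the order inside each block is preserved."""
    blocks = [returns[i:i + block_len] for i in range(0, len(returns), block_len)]
    rng.shuffle(blocks)
    return [r for block in blocks for r in block]


def averaged_p(real_stat, returns, stat_fn, block_lens=(7, 21, 90), n_perm=500, seed=0):
    """Run one permutation test per block length and average the p-values."""
    rng = random.Random(seed)
    p_values = []
    for block_len in block_lens:
        hits = sum(
            1 for _ in range(n_perm)
            if stat_fn(block_permute(returns, block_len, rng)) >= real_stat
        )
        p_values.append((hits + 1) / (n_perm + 1))
    return sum(p_values) / len(p_values)


# Sanity check: a statistic that ignores ordering (a plain sum) can never
# beat its own permutations, so every p-value is 1.0.
returns = list(range(200))
print(averaged_p(sum(returns), returns, sum, n_perm=20))  # 1.0
```

Longer blocks keep month-end and quarter-end structure intact at the cost of fewer distinct permutations, so the 90-day test needs a long history to be meaningful.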
In rough order of likelihood:
HL validators rewriting historical funding/marks during a tail event (JELLY 2025 precedent). Our cache assumes immutability of historical reads. A subsequent re-fetch could return different values for the same timestamp. Fix: hash every cached read and reject when the hash changes silently.
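A sketch of the digest check. Keys and payload shapes are hypothetical, and the digests live in memory here, while a real cache would persist them next to the cached data.

```python
import hashlib
import json


def payload_digest(payload) -> str:
    """Stable SHA-256 over a JSON-serializable payload."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class HistoricalReadGuard:
    """Remember the digest of each historical read; reject silent changes."""

    def __init__(self):
        self._digests = {}

    def check(self, key, payload):
        digest = payload_digest(payload)
        first_seen = self._digests.setdefault(key, digest)
        if first_seen != digest:
            raise ValueError(f"historical data changed for {key!r}")
        return payload


guard = HistoricalReadGuard()
key = ("funding", "BTC", "2025-03-26T00:00Z")      # illustrative key only
guard.check(key, {"rate": 0.0001})                 # first read: recorded
guard.check(key, {"rate": 0.0001})                 # identical re-fetch: accepted
try:
    guard.check(key, {"rate": 0.0003})             # rewritten history: rejected
except ValueError as exc:
    print(exc)
```

Rejecting is the safe default. Whether to then keep the first-seen value or quarantine the series is an operator decision this sketch does not make.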
Concurrent hypothesis writes (operator + cron both running whatif --persist at once). Chain breaks. Recovery requires manual log rewrite. Fix is one fcntl.flock.
A genuinely novel event (e.g. AI lab announces it's solving protein-folding and biotech blows up 200%) that doesn't match any keyword. We emit unknown_event and the cascade does nothing. The market does enormous work. We miss it entirely.
Cascade magnitude calibration breaking on a fat-tail event. Russia/Ukraine actually moved gold +12% in the first 3 days, then -6% — our YAML says median +4% which would have severely undersized the long-gold position in the first 72h.
MCPT p-values drifting under regime change. A strategy that passed MCPT on 2024-2026 data may fail completely in 2027 if the macro regime flips. No auto-refresh of MCPT pass/fail status.
Yahoo/Stooq data going down or changing format silently. We have FRED-only ingest in sandbox; on Mac, we depend on free data sources that have already broken once (yfinance Feb 2025).
Stop-loss not firing because executor never set the price_threshold (§4.5). This is a real bug, not a corner case. Any open hypothesis where the linked positions move against the thesis will hold to time-decay (3× horizon) rather than stopping out.
Operator confusion about which dashboard is which. Even with the cross-linking and index.html, there are now 4 HTML pages (index, dashboard, whatif, MACRO_GURU_dashboard) plus DATA_QUALITY.md. Operator drift is real.
189 tests pass. Things they don't cover:
- event_tick.py → portfolio.py → paper_trade_1m.py. Each component is unit-tested in isolation.
- whatif.html or index.html.
- bin/event_tick.py script but no test of the launchd plist or crontab.
- raise NotImplementedError.

These were on the roadmap but explicitly deferred:
HL broker LIVE submit — HyperliquidBroker.submit() raises NotImplementedError. 1-2 weeks of careful engineering + 2 weeks of paper observation before flipping. Until this exists, MacroGuru is observation-only.
DeFi yield LIVE deposit — per-protocol supply / withdraw not implemented. Same story.
HIP-4 binary execution — placeholders only.
Real NLP entity extraction — keyword + light-stem matching, no NER. Phase 6+.
LLM-augmented cascade reasoning — current cascade walk is deterministic graph traversal. An LLM call could reason about novel events the YAML doesn't cover. Not built; would need careful prompt control.
Continuous calibration loop — no machinery to take the outcome of closed hypotheses and update the ontology magnitudes based on what actually happened. We'd need to track "predicted vs realized" per asset per hypothesis and apply Bayesian update to YAML priors.
Multi-strategy MCPT optimization — we test (strategy, asset) but don't search over strategy parameter space. Real edge mining needs walk-forward grid search.
In priority order:
- fcntl.flock on hypothesis log writes.
- median - 0.5×sd for adversarial protection.
- bin/refresh_fees_from_hl.py on Mac, port HL's current 2.5/1.0 bps fee schedule into cost_model.py.
- unknown_event, optionally call an LLM with the event ontology in-context and ask it to map the input. Gate carefully.

To be explicit, MacroGuru cannot today:
- submit() is unimplemented.

Bottom line: the architecture is correct, the wiring is correct, 189 tests pass. The CALIBRATION is the open frontier. Until §3.1 is done (empirical event-window regression backing the magnitudes), every position MacroGuru opens is acting on prior beliefs, not posterior knowledge.
Operator's first priority should be: before running this live with real money, complete §3.1 calibration on at least the top 5 event classes, and fix §4.5 so stops actually fire.
Generated 2026-05-22 after Phase 5 ship.