HONEST_LIMITS.md — adversarial review of MacroGuru, 2026-05-22

This document is the operator's truth-table: what's real, what's fragile, what's a calibrated guess, what would break first under production load. Written after Phase 5 ship — every numbered item below is a known limitation, not aspirational. No marketing here.

1. What's actually real today (sandbox)

Capability	Status	Backing
Real macro inputs (DFII10, VIXCLS, BAMLH0A0HYM2, T10Y2Y, DTWEXBGS, FEDFUNDS, DGS30)	✓ Real	2,781 obs from FRED, 2024-05 → 2026-05
Real daily bars for SPX, NDX, CL, BRENT, EURUSD, USDJPY, DXY, BTC, ETH	✓ Real	FRED daily series, close-only (synthesized OHLC = close)
HMM regime classifier on real 730-day FRED feature matrix	✓ Real	Converged, persisted, Reflation 84% posterior
Heuristic regime fallback	✓ Real	Rule-based on 7 macro inputs incl. DGS30
Butterfly cascade engine (15 canonical events)	✓ Real code	Magnitudes are human priors, not calibrated — see §3
Hypothesis registry with hash-chained tamper-evident log	✓ Real	Verified across 4 ticks + close + tamper-detection test
Auto-event tick + monitor (5-min cron-able)	✓ Real	Demonstrated: war → ceasefire → invalidation chain
Cross-asset MCPT — cvd_trend PASSES on SPX/NDX/BTC/ETH/USDJPY	✓ Real result	100-perm MCPT, p ≤ 0.05 on 5 of 9 assets
Portfolio optimizer combining hypothesis views	✓ Real	BL-lite (see §5 for what's missing vs full BL)
189 tests passing	✓ Real	Includes 19 cascade/hypothesis + 13 strategy/MCPT/portfolio
File locking via `fcntl` for write contention	⚠ Partial	Engine state has locks; hypothesis log doesn't (see §4.1)

2. What's synthetic in sandbox but real on the Mac

These work end-to-end the moment bin/bootstrap_data.py --all runs on a host with normal internet.

SOL, HYPE, individual equities (NVDA, AAPL, MSFT, GOOGL, META, AMZN, TSLA, AMD, COIN), XAU, XAG, XCU: all currently synthesized by _gen_bars_beta, which beta-anchors each series to real NDX/BTC/DXY bars. Beta values are reasonable midpoints but uncalibrated.
Hyperliquid live mark prices / funding rates / L2 book — POST blocked from sandbox. bin/refresh_fees_from_hl.py runs on the Mac.
DeFiLlama yields — yields.llama.fi blocked from sandbox. Pulls cleanly on Mac.
RSS feeds / GDELT events — all blocked from sandbox. Production cron handles this.

3. What's calibrated guess, not measured

This is the biggest honest gap. None of the following have been validated against empirical event-window regression. They are human-chosen midpoints.

3.1 Cascade magnitudes per event class

The YAML config/event_ontology.yaml declares:

war_outbreak_major:
  primary:
    - { asset: "XAU", expected_pct_move: +4.0, half_life_days: 21, confidence: 0.80 }

Where does +4.0% come from? It's "what gold typically did after Russia/Ukraine, Hamas/Israel, etc." But:
- Russia/Ukraine (Feb 2022): XAU spiked 5% in 2 weeks, gave it all back in 6 weeks
- Hamas/Israel (Oct 2023): XAU flat for first 3 days, then +6% in next 2 weeks
- Past 9 major war outbreaks 1990-2023: median XAU return at +21d = +3.2%, but standard deviation = 5.5%

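We declare a point estimate near the median and ignore the variance. With a standard deviation larger than the median itself, a position sized to the magnitude alone would see the realized move fall short of the declared +4.0% more than half the time, and a sizeable share of outcomes would land on the wrong side of zero. Real fix: store (median, sd) per impact and size to (median − 0.5×sd) for adversarial protection.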
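A minimal sketch of that sizing rule, using the 1990-2023 XAU figures quoted above. `ImpactPrior` and `conservative_magnitude` are hypothetical names (the current YAML stores a single magnitude), and the clamp at zero is an added assumption so that a wide distribution never flips the trade direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactPrior:
    """Hypothetical container; not part of the current ontology schema."""
    asset: str
    median_pct: float  # median move across historical analogues, in percent
    sd_pct: float      # standard deviation across the same analogues, in percent


def conservative_magnitude(prior: ImpactPrior, k: float = 0.5) -> float:
    """Shrink the median toward zero by k standard deviations.

    The result is clamped at zero (an assumption) so it never changes sign.
    """
    sign = 1.0 if prior.median_pct >= 0 else -1.0
    shrunk = abs(prior.median_pct) - k * prior.sd_pct
    return sign * max(shrunk, 0.0)


xau = ImpactPrior(asset="XAU", median_pct=3.2, sd_pct=5.5)
print(round(conservative_magnitude(xau), 2))  # 3.2 - 0.5 * 5.5 -> 0.45
```

With these inputs the sizing magnitude drops from +3.2% to +0.45%, which shows how little conviction the historical sample actually supports.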

3.2 Confidence values per impact

confidence: 0.80 is also a guess. We have no historical data backing "80% confidence the cascade fires as described." This influences position sizing through the View score. Currently un-validated.

3.3 Half-lives

The decay parameter half_life_days controls position TTL. Different events have wildly different actual half-lives (Fed pivots: weeks; Trump tweets: hours unless reinforced). The YAML assigns reasonable defaults but no event has been backtested for half-life calibration.

3.4 Beta-anchor synthesis values

NVDA β=1.6 × NDX is a typical historical beta, but it is hard-coded as a constant, whereas NVDA's realized beta to NDX has ranged from 1.2 to 2.4 over 2022-2025. Production calibration needs a rolling 60-day beta refresh.
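One way the rolling refresh could look, as a standard-library sketch. `rolling_beta` is a hypothetical helper, not an existing MacroGuru function, and it assumes both inputs are aligned daily return series.

```python
from typing import List, Optional


def rolling_beta(asset_ret: List[float], bench_ret: List[float],
                 window: int = 60) -> List[Optional[float]]:
    """Trailing-window beta of an asset against a benchmark.

    Emits None until a full window exists or when benchmark variance is zero.
    """
    if len(asset_ret) != len(bench_ret):
        raise ValueError("return series must have equal length")
    out: List[Optional[float]] = []
    for i in range(len(asset_ret)):
        if i + 1 < window:
            out.append(None)
            continue
        a = asset_ret[i + 1 - window: i + 1]
        b = bench_ret[i + 1 - window: i + 1]
        mean_a = sum(a) / window
        mean_b = sum(b) / window
        # beta = cov(asset, bench) / var(bench); the 1/n factors cancel
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        var = sum((y - mean_b) ** 2 for y in b)
        out.append(cov / var if var > 0 else None)
    return out
```

The synthesis step would then read the latest non-None value instead of the fixed 1.6.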

3.5 HL fee schedule

HL_TAKER_FEE_BPS = 4.5 is from the original design phase. HL changed fees in May 2025 — current real taker is 2.5 bps. bin/refresh_fees_from_hl.py is shipped but hasn't run yet because HL is blocked from sandbox. Until it runs, every cost-of-trade estimate in MacroGuru is ~80% too high.

4. What's fragile (works today, breaks under load)

4.1 Hypothesis log has no concurrent-write lock

HypothesisRegistry._append opens the log in append mode but doesn't take an fcntl lock. If bin/event_tick.py is running while operator runs bin/whatif.py --persist, the appends interleave and the chain breaks. Single-writer assumption is implicit. Fix is straightforward (add fcntl.flock(LOCK_EX)) but not done.
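A sketch of what a locked append might look like. The record layout (`prev_hash`, `own_hash`, `payload`) and the `GENESIS` sentinel are assumptions for illustration, not the registry's actual format. The point it tries to make is that the lock has to cover both reading the tail hash and writing the new record, otherwise two writers can chain off the same predecessor.

```python
import fcntl
import hashlib
import json
import os

GENESIS = "0" * 64  # assumed sentinel for the first record


def append_chained(path: str, payload: dict) -> str:
    """Append one hash-chained record under an exclusive advisory lock (POSIX only)."""
    with open(path, "a+", encoding="utf-8") as fh:
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX)  # blocks until other writers finish
        try:
            fh.seek(0)
            lines = fh.read().splitlines()
            prev_hash = json.loads(lines[-1])["own_hash"] if lines else GENESIS
            body = json.dumps(payload, sort_keys=True)
            own_hash = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
            record = {"prev_hash": prev_hash, "own_hash": own_hash, "payload": payload}
            fh.write(json.dumps(record, sort_keys=True) + "\n")  # "a+" writes at EOF
            fh.flush()
            os.fsync(fh.fileno())  # persist before releasing the lock
        finally:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
    return own_hash
```

`flock` is advisory, so it only helps if every writer (cron tick and operator CLI alike) goes through the same function.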

4.2 Log replay is O(N) on every operation

Every list_active() reads the entire log. At 1,000 hypotheses (~ a year of 5-min ticks at 1 hypothesis per few hours) the file is ~10 MB and replay takes ~1 sec. At 100k it's a problem. Need snapshot mechanism: periodically dump the current active set to a snapshot file, replay only the suffix since the snapshot.
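The snapshot idea can be sketched as follows. The record schema (`id`, `status`) and all function names here are hypothetical. The snapshot stores the active set together with the byte offset it covers, so later reads only replay the log suffix.

```python
import json
import os
from typing import Dict, Tuple


def replay(log_path: str, active: Dict[str, dict], offset: int) -> Tuple[Dict[str, dict], int]:
    """Apply log records from byte `offset` onward; return active set and new offset."""
    with open(log_path, "rb") as fh:
        fh.seek(offset)
        for raw in fh:
            rec = json.loads(raw)
            if rec["status"] == "OPEN":
                active[rec["id"]] = rec
            else:
                active.pop(rec["id"], None)  # any closed status removes it
        return active, fh.tell()


def write_snapshot(snap_path: str, active: Dict[str, dict], offset: int) -> None:
    """Write the snapshot to a temp file, then rename it into place atomically."""
    tmp = snap_path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"offset": offset, "active": active}, fh)
    os.replace(tmp, snap_path)


def list_active(log_path: str, snap_path: str) -> Dict[str, dict]:
    """Start from the snapshot if one exists, then replay only the suffix."""
    active: Dict[str, dict] = {}
    offset = 0
    if os.path.exists(snap_path):
        with open(snap_path, encoding="utf-8") as fh:
            snap = json.load(fh)
        active, offset = snap["active"], snap["offset"]
    active, _ = replay(log_path, active, offset)
    return active
```

This sketch does not verify the hash chain or handle a half-written trailing line; a real version would need both.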

4.3 Hash-chain protects against in-place modification, NOT full replacement

If an attacker (or a buggy admin) rewrites the entire log with a new self-consistent chain, our verifier can't tell. The chain proves internal consistency, not origin. Production fix: periodically anchor the latest own_hash to an external timestamp service (or sign it with an HMAC key held on the operator's hardware token).
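A minimal sketch of the HMAC variant, assuming the key is available as raw bytes. In practice the signing would happen on the hardware token, and `sign_anchor` / `verify_anchor` are hypothetical names.

```python
import hashlib
import hmac


def sign_anchor(key: bytes, own_hash: str, anchored_at: str) -> str:
    """Tag the chain head together with the time it was anchored."""
    msg = f"{anchored_at}|{own_hash}".encode("utf-8")
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def verify_anchor(key: bytes, own_hash: str, anchored_at: str, tag: str) -> bool:
    """Constant-time comparison against a freshly computed tag."""
    return hmac.compare_digest(sign_anchor(key, own_hash, anchored_at), tag)
```

The protection only holds if the key lives somewhere the log writer cannot reach. A key stored next to the log lets whoever rewrites the log re-sign it too.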

4.4 Status transitions aren't validated

HypothesisRegistry.update_status accepts a transition from any status to any other. It should be a state machine: OPEN can only go to {INVALIDATED, EXPIRED, TAKEN_PROFIT, STOPPED_OUT, SUPERSEDED}, and closed states can't reopen. Today there's nothing stopping a buggy caller from "reopening" an invalidated hypothesis.
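The state machine described above is small enough to sketch directly. The status names come from the text; the `Status` enum and `check_transition` are hypothetical and would need to be called from update_status before anything is appended.

```python
from enum import Enum


class Status(Enum):
    OPEN = "OPEN"
    INVALIDATED = "INVALIDATED"
    EXPIRED = "EXPIRED"
    TAKEN_PROFIT = "TAKEN_PROFIT"
    STOPPED_OUT = "STOPPED_OUT"
    SUPERSEDED = "SUPERSEDED"


# OPEN may move to any closed state; closed states are terminal.
ALLOWED = {Status.OPEN: frozenset(s for s in Status if s is not Status.OPEN)}


def check_transition(current: Status, new: Status) -> None:
    """Raise ValueError unless current -> new is an allowed transition."""
    if new not in ALLOWED.get(current, frozenset()):
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
```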

4.5 Monitor's price stops have `price_threshold=None`

In _build_hypothesis_from_cascade, we set direction but leave price_threshold=None because the executor is supposed to fill it in at position-open. The executor doesn't do this yet. So in the current code, PRICE_STOP_LOSS / PRICE_TAKE_PROFIT checks never fire — they short-circuit on the price_threshold is None check in HypothesisMonitor.evaluate_one. Time-decay and contradicting-event invalidation work; price-based stops are skeletal.
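For illustration, the missing executor step could be as small as the function below. The percentage-based stop and take-profit inputs, the +1/-1 direction encoding, and the name `thresholds_at_open` are all assumptions, since the source does not say how thresholds should be derived.

```python
from typing import Tuple


def thresholds_at_open(entry_price: float, direction: int,
                       stop_loss_pct: float, take_profit_pct: float) -> Tuple[float, float]:
    """Return (stop_price, take_profit_price) for a position opened at entry_price.

    direction is +1 for long, -1 for short (assumed encoding).
    """
    if direction not in (1, -1):
        raise ValueError("direction must be +1 or -1")
    if entry_price <= 0:
        raise ValueError("entry_price must be positive")
    stop = entry_price * (1 - direction * stop_loss_pct / 100.0)
    take = entry_price * (1 + direction * take_profit_pct / 100.0)
    return stop, take
```

For a long at 100 with a 2% stop and 4% target this gives roughly 98 and 104; for a short the two sides swap. Separately, it may be worth making the monitor log or raise when it meets an open hypothesis whose threshold is still None, so the gap is visible rather than silent.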

4.6 Regime-flip invalidation is binary

If the regime moves from Goldilocks → Reflation (related but different), it counts as a flip and invalidates the hypothesis. It should use a similarity matrix instead: Goldilocks↔Reflation = mild adjustment, Goldilocks↔Bust = sharp invalidation.

4.7 Keyword matcher has false-positive risk

"Trump starts war on inflation" would trigger war_outbreak_major because "starts war" stems to "start war" and the matcher finds it. Real production needs entity extraction (Trump-vs-Trump-rhetoric, war-on-thing vs war-with-country). No NER in current pipeline.

4.8 Wire/syndication source dedup

GDELT/RSS pull will return the same AP-wire story across 30 outlets. Our cascade currently counts each as a separate source and sqrt-boosts confidence. Need source-canonicalization (cluster by content hash or AP wire ID).
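A rough sketch of content-hash canonicalization. `content_key` and `count_independent_sources` are hypothetical names, and the normalization (collapse whitespace, lowercase) is an assumption about what syndicated copies differ by.

```python
import hashlib
import re
from typing import Iterable


def content_key(text: str) -> str:
    """Hash of the article body after collapsing whitespace and case."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def count_independent_sources(bodies: Iterable[str]) -> int:
    """Number of distinct stories, which is what the confidence boost should see."""
    return len({content_key(body) for body in bodies})
```

Exact hashing only catches verbatim reprints. Outlets that trim or lightly edit the wire copy would still count separately, so a wire ID or near-duplicate clustering would be the sturdier option.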

5. What's intentionally simpler than the textbook

5.1 Portfolio optimizer is "BL-lite", not full Black-Litterman

Full BL needs:
- Prior expected returns π (market-equilibrium implied; we'd derive them from CAPM using a multi-asset market index)
- View matrix P (which assets each view is on) and view vector q (the return each view asserts)
- View confidence matrix Ω (uncertainty per view)
- Asset covariance matrix Σ (rolling 60-day from real returns), scaled by the prior-uncertainty scalar τ
- Solving the BL posterior: E[R] = ((τΣ)⁻¹ + PᵀΩ⁻¹P)⁻¹ ((τΣ)⁻¹π + PᵀΩ⁻¹q)

We do none of that math. Instead we score each view by |expected| × confidence × time-decay, allocate proportional to score, cap per-asset and gross. This works for the small-N case (<20 views) but doesn't honor cross-asset correlations.
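The scoring rule just described might look like the sketch below. The `View` fields, the exponential form of the time decay, and the 10% per-asset cap are assumptions; only the 0.30 gross cap is taken from this document.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class View:
    asset: str
    expected_pct: float    # signed expected move
    confidence: float      # 0..1
    age_days: float
    half_life_days: float


def score(v: View) -> float:
    """|expected| x confidence x time-decay (decay assumed exponential)."""
    return abs(v.expected_pct) * v.confidence * 0.5 ** (v.age_days / v.half_life_days)


def allocate(views: List[View], max_per_asset: float = 0.10,
             max_gross: float = 0.30) -> Dict[str, float]:
    """Signed weights proportional to score, capped per asset and in gross."""
    signed: Dict[str, float] = {}
    for v in views:
        signed[v.asset] = signed.get(v.asset, 0.0) + math.copysign(score(v), v.expected_pct)
    total = sum(abs(s) for s in signed.values())
    if total == 0:
        return {a: 0.0 for a in signed}
    # Scale to the gross cap first; clipping per asset can only reduce gross further.
    return {a: max(-max_per_asset, min(max_per_asset, max_gross * s / total))
            for a, s in signed.items()}
```

Nothing in this allocation looks at how the assets move together, which is the limitation the rest of this subsection is about.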

Practical consequence: if 5 views all say "long crypto" (BTC, ETH, SOL, HYPE, AVAX), the optimizer sizes each individually. They're all 90% correlated. The true risk is one big crypto-beta bet at 50% gross, not five 10% bets. Production needs Σ.
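The arithmetic behind that claim can be checked with a few lines. The sketch assumes all five assets have the same volatility (normalized to 1.0) and a single pairwise correlation, which is a simplification.

```python
import math
from typing import List


def portfolio_vol(weights: List[float], vols: List[float], rho: float) -> float:
    """Portfolio volatility with one shared pairwise correlation rho."""
    var = 0.0
    for i, (wi, si) in enumerate(zip(weights, vols)):
        for j, (wj, sj) in enumerate(zip(weights, vols)):
            corr = 1.0 if i == j else rho
            var += wi * wj * si * sj * corr
    return math.sqrt(var)


five_bets = [0.10] * 5
unit_vol = [1.0] * 5
print(round(portfolio_vol(five_bets, unit_vol, rho=0.9), 3))  # 0.48
print(round(portfolio_vol(five_bets, unit_vol, rho=0.0), 3))  # 0.224
print(round(portfolio_vol([0.50], [1.0], rho=0.0), 3))        # 0.5
```

At 90% correlation the five positions carry about 96% of the risk of a single 50% position, versus about 45% if they were uncorrelated.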

5.2 No CVaR constraint, just gross cap

The full plan said CVaR_95 ≤ X% of NAV as a constraint. Implementing that needs:
- Monte Carlo simulation of joint asset returns
- 95th-percentile loss computation (VaR) and the mean loss beyond it (CVaR)
- QP solver respecting that as a constraint

We have: max_gross_pct = 0.30. Crude but bounded.
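For reference, the measurement half of a CVaR constraint is short; the hard part is the solver. This sketch computes CVaR from a list of simulated portfolio P&L values, where `cvar` is a hypothetical helper and the tail-size rounding is one of several reasonable conventions.

```python
from typing import List


def cvar(pnl_samples: List[float], alpha: float = 0.95) -> float:
    """Mean loss in the worst (1 - alpha) tail, reported as a positive number."""
    if not pnl_samples:
        raise ValueError("need at least one sample")
    losses = sorted((-p for p in pnl_samples), reverse=True)  # largest loss first
    n_tail = max(1, int(round(len(losses) * (1 - alpha))))
    return sum(losses[:n_tail]) / n_tail


# 100 scenarios losing 1..100 units: the worst 5 are 96..100, so CVaR_95 = 98.
scenarios = [-float(i) for i in range(1, 101)]
print(cvar(scenarios))  # 98.0
```

A post-hoc check such as `cvar(simulated_pnl) <= limit` could reject an allocation even without a solver, at the cost of not telling the optimizer how to fix it.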

5.3 No factor model — just hand-coded asset lists

factor_exposure_decomposition uses a Python dict mapping {"crypto": {"BTC", "ETH", ...}}. A real factor model would use a regression beta matrix where each asset has loadings on multiple factors (a tech equity has equity_beta + size + momentum factor exposures simultaneously). Today every asset is one bucket only.

5.4 Strategy parameter sweep is manual

flow_trend_ls_slow exists with hand-picked z_entry=0.7 (vs the original 1.5). Real production needs:
- Grid search across (fast_period, slow_period, z_entry) per asset
- Walk-forward validation (not in-sample fit)
- MCPT on the BEST parameter set per asset

We didn't do this. The slow variant exists only to demonstrate the framework; its parameters are picked by intuition.
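A sketch of the two missing pieces of scaffolding: enumerating a parameter grid and generating walk-forward windows. The grid values are placeholders apart from the two z_entry settings mentioned above, and both function names are hypothetical.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple

GRID = {  # placeholder values except z_entry, which uses the two settings in the text
    "fast_period": [10, 20],
    "slow_period": [50, 200],
    "z_entry": [0.7, 1.5],
}


def param_grid(grid: Dict[str, list]) -> List[dict]:
    """Every combination of the grid as a list of keyword dicts."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def walk_forward_splits(n_bars: int, train_len: int,
                        test_len: int) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges; test windows never overlap."""
    start = 0
    while start + train_len + test_len <= n_bars:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += test_len  # roll forward by one test window
```

Parameters would be chosen on each train window and scored only on the following test window, so the reported performance is never an in-sample fit.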

5.5 MCPT is 100 permutations in sandbox, not 500

The perpsim standard is 500 permutations. With 100, the minimum reportable p-value is 1/101 = 0.0099. All our "PASS" cells bottomed out at exactly 0.0099. We don't know whether real 500-perm tests would still pass. Production: rerun at 500+ on the Mac. Also: backtest loops re-compute SMA200 / CVD from scratch on every bar, O(n²) per backtest. Production needs rolling caches.
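The floor on the p-value follows from the usual permutation-test convention of counting the real run as one of the samples. The sketch below assumes that convention, which is consistent with the 1/101 floor quoted above; `mcpt_p_value` is a hypothetical name.

```python
from typing import Sequence


def mcpt_p_value(real_stat: float, permuted_stats: Sequence[float]) -> float:
    """Share of runs (permutations plus the real one) at least as good as the real run."""
    n_ge = sum(1 for s in permuted_stats if s >= real_stat)
    return (1 + n_ge) / (1 + len(permuted_stats))


# A strategy that beats every permutation can report no better than 1 / (n + 1).
print(mcpt_p_value(10.0, [0.0] * 100))  # 1/101, about 0.0099
print(mcpt_p_value(10.0, [0.0] * 500))  # 1/501, about 0.0020
```

A cell sitting exactly at 1/101 therefore only says "no permutation beat it out of 100", and a 500-permutation rerun is what would distinguish a true p of 0.002 from one of 0.03.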

5.6 Block-shuffle MCPT preserves only weekly autocorrelation

7-day blocks preserve within-week microstructure but break month-end / quarter-end / Fed-meeting clustering. A strategy that exploits the first-of-month flow effect would fail MCPT spuriously (because the permuted paths break that clustering). Production fix: use multiple block lengths (7d, 21d, 90d) and average the p-values.
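A block shuffle with a configurable block length is a few lines; the sketch below is a generic version, not the existing perpsim implementation. Order inside each block is preserved, and only the order of blocks is permuted.

```python
import random
from typing import List, Sequence


def block_shuffle(returns: Sequence[float], block_len: int,
                  rng: random.Random) -> List[float]:
    """Permute contiguous blocks of returns, keeping within-block order intact."""
    if block_len < 1:
        raise ValueError("block_len must be >= 1")
    blocks = [list(returns[i:i + block_len]) for i in range(0, len(returns), block_len)]
    rng.shuffle(blocks)
    return [r for block in blocks for r in block]
```

Running the same MCPT once per block length in (7, 21, 90) and combining the resulting p-values would then be a loop around this function, with the caveat that 90-day blocks leave very few distinct permutations on a two-year sample.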

6. What would break first in production

In rough order of likelihood:

HL validators rewriting historical funding/marks during a tail event (JELLY 2025 precedent). Our cache assumes immutability of historical reads. A subsequent re-fetch could return different values for the same timestamp. Fix: hash every cached read and reject when the hash changes silently.
Concurrent hypothesis writes (cron running bin/event_tick.py while the operator runs bin/whatif.py --persist). The appends interleave and the chain breaks. Recovery requires a manual log rewrite. The fix is one fcntl.flock (§4.1).
A genuinely novel event (e.g. AI lab announces it's solving protein-folding and biotech blows up 200%) that doesn't match any keyword. We emit unknown_event and the cascade does nothing. The market does enormous work. We miss it entirely.
Cascade magnitude calibration breaking on a fat-tail event. Russia/Ukraine actually moved gold +12% in the first 3 days, then -6% — our YAML says median +4% which would have severely undersized the long-gold position in the first 72h.
MCPT p-values drifting under regime change. A strategy that passed MCPT on 2024-2026 data may fail completely in 2027 if the macro regime flips. No auto-refresh of MCPT pass/fail status.
Yahoo/Stooq data going down or changing format silently. We have FRED-only ingest in sandbox; on Mac, we depend on free data sources that have already broken once (yfinance Feb 2025).
Stop-loss not firing because executor never set the price_threshold (§4.5). This is a real bug, not a corner case. Any open hypothesis where the linked positions move against the thesis will hold to time-decay (3× horizon) rather than stopping out.
Operator confusion about which dashboard is which. Even with the cross-linking and index.html, there are now 4 HTML pages (index, dashboard, whatif, MACRO_GURU_dashboard) plus DATA_QUALITY.md. Operator drift is real.
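For the first item in the list above (historical reads changing after the fact), the proposed fix could be sketched as a digest registry. `ImmutableReadCache` is a hypothetical name, it assumes payloads are JSON-serializable, and a real version would persist the digests rather than hold them in memory.

```python
import hashlib
import json
from typing import Any, Dict


class ImmutableReadCache:
    """Remembers a digest per (series, timestamp) key and rejects silent changes."""

    def __init__(self) -> None:
        self._digests: Dict[str, str] = {}

    def check(self, key: str, payload: Any) -> None:
        """Record the first digest seen for key; raise if a later read differs."""
        blob = json.dumps(payload, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(blob).hexdigest()
        known = self._digests.setdefault(key, digest)
        if known != digest:
            raise ValueError(f"historical read changed for {key}")
```

Raising is the conservative choice here, because a rewritten funding or mark history should stop the pipeline until a human decides which version to trust.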

7. Test coverage gaps

189 tests pass. Things they don't cover:

End-to-end integration: event_tick.py → portfolio.py → paper_trade_1m.py. Each component is unit-tested in isolation.
Load / stress: what happens at 1k, 10k, 100k hypotheses.
HTML / JS: no browser-driver tests of whatif.html or index.html.
Concurrent multi-process behavior on the hypothesis log.
Recovery from corrupt log mid-write.
The cron scheduler itself — we have the bin/event_tick.py script but no test of the launchd plist or crontab.
Real-money flow path. Everything is paper. There's no test that says "wired correctly to HL broker LIVE submit" because that submit path is raise NotImplementedError.

8. What's intentionally not built

These were on the roadmap but explicitly deferred:

HL broker LIVE submit — HyperliquidBroker.submit() raises NotImplementedError. 1-2 weeks of careful engineering + 2 weeks of paper observation before flipping. Until this exists, MacroGuru is observation-only.
DeFi yield LIVE deposit — per-protocol supply / withdraw not implemented. Same story.
HIP-4 binary execution — placeholders only.
Real NLP entity extraction — keyword + light-stem matching, no NER. Phase 6+.
LLM-augmented cascade reasoning — current cascade walk is deterministic graph traversal. An LLM call could reason about novel events the YAML doesn't cover. Not built; would need careful prompt control.
Continuous calibration loop — no machinery to take the outcome of closed hypotheses and update the ontology magnitudes based on what actually happened. We'd need to track "predicted vs realized" per asset per hypothesis and apply Bayesian update to YAML priors.
Multi-strategy MCPT optimization — we test (strategy, asset) but don't search over strategy parameter space. Real edge mining needs walk-forward grid search.

9. What I'd ship next if I had another week

In priority order:

Fix §4.5 — wire up price thresholds at hypothesis-open time so stop-loss / take-profit actually fire.
Fix §4.1 — fcntl.flock on hypothesis log writes.
Calibrate the cascade magnitudes — run event-window regression on past 10 years of analogous events for the top-5 most-impactful event classes (Fed pivots, war outbreaks, trade-war escalations, OPEC decisions, crypto regulatory shifts). Store (median, sd) per impact and size positions to median - 0.5×sd for adversarial protection.
Real cost-model calibration — run bin/refresh_fees_from_hl.py on Mac, port HL's current 2.5/1.0 bps fee schedule into cost_model.py.
Rolling SMA caches in strategies — bring MCPT runtime down from O(N²) to O(N), enable 500-perm runs in <10s per cell.
Source-deduplication for wire syndication — content-hash before counting RSS items.
Add LLM fallback — when keyword parser emits unknown_event, optionally call an LLM with the event ontology in-context and ask it to map the input. Gate carefully.
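The rolling-cache item in the list above amounts to keeping a running sum instead of re-summing the window on every bar. A minimal sketch follows; `RollingSMA` is a hypothetical class, not the current strategy code.

```python
from collections import deque
from typing import Deque, Optional


class RollingSMA:
    """Simple moving average updated in O(1) per bar via a running sum."""

    def __init__(self, period: int) -> None:
        if period < 1:
            raise ValueError("period must be >= 1")
        self.period = period
        self._window: Deque[float] = deque()
        self._sum = 0.0

    def update(self, close: float) -> Optional[float]:
        """Feed one close; returns the SMA once a full window is available."""
        self._window.append(close)
        self._sum += close
        if len(self._window) > self.period:
            self._sum -= self._window.popleft()  # drop the bar leaving the window
        if len(self._window) < self.period:
            return None
        return self._sum / self.period
```

Over very long series the running sum can accumulate floating-point drift, so an occasional full recompute may be worth adding.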

10. What I'd never claim works

To be explicit, MacroGuru cannot today:

Execute real trades on Hyperliquid. The broker submit() is unimplemented.
Auto-tune its own ontology magnitudes. They're priors only.
Detect novel event classes. Coverage is 15 canonical events.
Survive a parameter-overfit MCPT challenge. We tune in-sample.
Replace a human macro analyst. It's a force multiplier for one — it tracks 50+ hypotheses with explicit invalidation that a human couldn't maintain, but it can't generate the novel theses themselves.

Bottom line: the architecture is correct, the wiring is correct, 189 tests pass. The CALIBRATION is the open frontier. Until §3.1 is done (empirical event-window regression backing the magnitudes), every position MacroGuru opens is acting on prior beliefs, not posterior knowledge.

Operator's first priority should be: before running this live with real money, complete §3.1 calibration on at least the top 5 event classes, and fix §4.5 so stops actually fire.

Generated 2026-05-22 after Phase 5 ship.