feat: add prediction rebuild design spec — Layer 1 ridge model, LLM news overlay, volatility regime detector

Documents complete replacement of six-signal aggregator with calibrated ridge forecaster trained on 435 weeks of BEIS pump prices. Five-layer architecture: weekly baseline (Layer 1), local snapshot (Layer 2), rule-gated verdict merger (Layer 3), daily LLM news
2026-05-01 13:23:10 +01:00
parent e821a934a5
commit 25cf022964
1 changed files with 618 additions and 0 deletions
--- a/docs/superpowers/specs/2026-05-01-prediction-rebuild-design.md
+++ b/docs/superpowers/specs/2026-05-01-prediction-rebuild-design.md
@@ -0,0 +1,618 @@
 # Prediction Rebuild — Design Spec
 ## Context
 The current prediction service (`NationalFuelPredictionService` + six signal
 classes) produces output the user has repeatedly described as "doesn't make
 sense": headlines that contradict their own reasoning text, weights that
 nobody can justify numerically, and confidence values that aren't grounded in
 any track record. Two earlier docs (`.claude/rules/scoring.md`, `.claude/rules/prediction.md`)
 disagree on the weights of the same signals, which is itself evidence that
 the design has drifted.
 This spec replaces the entire prediction stack from scratch around the
 historical data we actually have, with a model whose confidence values are
 calibrated against its own backtested track record.
 Goals:
 - A "fill up now or wait?" call honest about uncertainty.
 - Confidence values calibrated against backtested residuals — "70%" actually
  means "in 7 of every 10 cases like this, the model called direction right".
 - Simple enough to debug a year from now.
 - Remove the six-signal aggregator entirely.
 - Recognise that pump prices, while *measured* weekly by BEIS, can *move* daily
  during oil shocks (Iran, OPEC surprise cuts, Hormuz disruption). The static
  weekly forecast must be backed by a daily news/event overlay so we can flag
  staleness in real time rather than pretend a Monday number is still valid on
  Thursday after a 6% Brent move.
 ---
 ## Inputs (audited 2026-05-01)
 | Source | Status | Use in v1 |
 |---|---|---|
 | `weekly_pump_prices` | 435 weeks, all Mondays, 0 outliers, 1 duty change (Mar 2022, 57.95p → 52.95p), VAT stable at 20% | **Foundation** — train Layer 1 |
 | `station_prices_current` | ~7,550 stations × e10, ~7,620 × b7_standard | **Layer 2** — descriptive snapshot |
 | `stations` | 7,747 stations, 1,989 supermarkets, lat/lng | Layer 2 |
 | `station_prices` | 75 days of changes since 2026-01-16, sample mix uneven per day | Not modelled in v1, but **used by the volatility regime detector** as a churn indicator (% stations changing price / day vs 30-day baseline). |
 | `brent_prices` | 30 days only | **Backfilled in Phase 7** (8 years from FRED, single API call). Used as a Brent-move volatility trigger and as fuel for the daily LLM overlay. |
 The Fuel Finder API has been confirmed empirically to have **no historical
 archive** — `effective-start-timestamp` is a station-level filter on current
 prices, not a time-window query. Per-station deep history can only accrue
 forward from the date polling started.
 ---
 ## Architecture — five thin layers
 ### Layer 1 — National weekly forecaster (predictive, calibrated)
 Trained once weekly on `weekly_pump_prices`. Output:
 - `direction ∈ {rising, falling, flat}`
 - `magnitude_pence` — predicted Δ price next week
 - `ridge_confidence` (0–100) — calibrated from backtested residuals, not
  from the model's raw output
 This is the **quantitative baseline**. It updates only when the BEIS Monday
 publication arrives (so the *forecast itself* changes weekly), but its
 *displayed confidence* (Layer 3) is adjusted in real time by Layers 4 and 5.
 `direction = flat` whenever `|magnitude_pence| < FLAT_THRESHOLD`. Phase 3
 picks `FLAT_THRESHOLD` from the backtest residual distribution; the
 starting value is **0.2p / litre**.
 ### Layer 2 — Local snapshot (descriptive, NOT predictive)
 Pure SQL aggregates against `station_prices_current` + Haversine on
 `stations.lat/lng`. No ML, no history, no surprises:
 - `local_avg_50km(fuel_type, lat, lng)`
 - `national_avg(fuel_type)`
 - `cheapest_within(km, fuel_type, lat, lng)`
 - `supermarket_avg_local`, `major_avg_local`, gap
 Layer 2 never speaks about the future. It describes the present.
 ### Layer 3 — Verdict merger (rule-based gates, no multipliers)
 Single user-facing verdict ∈ {`fill_now`, `wait`, `no_signal`}. The
 displayed confidence number is `ridge_confidence` itself, **untouched**.
 LLM agreement and volatility status are shown as separate **badges**, not
 blended into the number. Honesty over smoothing.
 Gates evaluated in order, first match wins:
 ```
 1. direction == 'flat'                                 → no_signal
 2. ridge_confidence < 40                               → no_signal
 3. volatility_regime active                            → no_signal  (badge: volatile)
 4. LLM disagrees AND ridge_confidence < 75             → no_signal  (badge: conflicting)
 5. rising  AND ridge_confidence >= 70                  → fill_now
 6. falling AND ridge_confidence >= 70                  → wait
 7. otherwise (40 <= conf < 70, no veto from 3 or 4)    → dashboard-only
 ```
 Why gates, not multipliers:
 - A multiplied confidence number is a black-box blend that the user can't
  audit. A 70% that used to be 90% before today's volatility hit looks
  identical to a 70% that's been calibrated all along.
 - Gates compose cleanly. Each rule has one job and is independently
  testable.
 - The verdict is a discrete choice anyway (notify / don't / silent). Smoothing
  confidence under the hood doesn't help that decision; it only obscures it.
 Layer 2 affects **urgency wording only** ("fill up now, *especially* in
 your area at 2p above national"). It never changes the verdict. Layers 4
 and 5 can suppress the verdict (gates 4 and 3 respectively) but never flip
 the direction.
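The gate ordering above can be sketched as a single pure function. This is a minimal illustration, not the service's actual code: the names `merge_verdict`, `Verdict` and the `llm_disagrees` flag are assumptions introduced here, and the sketch returns only the badges that the gate table itself names.

```python
from typing import NamedTuple, Optional


class Verdict(NamedTuple):
    verdict: str           # 'fill_now', 'wait', 'no_signal' or 'dashboard-only'
    badge: Optional[str]   # 'volatile', 'conflicting' or None
    gate: int              # which of the seven gates fired


def merge_verdict(direction: str, ridge_confidence: int,
                  volatility_active: bool, llm_disagrees: bool) -> Verdict:
    """Evaluate the seven gates in order; the first match wins."""
    if direction == "flat":
        return Verdict("no_signal", None, 1)
    if ridge_confidence < 40:
        return Verdict("no_signal", None, 2)
    if volatility_active:
        return Verdict("no_signal", "volatile", 3)
    if llm_disagrees and ridge_confidence < 75:
        return Verdict("no_signal", "conflicting", 4)
    if direction == "rising" and ridge_confidence >= 70:
        return Verdict("fill_now", None, 5)
    if direction == "falling" and ridge_confidence >= 70:
        return Verdict("wait", None, 6)
    # Only 40 <= confidence < 70 with no veto can reach this point.
    return Verdict("dashboard-only", None, 7)
```

Because `ridge_confidence` is only read and never modified, the number shown to the user is the same one the calibration produced, whichever gate fires.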
 ### Layer 4 — Daily LLM news overlay (qualitative, news-aware)
 **Single scheduled call at 07:00 UK.** Plus an event-driven refresh when
 Layer 5's volatility flag flips ON (with a 4-hour cooldown so the same
 event doesn't trigger repeatedly).
 JSON in, JSON out. Calls Claude Haiku with web search enabled, asks for
 direction + confidence + cited events with URLs. Stored in a new
 `llm_overlays` table.
 Layer 4 is **read-only with respect to the volatility flag**. It writes
 its result row; only Layer 5 mutates `volatility_regimes.active`.
 LLM confidence is hard-capped at 75 in code (web-searched LLMs are
 systematically overconfident). Calls without `events_cited` are rejected.
 ### Layer 5 — Volatility regime detector (intra-week safety net)
 Hourly cron. **Sole owner** of the `volatility_regimes.active` flag.
 Reads four signals, OR-combined:
 1. Daily Brent move > 3% close-to-close (FRED `DCOILBRENTEU`, Phase 7).
 2. Most recent `llm_overlays.major_impact_event = true` AND at least one
   verified URL.
 3. `station_prices` daily churn rate > 1.5× its 30-day baseline.
 4. A `watched_events` row covering today (manually flagged geopolitical
   periods).
 When the flag flips on:
 - An event-driven LLM refresh is queued (Layer 4) if last run was > 4h ago.
 - **Layer 3's gate 3 fires**: verdict forced to `no_signal` with the
  `volatile` badge.
 - The reasoning text appended: *"Volatility detected ({trigger}) — this
  forecast may be stale within days."*
 When it flips off:
 - Verdict returns to whatever the gates produce on the unchanged
  `ridge_confidence` (no multiplier to reset — there are none).
 - Badge cleared.
 - Next morning's 07:00 LLM call still runs (it always runs); no extra
  refreshes are queued.
 Layer 5 never changes Layer 1's *direction*. It only suppresses the
 verdict via gate 3.
 ---
 ## Methodology — Layer 1
 ### Target
 ```
 ΔULSP[t+1] = ULSP[t+1] − ULSP[t]
 ```
 We model the **change**, not the level. UK pump prices are non-stationary,
 so regressing on levels tends to give a spurious R² and unreliable
 coefficients. First-differencing is the standard remedy and is assumed here
 to leave a stationary series.
 ### Features (all stationary)
 | Feature | Notes |
 |---|---|
 | `Δulsp_lag_0`, `Δulsp_lag_1`, `Δulsp_lag_3` | 1w / 2w / 4w momentum |
 | `Δulsd_lag_0` | Diesel cross-signal as a *change* |
 | `ulsp[t] − ma8[t]` | **Mean-reversion term**: gap between current price and 8-week MA. Expected to be the single most useful feature for the 1-week-ahead UK pump forecast. |
 | `week_of_year_sin`, `week_of_year_cos` | Cyclic seasonality encoding |
 | `is_pre_bank_holiday` | Boolean, within 7 days of UK bank holiday |
 The level only enters as the deviation from MA-8 (itself stationary).
 That's the only way levels are allowed in.
 **Duty change is NOT a feature.** With one event in 435 weeks, n=1 cannot
 fit a meaningful coefficient. Instead, duty-change-adjacent weeks (±4
 weeks of a known change) are handled in the **calibration override**
 (see below) — confidence is halved and the regime flag is surfaced in
 the reasoning text. A regime can be flagged. A coefficient cannot be
 trained from one observation.
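As a rough sketch of how the feature row and target could be built from the weekly series: the function names, the dictionary keys, the choice to include week `t` in the 8-week average, and the 52-week period for the cyclic encoding are all assumptions made for illustration, not details fixed by this spec.

```python
import math
from typing import Dict, Sequence


def build_features(ulsp: Sequence[float], ulsd: Sequence[float], t: int,
                   week_of_year: int, is_pre_bank_holiday: bool) -> Dict[str, float]:
    """Feature row for week index t, using only data up to and including t."""
    if t < 7:
        raise ValueError("need at least 8 weeks of history for the 8-week MA")
    # Assumption: the 8-week MA covers weeks t-7..t (current week included).
    ma8 = sum(ulsp[t - 7:t + 1]) / 8.0
    # Assumption: a 52-week period for the cyclic encoding.
    angle = 2.0 * math.pi * week_of_year / 52.0
    return {
        "d_ulsp_lag_0": ulsp[t] - ulsp[t - 1],
        "d_ulsp_lag_1": ulsp[t - 1] - ulsp[t - 2],
        "d_ulsp_lag_3": ulsp[t - 3] - ulsp[t - 4],
        "d_ulsd_lag_0": ulsd[t] - ulsd[t - 1],
        "ulsp_minus_ma8": ulsp[t] - ma8,
        "week_of_year_sin": math.sin(angle),
        "week_of_year_cos": math.cos(angle),
        "is_pre_bank_holiday": 1.0 if is_pre_bank_holiday else 0.0,
    }


def target(ulsp: Sequence[float], t: int) -> float:
    """The modelled quantity: next week's change in ULSP."""
    return ulsp[t + 1] - ulsp[t]
```

Every feature is either a difference or a deviation from a moving average, so no raw price level reaches the model.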
 ### Model
 Ridge regression. Boring on purpose:
 - 435 weekly observations is too few to beat a well-specified linear model
  out-of-sample with gradient boosting or LSTM — those would just fit noise.
 - Interpretable coefficients are essential for the honesty layer
  (the reasoning text describes what the model used).
 Upgrade to a non-linear model **only** if Phase 3 backtest demonstrates the
 linear model is missing real structure.
 ### Training and evaluation split
 - Train on weeks 1–305 (~70%).
 - Evaluate on weeks 306–435 (~30%) with rolling-origin cross-validation
  (single-split would overfit hyperparameters to one window).
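One way to read "rolling-origin" here is an expanding window: for each evaluation week, fit on every earlier week and predict one step ahead. The sketch below assumes that reading and 1-based week numbering; the function name is hypothetical.

```python
from typing import Iterator, Tuple


def rolling_origin_splits(n_weeks: int = 435,
                          first_eval_week: int = 306) -> Iterator[Tuple[range, int]]:
    """Yield (training weeks, evaluation week) pairs with an expanding window."""
    for eval_week in range(first_eval_week, n_weeks + 1):
        # Train on weeks 1..eval_week-1, then score the single week that follows.
        yield range(1, eval_week), eval_week
```

With the defaults this produces 130 one-step-ahead evaluations, the first trained on weeks 1–305 and the last on weeks 1–434.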
 ### Confidence calibration
 Two-stage calibration:
 1. **Magnitude binning**: bin predictions by predicted `|magnitude|` and
   record the actual hit rate per bin. `ridge_confidence` (published as
   `confidence_score` in the API) reads from this lookup, not from the
   model's raw output.
 2. **Regime flag** — flag any forecast week within ±4 weeks of a known
   duty change. With only one duty change in 435 weeks, statistical
   stratification at n=1 is impossible. Instead:
   - For flagged weeks, halve the calibrated confidence manually.
   - Surface the flag in the reasoning text: *"Recent duty change —
     forecast accuracy is reduced for the next several weeks."*
 This is the only place v1 accepts a hand-tuned guard, and it's there
 because the data can't tell us better.
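The two stages can be sketched as a lookup table plus a manual override. The bin edges, the function names, the choice to return 0 for an empty bin, and rounding the halved value down are illustrative assumptions; Phase 3 would pick the real bins from the backtest.

```python
from bisect import bisect_right
from typing import Iterable, List, Optional, Sequence, Tuple


def build_calibration(records: Iterable[Tuple[float, bool]],
                      edges: Sequence[float]) -> List[Optional[float]]:
    """Hit rate per |magnitude| bin; edges are sorted, the last bin is open-ended."""
    hits = [0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for magnitude, was_correct in records:
        i = bisect_right(edges, abs(magnitude))
        counts[i] += 1
        hits[i] += 1 if was_correct else 0
    return [h / c if c else None for h, c in zip(hits, counts)]


def calibrated_confidence(magnitude: float, table: Sequence[Optional[float]],
                          edges: Sequence[float],
                          flagged_duty_change: bool = False) -> int:
    """Stage 1: read the bin's empirical hit rate. Stage 2: halve if flagged."""
    rate = table[bisect_right(edges, abs(magnitude))]
    if rate is None:
        return 0  # assumption: no backtest evidence in this bin means no confidence
    confidence = round(100 * rate)
    if flagged_duty_change:
        confidence //= 2  # hand-tuned guard; integer halving rounds down
    return confidence
```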
 ---
 ## Methodology — Layer 2
 Pure aggregates. No model.
 ```sql
 -- Local 50km average
 SELECT AVG(price_pence) FROM station_prices_current
 JOIN stations ON station_prices_current.station_id = stations.node_id
 WHERE fuel_type = ? AND <Haversine within 50km of (lat, lng)>;
 -- National average
 SELECT AVG(price_pence) FROM station_prices_current WHERE fuel_type = ?;
 -- Cheapest within 25km
 SELECT stations.*, station_prices_current.price_pence
 FROM station_prices_current
 JOIN stations ON station_prices_current.station_id = stations.node_id
 WHERE fuel_type = ? AND <Haversine within 25km>
 ORDER BY price_pence ASC LIMIT 5;
 -- Supermarket vs major split, locally
 SELECT stations.is_supermarket, AVG(price_pence)
 FROM station_prices_current
 JOIN stations ON station_prices_current.station_id = stations.node_id
 WHERE fuel_type = ? AND <Haversine within 25km>
 GROUP BY stations.is_supermarket;
 ```
 Output is descriptive: "Your area is X p above national average right
 now", "Cheapest near you: {station} at {price}", "Supermarkets near you:
 {avg} vs majors: {avg}". **Never** predictive language.
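The `<Haversine within …>` placeholder in the queries stands for a great-circle distance filter. A minimal in-memory sketch of that filter follows; the function names and the `(lat, lng, price_pence)` tuple shape are assumptions for illustration, and a production version would more likely push the distance test into SQL.

```python
import math
from typing import Iterable, Optional, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def local_avg(stations: Iterable[Tuple[float, float, float]],
              lat: float, lng: float, radius_km: float = 50.0) -> Optional[float]:
    """Average price of stations within radius_km; None if there are none."""
    prices = [price for s_lat, s_lng, price in stations
              if haversine_km(lat, lng, s_lat, s_lng) <= radius_km]
    return sum(prices) / len(prices) if prices else None
```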
 ---
 ## Methodology — Layer 3
 Full gate ordering is in the Architecture section (Layer 3). Summary:
 - Verdict via ordered rule gates, **not** multipliers.
 - `ridge_confidence` is displayed verbatim — never multiplied.
 - Volatility flag and LLM disagreement act as **suppressors with badges**
  (`volatile`, `conflicting`) but never flip direction.
 - `direction == 'flat'` always produces `no_signal`.
 - LLM disagreement only suppresses the verdict when `ridge_confidence < 75`.
  Above 75 the model's call is strong enough to stand even with a news-scan
  disagreement (the LLM is hard-capped at 75 confidence anyway, so it
  can't out-confidence the ridge model — only flag a tension).
 Local position from Layer 2 modifies urgency wording only:
 - If user's local average is materially above national (>2p), and Layer 1
  says "rising", urgency increased ("fill up now, *especially* in your area").
 - Layer 2 never flips Layer 1's direction.
 ---
 ## Methodology — Layer 4 (LLM news overlay)
 Single scheduled call daily at 07:00 UK. Additional event-driven calls
 are queued by Layer 5 when the volatility flag flips ON, with a 4-hour
 cooldown enforced in code (skip the queue if the most recent
 `llm_overlays.ran_at` is within 4 hours).
 **Brent input** (`brent_recent_14_days`) is optional — passed as `null`
 until Phase 7 backfills `brent_prices`. Phase 8 cannot ship before
 Phase 7 — explicit dependency.
 ### Request shape (JSON)
 ```json
 {
  "input": {
    "ulsp_recent_8_weeks": [...],
    "brent_recent_14_days": [...],
    "current_week_of_year": 18,
    "days_to_next_bank_holiday": 5,
    "duty_pence": 52.95,
    "ridge_model_says": {
      "direction": "falling",
      "confidence": 68,
      "magnitude_pence": -0.4
    }
  },
  "ask": "Search recent news for oil-supply, OPEC, refinery, shipping, sanctions, geopolitical events affecting UK retail fuel prices over the next 1-2 weeks. Reply ONLY in the schema below."
 }
 ```
 ### Response shape (JSON, enforced)
 ```json
 {
  "direction": "rising | falling | flat",
  "confidence": 0,
  "reasoning_short": "1-2 sentences",
  "events_cited": [
    {"headline": "...", "source": "...", "url": "...", "impact": "rising|falling|neutral"}
  ],
  "agrees_with_ridge": true,
  "major_impact_event": false
 }
 ```
 ### Code-level guards (not in the prompt)
 1. **Cap `confidence` at 75.** Web-searched LLMs are systematically overconfident.
 2. **Reject the response if `events_cited` is empty.** Forces the LLM to
   ground its call in something checkable, not vibes.
 3. **Verify each `url` in `events_cited` is reachable** before storing.
   Catches hallucinated citations. Failed URLs blank the citation but
   don't reject the call (newer URLs sometimes 404 briefly).
 4. **Layer 4 does NOT mutate `volatility_regimes.active`.** It writes its
   row to `llm_overlays` (with `major_impact_event` + verified URLs) and
   that's it. Layer 5's hourly cron picks up the new row and decides
   whether to flip the flag.
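Guards 1 to 3 could be applied in one pass over the parsed response, roughly as below. This is a sketch under assumptions: `apply_overlay_guards`, the `url_verified` and `verified_url_count` fields, and the injected `url_is_reachable` callable are hypothetical names, and the sketch marks failed URLs as unverified rather than deleting them so the row stays auditable.

```python
from typing import Callable, Optional

LLM_CONFIDENCE_CAP = 75


def apply_overlay_guards(response: dict,
                         url_is_reachable: Callable[[str], bool]) -> Optional[dict]:
    """Return the guarded response, or None if it must be rejected."""
    events = response.get("events_cited") or []
    if not events:
        return None  # guard 2: no citations, no overlay
    checked = []
    for event in events:
        url = event.get("url")
        # Guard 3: a failed check marks the citation but does not reject the call.
        verified = bool(url) and bool(url_is_reachable(url))
        checked.append({**event, "url_verified": verified})
    guarded = dict(response)
    # Guard 1: hard cap, applied in code regardless of what the model returned.
    guarded["confidence"] = min(int(response.get("confidence", 0)), LLM_CONFIDENCE_CAP)
    guarded["events_cited"] = checked
    guarded["verified_url_count"] = sum(1 for e in checked if e["url_verified"])
    return guarded
```

Passing the reachability check in as a callable keeps the guard logic testable without network access.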
 ### How Layer 3 uses it
 - LLM agrees → no gating effect; `agrees` badge shown next to the verdict
  ("News scan agrees, citing {event}").
 - LLM disagrees AND `ridge_confidence < 75` → **gate 4 fires**: verdict
  forced to `no_signal` with the `conflicting` badge.
 - LLM disagrees AND `ridge_confidence >= 75` → no suppression; the
  disagreement is shown as a badge but the model's strong call stands.
 - LLM neutral / flat → no gating effect.
 - Direction is never flipped by the LLM.
 ---
 ## Methodology — Layer 5 (volatility regime detector)
 Hourly cron. **Sole owner** of `volatility_regimes.active`. Reads four
 signals, OR-combined:
 1. **Brent move** — close-to-close daily Brent move > 3% on FRED
   `DCOILBRENTEU`. FRED publishes with a one-day lag (today's value is
   yesterday's settle), so the trigger reflects the most recent settled
   day. Sufficient for v1 — we don't have a real-time Brent feed.
 2. **LLM major-impact flag** — most recent `llm_overlays` row has
   `major_impact_event = true` AND at least one verified URL.
 3. **Station churn** — *gated until ≥180 days of stable polling.* The
   trigger fires when the last-24h % of stations updating price exceeds
   1.5× the 30-day rolling baseline. With only 75 days of uneven polling
   (Jan 16 → May 1) the baseline is meaningless — sample-mix variance
   would dominate any real shock signal. The trigger is implemented but
   disabled in code via a feature flag; flip it on once `station_prices`
   has 180+ continuous days.
 4. **Manual `watched_events`** — a row covering today. Lets you flag
   known geopolitical periods manually (e.g. "Iran tensions Apr–May 2026").
 When the flag flips on:
 - An event-driven Layer 4 LLM refresh is queued (skipped if the most
  recent `llm_overlays.ran_at` is within 4 hours — cooldown).
 - **Layer 3's gate 3 fires**: verdict forced to `no_signal` with the
  `volatile` badge for as long as the flag stays on.
 - Reasoning text appended: *"Volatility detected ({trigger label}) — this
  forecast may be stale within days."*
 When it flips off:
 - Verdict returns to whatever the gates produce on the unchanged
  `ridge_confidence` (no multiplier reset needed — there are no multipliers).
 - Badge cleared.
 - The next morning's 07:00 LLM call still runs (always does); no extra
  refreshes are queued by Layer 5.
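The OR-combination and the cooldown check can be sketched as two small functions. The function and parameter names are assumptions, as is treating the Brent trigger as an absolute move in either direction; the trigger labels reuse the `volatility_regimes.trigger` enum values.

```python
from datetime import datetime, timedelta
from typing import List, Optional

BRENT_MOVE_PCT = 3.0
CHURN_MULTIPLIER = 1.5
LLM_COOLDOWN = timedelta(hours=4)


def active_triggers(brent_prev_close: float, brent_last_close: float,
                    llm_major_impact_event: bool, llm_verified_url_count: int,
                    churn_last_24h: float, churn_30d_baseline: float,
                    churn_trigger_enabled: bool,
                    watched_event_today: bool) -> List[str]:
    """Return every trigger that fires; the regime flag is on if any does."""
    triggers = []
    # Assumption: a move in either direction counts.
    move_pct = abs(brent_last_close - brent_prev_close) / brent_prev_close * 100.0
    if move_pct > BRENT_MOVE_PCT:
        triggers.append("brent_move")
    if llm_major_impact_event and llm_verified_url_count >= 1:
        triggers.append("llm_event")
    # The churn trigger stays behind a feature flag until polling history is long enough.
    if churn_trigger_enabled and churn_last_24h > CHURN_MULTIPLIER * churn_30d_baseline:
        triggers.append("station_churn")
    if watched_event_today:
        triggers.append("manual")
    return triggers


def should_queue_llm_refresh(now: datetime, last_ran_at: Optional[datetime]) -> bool:
    """Queue the event-driven Layer 4 call only outside the 4-hour cooldown."""
    return last_ran_at is None or now - last_ran_at > LLM_COOLDOWN
```

Returning the list of triggers rather than a bare boolean gives the reasoning text its `{trigger label}` for free.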
 ---
 ## Schema deltas
 ### Add
 ```
 weekly_forecasts
  id                   BIGINT PK
  forecast_for         DATE                — Monday the forecast covers
  model_version        VARCHAR(32)         — links back to backtests row
  direction            ENUM('rising','falling','flat')
  magnitude_pence      SMALLINT            — predicted Δ × 100, signed
  ridge_confidence     TINYINT UNSIGNED    — 0..100, calibrated from backtested residuals. Displayed verbatim. Layer 3 gates may suppress the verdict but never modify this number.
  flagged_duty_change  BOOLEAN             — true if forecast is within ±4 weeks of a duty change (avoids collision with Layer 5's volatility_regimes)
  reasoning            TEXT                — generated from features actually used
  generated_at         DATETIME
  UNIQUE (forecast_for, model_version)
  INDEX  (forecast_for, generated_at DESC)
 forecast_outcomes
  forecast_for      DATE
  model_version     VARCHAR(32)
  predicted_class   ENUM('rising','falling','flat')
  actual_class      ENUM('rising','falling','flat')
  correct           BOOLEAN
  abs_error_pence   SMALLINT UNSIGNED
  resolved_at       DATETIME
  PRIMARY KEY (forecast_for, model_version)
 backtests
  id                 BIGINT PK
  model_version      VARCHAR(32) UNIQUE
  features_json      JSON                — feature spec
  train_start        DATE
  train_end          DATE
  eval_start         DATE
  eval_end           DATE
  directional_accuracy DECIMAL(5,2)
  mae_pence          DECIMAL(5,2)
  calibration_table  JSON                — {bin_low..bin_high → empirical_hit_rate}
  leak_suspected     BOOLEAN             — secondary smell test: true if directional_accuracy > 75. Primary leak detection is structural (see Backtest section).
  ran_at             DATETIME
 llm_overlays
  id                  BIGINT PK
  ran_at              DATETIME
  forecast_for_week   DATE                — which weekly forecast it overlays
  direction           ENUM('rising','falling','flat')
  confidence          TINYINT UNSIGNED    — capped 75 in code
  reasoning           TEXT
  events_json         JSON                — cited events with verified URLs
  agrees_with_ridge   BOOLEAN
  major_impact_event  BOOLEAN
  volatility_flag_on  BOOLEAN             — was the regime flag on at run time
  search_used         BOOLEAN
  INDEX (forecast_for_week, ran_at)
 volatility_regimes
  id                  BIGINT PK
  flipped_on_at       DATETIME
  flipped_off_at      DATETIME NULL
  trigger             ENUM('brent_move','llm_event','station_churn','manual')
  trigger_detail      TEXT                — e.g. "Brent +4.2% close-to-close"
  active              BOOLEAN
 watched_events
  id                  BIGINT PK
  label               VARCHAR(128)
  starts_at           DATETIME
  ends_at             DATETIME
  notes               TEXT
 ```
 ### Keep
 - `weekly_pump_prices` — already loaded, source of truth
 - `stations`, `station_prices_current` — for Layer 2
 - `station_prices` — keep collecting forward, not modelled in v1
 ### Deprecate (delete after Layer 1 ships)
 - `price_predictions` — old LLM/EWMA store, replaced by `weekly_forecasts`
 The current six-signal aggregator (`NationalFuelPredictionService` and
 `app/Services/Prediction/Signals/*`) is **fully replaced**, not extended.
 Same JSON output keys (`predicted_direction`, `confidence_score`,
 `action`, `reasoning`) so the Vue frontend doesn't break — engine swapped,
 contract preserved.
 ---
 ## Implementation phases (each ships something working)
 | Phase | Scope | Ships |
 |---|---|---|
 | **1. Backtest harness** | `BacktestRunner` service + `backtests` table. Takes a model class, train/eval split, returns directional accuracy + MAE + calibration curve. **Structural leak detection** built in (per-feature source-timestamp check vs target Monday); accuracy>75% smell test as secondary. | A way to *prove* any future model works before shipping it. |
 | **2. Naive baseline** | "Predict next week = this week" implemented as a model class. Run through harness. | A floor: any future model must beat this. |
 | **3. v1 ridge model** | Features above (incl. mean-reversion term), trained once, persisted with `model_version`. `WeeklyForecastService` runs it. Backtest must clear the acceptance gate. | First real forecast. Backtested numbers visible. |
 | **4. Live wiring** | Replace `NationalFuelPredictionService` internals with a thin adapter delegating to `WeeklyForecastService`. Same API shape, new engine. | Frontend keeps working, predictions now from the new model. |
 | **5. Local snapshot** | `LocalSnapshotService` — pure aggregates. Wire into `/api/stations` payload alongside the headline forecast. | "Your area" descriptive cards. |
 | **6. Honesty layer** | Reasoning generator describes *what the model used*: lag values, season, holiday flag. Shows backtest accuracy badge. Returns explicit "not enough data" when confidence < 40. Surfaces the duty-change-adjacent flag when set. | The "no BS" framing. |
 | **7. Brent backfill + daily refresh** | One FRED call (2018→today, ~2,150 daily rows). Daily refresh cron at **06:30 UK** (must complete before Phase 8's 07:00 LLM call — sequenced so the LLM has fresh Brent context). Used by Phase 9's volatility detector and as a feature option for future model iterations (only added to the ridge model if backtested lift is ≥3 percentage points directional accuracy). | Daily Brent in DB. Foundation for volatility + LLM context. |
 | **8. LLM news overlay** | `LlmOverlayService` — single scheduled call at **07:00 UK** (after Brent refresh). Plus event-driven calls when Layer 5 flips the volatility flag on, with 4h cooldown. JSON in / JSON out, web search enabled, results stored in `llm_overlays`. Feeds Layer 3's gate 4 (suppress when LLM disagrees AND ridge_confidence < 75) and the `agrees`/`conflicting` badges. URL-verification + empty-citation rejection enforced in code. **Depends on Phase 7.** | News-aware verdict suppression and badge on top of the calibrated ridge baseline. |
 | **9. Volatility regime detector** | `VolatilityRegimeService` — hourly cron, sole owner of `volatility_regimes.active`. OR-combines four triggers: Brent move > 3%, LLM `major_impact_event`, station churn > 1.5× baseline (**gated until ≥180 days of stable polling**), `watched_events` row covering today. Fires Layer 3's gate 3 (verdict → `no_signal` with `volatile` badge) and the event-driven Layer 4 refresh. | The intra-week safety net for oil shocks. |
 ---
 ## Backtest acceptance gates (Phase 3 → Phase 4)
 | Backtest result | Action |
 |---|---|
 | < 60% directional accuracy | Features are wrong. Stay in Phase 3, don't ship. |
 | 60–62% | Marginal. One feature iteration, then re-evaluate. |
 | **62–68%** | **Ship.** Realistic target for UK weekly pump direction without Brent. |
 | 68–75% | Excellent. Ship and watch closely. |
 | > 75% | **Stop.** Run the structural leak detector. Almost certainly time leakage (e.g. using `t+1` info accidentally in `t` features). The accuracy threshold is a secondary smell test, not the primary detector. |
 | MAE > 1.0p / litre | Features are noisy. Refit before shipping. |
 | MAE 0.4–0.7p / litre | Target range. |
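Encoded as a function, the table might look like the sketch below. The action labels are hypothetical, each band's lower bound is treated as inclusive, and checking the leak threshold and the MAE ceiling before the accuracy bands is an assumption, since the table does not say which row takes precedence.

```python
def acceptance_action(directional_accuracy: float, mae_pence: float) -> str:
    """Map a backtest result (accuracy in percent, MAE in pence) to an action."""
    if directional_accuracy > 75.0:
        return "stop_run_leak_detector"
    if mae_pence > 1.0:
        return "refit"
    if directional_accuracy < 60.0:
        return "do_not_ship"
    if directional_accuracy < 62.0:
        return "iterate_once"
    if directional_accuracy < 68.0:
        return "ship"
    return "ship_and_watch"
```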
 ### Structural leak detection (primary)
 Built into the backtest harness. For every (training_week, feature_value)
 pair, the harness verifies the data source's effective timestamp is
 **strictly before** the target Monday. Any feature whose source timestamp
 is on or after the target week is treated as leakage and the backtest
 fails fast. This is independent of accuracy — it catches leakage even
 when it doesn't translate into suspiciously high accuracy.
 The `> 75% accuracy` row is a secondary smell test for leakage modes the
 structural check missed (e.g. label leakage via a downstream computed
 column). Primary defence is the timestamp check. These numbers are
 encoded in the harness as assertions, not aspirations.
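The timestamp check itself is small. A minimal sketch, assuming each feature carries the effective date of its source data; `LeakageError`, `leaking_features` and `assert_no_leakage` are hypothetical names, not the harness's actual API.

```python
from datetime import date
from typing import Dict, List


class LeakageError(Exception):
    """Raised when a feature's source data is not strictly before the target."""


def leaking_features(target_monday: date,
                     feature_sources: Dict[str, date]) -> List[str]:
    """Names of features whose source date is on or after the target Monday."""
    return sorted(name for name, source_date in feature_sources.items()
                  if source_date >= target_monday)


def assert_no_leakage(target_monday: date,
                      feature_sources: Dict[str, date]) -> None:
    """Fail fast, independent of how accurate the backtest turned out."""
    leaks = leaking_features(target_monday, feature_sources)
    if leaks:
        raise LeakageError(
            f"features not strictly before {target_monday}: {', '.join(leaks)}")
```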
 ---
 ## Honesty rules — non-negotiables
 1. Backtest accuracy is **published in the UI**. The model wears its track
   record on its sleeve.
 2. Below 40 confidence, the recommendation is `no_signal` and the reasoning
   says "we don't have enough signal to call it" — explicitly. No filler.
 3. When duty-change-adjacent weeks affect the forecast, surface the flag
   ("forecast may be skewed by recent duty change").
 4. Reasoning text only references features the model actually used — no
   narrative invention. If the mean-reversion term drove the call, say so
   ("Pump prices are 3.1p above their 8-week average, and prices typically
   pull back from that level"). If the seasonality term drove it, say so.
 5. `forecast_outcomes` is populated automatically when the next BEIS week
   lands. Hit rate over the trailing 13 weeks is shown next to the headline.
 6. When the **volatility regime flag** is on, the UI shows the `volatile`
   badge and the trigger (e.g. "Brent up 4.2% yesterday — forecast may be
   stale within days"). Verdict is suppressed visibly via gate 3, never
   silently.
 7. The LLM overlay is **shown separately** from the ridge model, never
   blended. "Model says down (68%); news scan agrees, citing {event}" —
   the `ridge_confidence` number stays calibrated and untouched, while
   LLM and volatility status are presented as their own badges.
 8. LLM citations with unreachable URLs are **dropped from the displayed
   reasoning** but kept in `llm_overlays.events_json` for audit. We never
   show a citation we haven't verified.
 ---
 ## What gets deleted at the end of Phase 4
 - `app/Services/Prediction/Signals/*` (whole directory)
 - `NationalFuelPredictionService` internals (kept as a thin wrapper, then
  renamed when the frontend migration completes)
 - `price_predictions` table — replaced by `weekly_forecasts` (ridge) +
  `llm_overlays` (news layer)
 - `OilPriceService::generatePrediction()`, EWMA/LLM helpers — replaced by
  `LlmOverlayService` (Phase 8) which has a different contract
 - `OilPriceService::fetchBrentPrices()` — kept and **expanded** in Phase 7
  (backfill mode + daily refresh), not deleted
 - `.claude/rules/scoring.md` retired in favour of a fresh
  `.claude/rules/forecasting.md`
 - `.claude/rules/prediction.md` rewritten to match the new architecture
 ---
 ## Open decisions (to confirm before Phase 1)
 - **Forecast cadence** — the *forecast itself* is weekly (matches BEIS
  publication). The *confidence and presentation* update daily via Layer 4
  (LLM) and Layer 5 (volatility regime). This split is deliberate — we
  refuse to fabricate intra-week movement, but we don't pretend a static
  Monday number is reliable on Thursday after a 6% Brent move.
 - **Scope** — drop the six-signal aggregator entirely, confirmed.
 - **API shape** — keep existing JSON output keys so Vue keeps working,
  with the engine swapped under the hood. The original `confidence_score`
  field maps to `ridge_confidence` (calibrated, untouched). Add new
  fields: `volatility` (`{active, trigger}`), `news_overlay`
  (`{direction, agreement, events}`), and `verdict_reason` (which gate
  fired, if any). The verdict itself goes in the existing `action` field.
 - **Brent** — promoted to Phase 7 (was "optional, conditional"). Needed
  for the volatility detector, regardless of whether it's used in the
  ridge model.
 - **LLM** — Anthropic Claude Haiku with web search. Single scheduled call
  at 07:00 UK (after the 06:30 Brent refresh). Plus event-driven refreshes
  when Layer 5 flips the volatility flag on, with a 4h cooldown. No fixed
  afternoon cron — by 13:00 UK, morning users have already made their
  fill-up decisions, so the value is too low to justify the extra noise.
  Hard confidence cap 75. Empty-citation rejection.
 ---
 ## Changelog (substantive design decisions)
 | When | Change | Why |
 |---|---|---|
 | 2026-05-01 v1 | Initial spec — three layers, six-signal aggregator removed, ridge model on BEIS weekly data | Replace incoherent `NationalFuelPredictionService` |
 | 2026-05-01 v2 | Added Layer 4 (LLM news overlay) and Layer 5 (volatility regime detector). Pump prices can move daily during oil shocks; static weekly forecast must be backed by intra-week safety nets. | Iran/Hormuz-style shocks make a Monday-only confidence number stale by Wednesday |
 | 2026-05-01 v3 | **Verdict via rule gates, not multipliers.** `ridge_confidence` displayed verbatim. LLM and volatility presented as badges. `weeks_since_duty_change` removed from features (kept as calibration override only — n=1 can't fit a coefficient). Backtest gate floor lowered 65 → 62 (realistic without Brent). Structural leak detection (per-feature timestamp check) made primary; accuracy>75% demoted to secondary smell test. `weekly_forecasts` PK changed to `(forecast_for, model_version)` to preserve audit on retrain. `forecast_outcomes` made three-class. Layer 5 station-churn trigger gated until ≥180 days of stable polling. | Multipliers obscure calibration. Gates compose cleanly and stay auditable. |
 ---
 ## References
 - Alquist, Kilian, Vigfusson (2013) — *Forecasting the Price of Oil* —
  the academic basis for "no-change baseline beats most structural models
  at <6m horizons" (which is why Phase 2 matters as a hard floor).
 - BEIS *Weekly road fuel prices* CSV — the 435-week training set.
 - `.claude/rules/scoring.md`, `.claude/rules/prediction.md` — the two
  inconsistent rule files this spec replaces.