From 25cf022964ce9dd2be31bdd62db927c3ed3e8177 Mon Sep 17 00:00:00 2001
From: Ovidiu U <nonreal@gmail.com>
Date: Fri, 1 May 2026 13:23:10 +0100
Subject: [PATCH] =?UTF-8?q?feat:=20add=20prediction=20rebuild=20design=20s?=
 =?UTF-8?q?pec=20=E2=80=94=20Layer=201=20ridge=20model,=20LLM=20news=20ove?=
 =?UTF-8?q?rlay,=20volatility=20regime=20detector?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Documents complete replacement of six-signal aggregator with calibrated
ridge forecaster trained on 435 weeks of BEIS pump prices. Five-layer
architecture: weekly baseline (Layer 1), local snapshot (Layer 2),
rule-gated verdict merger (Layer 3), daily LLM news
---
 .../2026-05-01-prediction-rebuild-design.md   | 618 ++++++++++++++++++
 1 file changed, 618 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-05-01-prediction-rebuild-design.md

diff --git a/docs/superpowers/specs/2026-05-01-prediction-rebuild-design.md b/docs/superpowers/specs/2026-05-01-prediction-rebuild-design.md
new file mode 100644
index 0000000..ba264e8
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-01-prediction-rebuild-design.md
@@ -0,0 +1,618 @@
+# Prediction Rebuild — Design Spec
+
+## Context
+
+The current prediction service (`NationalFuelPredictionService` + six signal
+classes) produces output the user has repeatedly described as "doesn't make
+sense": headlines that contradict their own reasoning text, weights that
+nobody can defend a number on, and confidence values that aren't grounded in
+any track record. Two earlier docs (`.claude/rules/scoring.md`, `.claude/rules/prediction.md`)
+disagree on the weights of the same signals, which is itself evidence that
+the design has drifted.
+
+This spec replaces the entire prediction stack from scratch around the
+historical data we actually have, with a model whose confidence values are
+calibrated against its own backtested track record.
+
+Goals:
+- A "fill up now or wait?" call honest about uncertainty.
+- Confidence values calibrated against backtested residuals — "70%" actually
+  means "in 7 of every 10 cases like this, the model called direction right".
+- Simple enough to debug a year from now.
+- Remove the six-signal aggregator entirely.
+- Recognise that pump prices, while *measured* weekly by BEIS, can *move* daily
+  during oil shocks (Iran, OPEC surprise cuts, Hormuz disruption). The static
+  weekly forecast must be backed by a daily news/event overlay so we can flag
+  staleness in real time rather than pretend a Monday number is still valid on
+  Thursday after a 6% Brent move.
+
+---
+
+## Inputs (audited 2026-05-01)
+
+| Source | Status | Use in v1 |
+|---|---|---|
+| `weekly_pump_prices` | 435 weeks, all Mondays, 0 outliers, 1 duty change (Mar 2022, 57.95p → 52.95p), VAT stable at 20% | **Foundation** — train Layer 1 |
+| `station_prices_current` | ~7,550 stations × e10, ~7,620 × b7_standard | **Layer 2** — descriptive snapshot |
+| `stations` | 7,747 stations, 1,989 supermarkets, lat/lng | Layer 2 |
+| `station_prices` | 75 days of changes since 2026-01-16, sample mix uneven per day | Not modelled in v1, but **used by the volatility regime detector** as a churn indicator (% stations changing price / day vs 30-day baseline). |
+| `brent_prices` | 30 days only | **Backfilled in Phase 7** (8 years from FRED, single API call). Used as a Brent-move volatility trigger and as fuel for the daily LLM overlay. |
+
+The Fuel Finder API has been confirmed empirically to have **no historical
+archive** — `effective-start-timestamp` is a station-level filter on current
+prices, not a time-window query. Per-station deep history can only accrue
+forward from the date polling started.
+
+---
+
+## Architecture — five thin layers
+
+### Layer 1 — National weekly forecaster (predictive, calibrated)
+
+Trained once weekly on `weekly_pump_prices`. Output:
+
+- `direction ∈ {rising, falling, flat}`
+- `magnitude_pence` — predicted Δ price next week
+- `ridge_confidence` (0–100) — calibrated from backtested residuals, not
+  from the model's raw output
+
+This is the **quantitative baseline**. It updates only when the BEIS Monday
+publication arrives (so the *forecast itself* changes weekly), but its
+*displayed confidence* (Layer 3) is adjusted in real time by Layers 4 and 5.
+
+`direction = flat` whenever `|magnitude_pence| < FLAT_THRESHOLD`. Phase 3
+picks `FLAT_THRESHOLD` from the backtest residual distribution; the
+starting value is **0.2p / litre**.
+
+### Layer 2 — Local snapshot (descriptive, NOT predictive)
+
+Pure SQL aggregates against `station_prices_current` + Haversine on
+`stations.lat/lng`. No ML, no history, no surprises:
+
+- `local_avg_50km(fuel_type, lat, lng)`
+- `national_avg(fuel_type)`
+- `cheapest_within(km, fuel_type, lat, lng)`
+- `supermarket_avg_local`, `major_avg_local`, gap
+
+Layer 2 never speaks about the future. It describes the present.
+
+### Layer 3 — Verdict merger (rule-based gates, no multipliers)
+
+Single user-facing verdict ∈ {`fill_now`, `wait`, `no_signal`}. The
+displayed confidence number is `ridge_confidence` itself, **untouched**.
+LLM agreement and volatility status are shown as separate **badges**, not
+blended into the number. Honesty over smoothing.
+
+Gates evaluated in order, first match wins:
+
+```
+1. direction == 'flat'                                 → no_signal
+2. ridge_confidence < 40                               → no_signal
+3. volatility_regime active                            → no_signal  (badge: volatile)
+4. LLM disagrees AND ridge_confidence < 75             → no_signal  (badge: conflicting)
+5. rising  AND ridge_confidence >= 70                  → fill_now
+6. falling AND ridge_confidence >= 70                  → wait
+7. otherwise (40 <= conf < 70, no veto from 3 or 4)    → dashboard-only
+```
+
+Why gates, not multipliers:
+
+- A multiplied confidence number is a black-box blend that the user can't
+  audit. A 70% that used to be 90% before today's volatility hit looks
+  identical to a 70% that's been calibrated all along.
+- Gates compose cleanly. Each rule has one job and is independently
+  testable.
+- The verdict is binary anyway (notify / don't / silent). Smoothing
+  confidence under the hood doesn't help that decision — it only obscures it.
+
+Layer 2 affects **urgency wording only** ("fill up now, *especially* in
+your area at 2p above national"). It never changes the verdict. Neither
+does Layer 4 or Layer 5 — they can suppress (gate 3, 4) but never flip
+the direction.
+
+### Layer 4 — Daily LLM news overlay (qualitative, news-aware)
+
+**Single scheduled call at 07:00 UK.** Plus an event-driven refresh when
+Layer 5's volatility flag flips ON (with a 4-hour cooldown so the same
+event doesn't trigger repeatedly).
+
+JSON in, JSON out. Calls Claude Haiku with web search enabled, asks for
+direction + confidence + cited events with URLs. Stored in a new
+`llm_overlays` table.
+
+Layer 4 is **read-only with respect to the volatility flag**. It writes
+its result row; only Layer 5 mutates `volatility_regimes.active`.
+
+LLM confidence is hard-capped at 75 in code (web-searched LLMs are
+systematically overconfident). Calls without `events_cited` are rejected.
+
+### Layer 5 — Volatility regime detector (intra-week safety net)
+
+Hourly cron. **Sole owner** of the `volatility_regimes.active` flag.
+Reads four signals, OR-combined:
+
+1. Daily Brent move > 3% close-to-close (FRED `DCOILBRENTEU`, Phase 7).
+2. Most recent `llm_overlays.major_impact_event = true` AND at least one
+   verified URL.
+3. `station_prices` daily churn rate > 1.5× its 30-day baseline.
+4. A `watched_events` row covering today (manually flagged geopolitical
+   periods).
+
+When the flag flips on:
+- An event-driven LLM refresh is queued (Layer 4) if last run was > 4h ago.
+- **Layer 3's gate 3 fires**: verdict forced to `no_signal` with the
+  `volatile` badge.
+- The reasoning text appended: *"Volatility detected ({trigger}) — this
+  forecast may be stale within days."*
+
+When it flips off:
+- Verdict returns to whatever the gates produce on the unchanged
+  `ridge_confidence` (no multiplier to reset — there are none).
+- Badge cleared.
+- Next morning's 07:00 LLM call still runs (it always runs); no extra
+  refreshes are queued.
+
+Layer 5 never changes Layer 1's *direction*. It only suppresses the
+verdict via gate 3.
+
+---
+
+## Methodology — Layer 1
+
+### Target
+
+```
+ΔULSP[t+1] = ULSP[t+1] − ULSP[t]
+```
+
+We model the **change**, not the level. UK pump prices are non-stationary,
+so regressing on levels gives spurious R² and useless coefficients.
+Differencing makes the series stationary.
+
+### Features (all stationary)
+
+| Feature | Notes |
+|---|---|
+| `Δulsp_lag_0`, `Δulsp_lag_1`, `Δulsp_lag_3` | 1w / 2w / 4w momentum |
+| `Δulsd_lag_0` | Diesel cross-signal as a *change* |
+| `ulsp[t] − ma8[t]` | **Mean-reversion term** — gap between current price and 8-week MA. Single most useful feature for 1-week-ahead UK pump forecast. |
+| `week_of_year_sin`, `week_of_year_cos` | Cyclic seasonality encoding |
+| `is_pre_bank_holiday` | Boolean, within 7 days of UK bank holiday |
+
+The level only enters as the deviation from MA-8 (itself stationary).
+That's the only way levels are allowed in.
+
+**Duty change is NOT a feature.** With one event in 435 weeks, n=1 cannot
+fit a meaningful coefficient. Instead, duty-change-adjacent weeks (±4
+weeks of a known change) are handled in the **calibration override**
+(see below) — confidence is halved and the regime flag is surfaced in
+the reasoning text. A regime can be flagged. A coefficient cannot be
+trained from one observation.
+
+### Model
+
+Ridge regression. Boring on purpose:
+
+- 435 weekly observations is too few to beat a well-specified linear model
+  out-of-sample with gradient boosting or LSTM — those would just fit noise.
+- Interpretable coefficients are essential for the honesty layer
+  (the reasoning text describes what the model used).
+
+Upgrade to a non-linear model **only** if Phase 3 backtest demonstrates the
+linear model is missing real structure.
+
+### Training and evaluation split
+
+- Train on weeks 1–305 (~70%).
+- Evaluate on weeks 306–435 (~30%) with rolling-origin cross-validation
+  (single-split would overfit hyperparameters to one window).
+
+### Confidence calibration
+
+Two-stage calibration:
+
+1. **Magnitude binning** — bin predictions by predicted `|magnitude|` and
+   record actual hit rate per bin. The published `confidence_score` reads
+   from this lookup, not from the model's raw output.
+2. **Regime flag** — flag any forecast week within ±4 weeks of a known
+   duty change. With only one duty change in 435 weeks, statistical
+   stratification at n=1 is impossible. Instead:
+   - For flagged weeks, halve the calibrated confidence manually.
+   - Surface the flag in the reasoning text: *"Recent duty change —
+     forecast accuracy is reduced for the next several weeks."*
+
+This is the only place v1 accepts a hand-tuned guard, and it's there
+because the data can't tell us better.
+
+---
+
+## Methodology — Layer 2
+
+Pure aggregates. No model.
+
+```sql
+-- Local 50km average
+SELECT AVG(price_pence) FROM station_prices_current
+JOIN stations ON station_prices_current.station_id = stations.node_id
+WHERE fuel_type = ? AND <Haversine within 50km of (lat, lng)>;
+
+-- National average
+SELECT AVG(price_pence) FROM station_prices_current WHERE fuel_type = ?;
+
+-- Cheapest within 25km
+SELECT stations.*, station_prices_current.price_pence
+FROM station_prices_current
+JOIN stations ON station_prices_current.station_id = stations.node_id
+WHERE fuel_type = ? AND <Haversine within 25km>
+ORDER BY price_pence ASC LIMIT 5;
+
+-- Supermarket vs major split, locally
+SELECT stations.is_supermarket, AVG(price_pence)
+FROM station_prices_current
+JOIN stations ON station_prices_current.station_id = stations.node_id
+WHERE fuel_type = ? AND <Haversine within 25km>
+GROUP BY stations.is_supermarket;
+```
+
+Output is descriptive: "Your area is X p above national average right
+now", "Cheapest near you: {station} at {price}", "Supermarkets near you:
+{avg} vs majors: {avg}". **Never** predictive language.
+
+---
+
+## Methodology — Layer 3
+
+Full gate ordering is in the Architecture section (Layer 3). Summary:
+
+- Verdict via ordered rule gates, **not** multipliers.
+- `ridge_confidence` is displayed verbatim — never multiplied.
+- Volatility flag and LLM disagreement act as **suppressors with badges**
+  (`volatile`, `conflicting`) but never flip direction.
+- `direction == 'flat'` always produces `no_signal`.
+- LLM disagreement only suppresses the verdict when `ridge_confidence < 75`.
+  Above 75 the model's call is strong enough to stand even with a news-scan
+  disagreement (the LLM is hard-capped at 75 confidence anyway, so it
+  can't out-confidence the ridge model — only flag a tension).
+
+Local position from Layer 2 modifies urgency wording only:
+
+- If user's local average is materially above national (>2p), and Layer 1
+  says "rising", urgency increased ("fill up now, *especially* in your area").
+- Layer 2 never flips Layer 1's direction.
+
+---
+
+## Methodology — Layer 4 (LLM news overlay)
+
+Single scheduled call daily at 07:00 UK. Additional event-driven calls
+are queued by Layer 5 when the volatility flag flips ON, with a 4-hour
+cooldown enforced in code (skip the queue if the most recent
+`llm_overlays.ran_at` is within 4 hours).
+
+**Brent input** (`brent_recent_14_days`) is optional — passed as `null`
+until Phase 7 backfills `brent_prices`. Phase 8 cannot ship before
+Phase 7 — explicit dependency.
+
+### Request shape (JSON)
+
+```json
+{
+  "input": {
+    "ulsp_recent_8_weeks": [...],
+    "brent_recent_14_days": [...],
+    "current_week_of_year": 18,
+    "days_to_next_bank_holiday": 5,
+    "duty_pence": 52.95,
+    "ridge_model_says": {
+      "direction": "down",
+      "confidence": 68,
+      "magnitude_pence": -0.4
+    }
+  },
+  "ask": "Search recent news for oil-supply, OPEC, refinery, shipping, sanctions, geopolitical events affecting UK retail fuel prices over the next 1-2 weeks. Reply ONLY in the schema below."
+}
+```
+
+### Response shape (JSON, enforced)
+
+```json
+{
+  "direction": "rising | falling | flat",
+  "confidence": 0,
+  "reasoning_short": "1-2 sentences",
+  "events_cited": [
+    {"headline": "...", "source": "...", "url": "...", "impact": "rising|falling|neutral"}
+  ],
+  "agrees_with_ridge": true,
+  "major_impact_event": false
+}
+```
+
+### Code-level guards (not in the prompt)
+
+1. **Cap `confidence` at 75.** Web-searched LLMs are systematically overconfident.
+2. **Reject the response if `events_cited` is empty.** Forces the LLM to
+   ground its call in something checkable, not vibes.
+3. **Verify each `url` in `events_cited` is reachable** before storing.
+   Catches hallucinated citations. Failed URLs blank the citation but
+   don't reject the call (newer URLs sometimes 404 briefly).
+4. **Layer 4 does NOT mutate `volatility_regimes.active`.** It writes its
+   row to `llm_overlays` (with `major_impact_event` + verified URLs) and
+   that's it. Layer 5's hourly cron picks up the new row and decides
+   whether to flip the flag.
+
+### How Layer 3 uses it
+
+- LLM agrees → no gating effect; `agrees` badge shown next to the verdict
+  ("News scan agrees, citing {event}").
+- LLM disagrees AND `ridge_confidence < 75` → **gate 4 fires**: verdict
+  forced to `no_signal` with the `conflicting` badge.
+- LLM disagrees AND `ridge_confidence >= 75` → no suppression; the
+  disagreement is shown as a badge but the model's strong call stands.
+- LLM neutral / flat → no gating effect.
+- Direction is never flipped by the LLM.
+
+---
+
+## Methodology — Layer 5 (volatility regime detector)
+
+Hourly cron. **Sole owner** of `volatility_regimes.active`. Reads four
+signals, OR-combined:
+
+1. **Brent move** — close-to-close daily Brent move > 3% on FRED
+   `DCOILBRENTEU`. FRED publishes with a one-day lag (today's value is
+   yesterday's settle), so the trigger reflects the most recent settled
+   day. Sufficient for v1 — we don't have a real-time Brent feed.
+2. **LLM major-impact flag** — most recent `llm_overlays` row has
+   `major_impact_event = true` AND at least one verified URL.
+3. **Station churn** — *gated until ≥180 days of stable polling.* The
+   trigger fires when the last-24h % of stations updating price exceeds
+   1.5× the 30-day rolling baseline. With only 75 days of uneven polling
+   (Jan 16 → May 1) the baseline is meaningless — sample-mix variance
+   would dominate any real shock signal. The trigger is implemented but
+   disabled in code via a feature flag; flip it on once `station_prices`
+   has 180+ continuous days.
+4. **Manual `watched_events`** — a row covering today. Lets you flag
+   known geopolitical periods manually (e.g. "Iran tensions Apr–May 2026").
+
+When the flag flips on:
+
+- An event-driven Layer 4 LLM refresh is queued (skipped if the most
+  recent `llm_overlays.ran_at` is within 4 hours — cooldown).
+- **Layer 3's gate 3 fires**: verdict forced to `no_signal` with the
+  `volatile` badge for as long as the flag stays on.
+- Reasoning text appended: *"Volatility detected ({trigger label}) — this
+  forecast may be stale within days."*
+
+When it flips off:
+- Verdict returns to whatever the gates produce on the unchanged
+  `ridge_confidence` (no multiplier reset needed — there are no multipliers).
+- Badge cleared.
+- The next morning's 07:00 LLM call still runs (always does); no extra
+  refreshes are queued by Layer 5.
+
+---
+
+## Schema deltas
+
+### Add
+
+```
+weekly_forecasts
+  id                   BIGINT PK
+  forecast_for         DATE                — Monday the forecast covers
+  model_version        VARCHAR(32)         — links back to backtests row
+  direction            ENUM('rising','falling','flat')
+  magnitude_pence      SMALLINT            — predicted Δ × 100, signed
+  ridge_confidence     TINYINT UNSIGNED    — 0..100, calibrated from backtested residuals. Displayed verbatim. Layer 3 gates may suppress the verdict but never modify this number.
+  flagged_duty_change  BOOLEAN             — true if forecast is within ±4 weeks of a duty change (avoids collision with Layer 5's volatility_regimes)
+  reasoning            TEXT                — generated from features actually used
+  generated_at         DATETIME
+  UNIQUE (forecast_for, model_version)
+  INDEX  (forecast_for, generated_at DESC)
+
+forecast_outcomes
+  forecast_for      DATE
+  model_version     VARCHAR(32)
+  predicted_class   ENUM('rising','falling','flat')
+  actual_class      ENUM('rising','falling','flat')
+  correct           BOOLEAN
+  abs_error_pence   SMALLINT UNSIGNED
+  resolved_at       DATETIME
+  PRIMARY KEY (forecast_for, model_version)
+
+backtests
+  id                 BIGINT PK
+  model_version      VARCHAR(32) UNIQUE
+  features_json      JSON                — feature spec
+  train_start        DATE
+  train_end          DATE
+  eval_start         DATE
+  eval_end           DATE
+  directional_accuracy DECIMAL(5,2)
+  mae_pence          DECIMAL(5,2)
+  calibration_table  JSON                — {bin_low..bin_high → empirical_hit_rate}
+  leak_suspected     BOOLEAN             — secondary smell test: true if directional_accuracy > 75. Primary leak detection is structural (see Backtest section).
+  ran_at             DATETIME
+
+llm_overlays
+  id                  BIGINT PK
+  ran_at              DATETIME
+  forecast_for_week   DATE                — which weekly forecast it overlays
+  direction           ENUM('rising','falling','flat')
+  confidence          TINYINT UNSIGNED    — capped 75 in code
+  reasoning           TEXT
+  events_json         JSON                — cited events with verified URLs
+  agrees_with_ridge   BOOLEAN
+  major_impact_event  BOOLEAN
+  volatility_flag_on  BOOLEAN             — was the regime flag on at run time
+  search_used         BOOLEAN
+  INDEX (forecast_for_week, ran_at)
+
+volatility_regimes
+  id                  BIGINT PK
+  flipped_on_at       DATETIME
+  flipped_off_at      DATETIME NULL
+  trigger             ENUM('brent_move','llm_event','station_churn','manual')
+  trigger_detail      TEXT                — e.g. "Brent +4.2% close-to-close"
+  active              BOOLEAN
+
+watched_events
+  id                  BIGINT PK
+  label               VARCHAR(128)
+  starts_at           DATETIME
+  ends_at             DATETIME
+  notes               TEXT
+```
+
+### Keep
+
+- `weekly_pump_prices` — already loaded, source of truth
+- `stations`, `station_prices_current` — for Layer 2
+- `station_prices` — keep collecting forward, not modelled in v1
+
+### Deprecate (delete after Layer 1 ships)
+
+- `price_predictions` — old LLM/EWMA store, replaced by `weekly_forecasts`
+
+The current six-signal aggregator (`NationalFuelPredictionService` and
+`app/Services/Prediction/Signals/*`) is **fully replaced**, not extended.
+Same JSON output keys (`predicted_direction`, `confidence_score`,
+`action`, `reasoning`) so the Vue frontend doesn't break — engine swapped,
+contract preserved.
+
+---
+
+## Implementation phases (each ships something working)
+
+| Phase | Scope | Ships |
+|---|---|---|
+| **1. Backtest harness** | `BacktestRunner` service + `backtests` table. Takes a model class, train/eval split, returns directional accuracy + MAE + calibration curve. **Structural leak detection** built in (per-feature source-timestamp check vs target Monday); accuracy>75% smell test as secondary. | A way to *prove* any future model works before shipping it. |
+| **2. Naive baseline** | "Predict next week = this week" implemented as a model class. Run through harness. | A floor: any future model must beat this. |
+| **3. v1 ridge model** | Features above (incl. mean-reversion term), trained once, persisted with `model_version`. `WeeklyForecastService` runs it. Backtest must clear the acceptance gate. | First real forecast. Backtested numbers visible. |
+| **4. Live wiring** | Replace `NationalFuelPredictionService` internals with a thin adapter delegating to `WeeklyForecastService`. Same API shape, new engine. | Frontend keeps working, predictions now from the new model. |
+| **5. Local snapshot** | `LocalSnapshotService` — pure aggregates. Wire into `/api/stations` payload alongside the headline forecast. | "Your area" descriptive cards. |
+| **6. Honesty layer** | Reasoning generator describes *what the model used*: lag values, season, holiday flag. Shows backtest accuracy badge. Returns explicit "not enough data" when confidence < 40. Surfaces the duty-change-adjacent flag when set. | The "no BS" framing. |
+| **7. Brent backfill + daily refresh** | One FRED call (2018→today, ~2,150 daily rows). Daily refresh cron at **06:30 UK** (must complete before Phase 8's 07:00 LLM call — sequenced so the LLM has fresh Brent context). Used by Phase 9's volatility detector and as a feature option for future model iterations (only added to the ridge model if backtested lift is ≥3 percentage points directional accuracy). | Daily Brent in DB. Foundation for volatility + LLM context. |
+| **8. LLM news overlay** | `LlmOverlayService` — single scheduled call at **07:00 UK** (after Brent refresh). Plus event-driven calls when Layer 5 flips the volatility flag on, with 4h cooldown. JSON in / JSON out, web search enabled, results stored in `llm_overlays`. Feeds Layer 3's gate 4 (suppress when LLM disagrees AND ridge_confidence < 75) and the `agrees`/`conflicting` badges. URL-verification + empty-citation rejection enforced in code. **Depends on Phase 7.** | News-aware verdict suppression and badge on top of the calibrated ridge baseline. |
+| **9. Volatility regime detector** | `VolatilityRegimeService` — hourly cron, sole owner of `volatility_regimes.active`. OR-combines four triggers: Brent move > 3%, LLM `major_impact_event`, station churn > 1.5× baseline (**gated until ≥180 days of stable polling**), `watched_events` row covering today. Fires Layer 3's gate 3 (verdict → `no_signal` with `volatile` badge) and the event-driven Layer 4 refresh. | The intra-week safety net for oil shocks. |
+
+---
+
+## Backtest acceptance gates (Phase 3 → Phase 4)
+
+| Backtest result | Action |
+|---|---|
+| < 60% directional accuracy | Features are wrong. Stay in Phase 3, don't ship. |
+| 60–62% | Marginal. One feature iteration, then re-evaluate. |
+| **62–68%** | **Ship.** Realistic target for UK weekly pump direction without Brent. |
+| 68–75% | Excellent. Ship and watch closely. |
+| > 75% | **Stop.** Run the structural leak detector. Almost certainly time leakage (e.g. using `t+1` info accidentally in `t` features). The accuracy threshold is a secondary smell test, not the primary detector. |
+| MAE > 1.0p / litre | Features are noisy. Refit before shipping. |
+| Target MAE | 0.4–0.7p / litre. |
+
+### Structural leak detection (primary)
+
+Built into the backtest harness. For every (training_week, feature_value)
+pair, the harness verifies the data source's effective timestamp is
+**strictly before** the target Monday. Any feature whose source timestamp
+is on or after the target week is treated as leakage and the backtest
+fails fast. This is independent of accuracy — it catches leakage even
+when it doesn't translate into suspiciously high accuracy.
+
+The `> 75% accuracy` row is a secondary smell test for leakage modes the
+structural check missed (e.g. label leakage via a downstream computed
+column). Primary defence is the timestamp check. These numbers are
+encoded in the harness as assertions, not aspirations.
+
+---
+
+## Honesty rules — non-negotiables
+
+1. Backtest accuracy is **published in the UI**. The model wears its track
+   record on its sleeve.
+2. Below 40 confidence, the recommendation is `no_signal` and the reasoning
+   says "we don't have enough signal to call it" — explicitly. No filler.
+3. When duty-change-adjacent weeks affect the forecast, surface the flag
+   ("forecast may be skewed by recent duty change").
+4. Reasoning text only references features the model actually used — no
+   narrative invention. If the mean-reversion term drove the call, say so
+   ("Pump prices are 3.1p above their 8-week average, and prices typically
+   pull back from that level"). If the seasonality term drove it, say so.
+5. `forecast_outcomes` is populated automatically when the next BEIS week
+   lands. Hit rate over the trailing 13 weeks is shown next to the headline.
+6. When the **volatility regime flag** is on, the UI shows the `volatile`
+   badge and the trigger (e.g. "Brent up 4.2% yesterday — forecast may be
+   stale within days"). Verdict is suppressed visibly via gate 3, never
+   silently.
+7. The LLM overlay is **shown separately** from the ridge model, never
+   blended. "Model says down (68%); news scan agrees, citing {event}" —
+   the `ridge_confidence` number stays calibrated and untouched, while
+   LLM and volatility status are presented as their own badges.
+8. LLM citations with unreachable URLs are **dropped from the displayed
+   reasoning** but kept in `llm_overlays.events_json` for audit. We never
+   show a citation we haven't verified.
+
+---
+
+## What gets deleted at the end of Phase 4
+
+- `app/Services/Prediction/Signals/*` (whole directory)
+- `NationalFuelPredictionService` internals (kept as a thin wrapper, then
+  renamed when the frontend migration completes)
+- `price_predictions` table — replaced by `weekly_forecasts` (ridge) +
+  `llm_overlays` (news layer)
+- `OilPriceService::generatePrediction()`, EWMA/LLM helpers — replaced by
+  `LlmOverlayService` (Phase 8) which has a different contract
+- `OilPriceService::fetchBrentPrices()` — kept and **expanded** in Phase 7
+  (backfill mode + daily refresh), not deleted
+- `.claude/rules/scoring.md` retired in favour of a fresh
+  `.claude/rules/forecasting.md`
+- `.claude/rules/prediction.md` rewritten to match the new architecture
+
+---
+
+## Open decisions (to confirm before Phase 1)
+
+- **Forecast cadence** — the *forecast itself* is weekly (matches BEIS
+  publication). The *confidence and presentation* update daily via Layer 4
+  (LLM) and Layer 5 (volatility regime). This split is deliberate — we
+  refuse to fabricate intra-week movement, but we don't pretend a static
+  Monday number is reliable on Thursday after a 6% Brent move.
+- **Scope** — drop the six-signal aggregator entirely, confirmed.
+- **API shape** — keep existing JSON output keys so Vue keeps working,
+  with the engine swapped under the hood. The original `confidence_score`
+  field maps to `ridge_confidence` (calibrated, untouched). Add new
+  fields: `volatility` (`{active, trigger}`), `news_overlay`
+  (`{direction, agreement, events}`), and `verdict_reason` (which gate
+  fired, if any). The verdict itself goes in the existing `action` field.
+- **Brent** — promoted to Phase 7 (was "optional, conditional"). Needed
+  for the volatility detector, regardless of whether it's used in the
+  ridge model.
+- **LLM** — Anthropic Claude Haiku with web search. Single scheduled call
+  at 07:00 UK (after the 06:30 Brent refresh). Plus event-driven refreshes
+  when Layer 5 flips the volatility flag on, with a 4h cooldown. No fixed
+  afternoon cron — by 13:00 UK, morning users have already made their
+  fill-up decisions, so the value is too low to justify the extra noise.
+  Hard confidence cap 75. Empty-citation rejection.
+
+---
+
+## Changelog (substantive design decisions)
+
+| When | Change | Why |
+|---|---|---|
+| 2026-05-01 v1 | Initial spec — three layers, six-signal aggregator removed, ridge model on BEIS weekly data | Replace incoherent `NationalFuelPredictionService` |
+| 2026-05-01 v2 | Added Layer 4 (LLM news overlay) and Layer 5 (volatility regime detector). Pump prices can move daily during oil shocks; static weekly forecast must be backed by intra-week safety nets. | Iran/Hormuz-style shocks make a Monday-only confidence number stale by Wednesday |
+| 2026-05-01 v3 | **Verdict via rule gates, not multipliers.** `ridge_confidence` displayed verbatim. LLM and volatility presented as badges. `weeks_since_duty_change` removed from features (kept as calibration override only — n=1 can't fit a coefficient). Backtest gate floor lowered 65 → 62 (realistic without Brent). Structural leak detection (per-feature timestamp check) made primary; accuracy>75% demoted to secondary smell test. `weekly_forecasts` PK changed to `(forecast_for, model_version)` to preserve audit on retrain. `forecast_outcomes` made three-class. Layer 5 station-churn trigger gated until ≥180 days of stable polling. | Multipliers obscure calibration. Gates compose cleanly and stay auditable. |
+
+---
+
+## References
+
+- Alquist, Kilian, Vigfusson (2013) — *Forecasting the Price of Oil* —
+  the academic basis for "no-change baseline beats most structural models
+  at <6m horizons" (which is why Phase 2 matters as a hard floor).
+- BEIS *Weekly road fuel prices* CSV — the 435-week training set.
+- `.claude/rules/scoring.md`, `.claude/rules/prediction.md` — the two
+  inconsistent rule files this spec replaces.