From 25cf022964ce9dd2be31bdd62db927c3ed3e8177 Mon Sep 17 00:00:00 2001 From: Ovidiu U Date: Fri, 1 May 2026 13:23:10 +0100 Subject: [PATCH] =?UTF-8?q?feat:=20add=20prediction=20rebuild=20design=20s?= =?UTF-8?q?pec=20=E2=80=94=20Layer=201=20ridge=20model,=20LLM=20news=20ove?= =?UTF-8?q?rlay,=20volatility=20regime=20detector?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Documents complete replacement of six-signal aggregator with calibrated ridge forecaster trained on 435 weeks of BEIS pump prices. Five-layer architecture: weekly baseline (Layer 1), local snapshot (Layer 2), rule-gated verdict merger (Layer 3), daily LLM news --- .../2026-05-01-prediction-rebuild-design.md | 618 ++++++++++++++++++ 1 file changed, 618 insertions(+) create mode 100644 docs/superpowers/specs/2026-05-01-prediction-rebuild-design.md diff --git a/docs/superpowers/specs/2026-05-01-prediction-rebuild-design.md b/docs/superpowers/specs/2026-05-01-prediction-rebuild-design.md new file mode 100644 index 0000000..ba264e8 --- /dev/null +++ b/docs/superpowers/specs/2026-05-01-prediction-rebuild-design.md @@ -0,0 +1,618 @@ +# Prediction Rebuild — Design Spec + +## Context + +The current prediction service (`NationalFuelPredictionService` + six signal +classes) produces output the user has repeatedly described as "doesn't make +sense": headlines that contradict their own reasoning text, weights that +nobody can defend a number on, and confidence values that aren't grounded in +any track record. Two earlier docs (`.claude/rules/scoring.md`, `.claude/rules/prediction.md`) +disagree on the weights of the same signals, which is itself evidence that +the design has drifted. + +This spec replaces the entire prediction stack from scratch around the +historical data we actually have, with a model whose confidence values are +calibrated against its own backtested track record. + +Goals: +- A "fill up now or wait?" call honest about uncertainty. +- Confidence values calibrated against backtested residuals — "70%" actually + means "in 7 of every 10 cases like this, the model called direction right". +- Simple enough to debug a year from now. +- Remove the six-signal aggregator entirely. +- Recognise that pump prices, while *measured* weekly by BEIS, can *move* daily + during oil shocks (Iran, OPEC surprise cuts, Hormuz disruption). The static + weekly forecast must be backed by a daily news/event overlay so we can flag + staleness in real time rather than pretend a Monday number is still valid on + Thursday after a 6% Brent move. + +--- + +## Inputs (audited 2026-05-01) + +| Source | Status | Use in v1 | +|---|---|---| +| `weekly_pump_prices` | 435 weeks, all Mondays, 0 outliers, 1 duty change (Mar 2022, 57.95p → 52.95p), VAT stable at 20% | **Foundation** — train Layer 1 | +| `station_prices_current` | ~7,550 stations × e10, ~7,620 × b7_standard | **Layer 2** — descriptive snapshot | +| `stations` | 7,747 stations, 1,989 supermarkets, lat/lng | Layer 2 | +| `station_prices` | 75 days of changes since 2026-01-16, sample mix uneven per day | Not modelled in v1, but **used by the volatility regime detector** as a churn indicator (% stations changing price / day vs 30-day baseline). | +| `brent_prices` | 30 days only | **Backfilled in Phase 7** (8 years from FRED, single API call). Used as a Brent-move volatility trigger and as fuel for the daily LLM overlay. | + +The Fuel Finder API has been confirmed empirically to have **no historical +archive** — `effective-start-timestamp` is a station-level filter on current +prices, not a time-window query. Per-station deep history can only accrue +forward from the date polling started. + +--- + +## Architecture — five thin layers + +### Layer 1 — National weekly forecaster (predictive, calibrated) + +Trained once weekly on `weekly_pump_prices`. Output: + +- `direction ∈ {rising, falling, flat}` +- `magnitude_pence` — predicted Δ price next week +- `ridge_confidence` (0–100) — calibrated from backtested residuals, not + from the model's raw output + +This is the **quantitative baseline**. It updates only when the BEIS Monday +publication arrives (so the *forecast itself* changes weekly), but its +*displayed confidence* (Layer 3) is adjusted in real time by Layers 4 and 5. + +`direction = flat` whenever `|magnitude_pence| < FLAT_THRESHOLD`. Phase 3 +picks `FLAT_THRESHOLD` from the backtest residual distribution; the +starting value is **0.2p / litre**. + +### Layer 2 — Local snapshot (descriptive, NOT predictive) + +Pure SQL aggregates against `station_prices_current` + Haversine on +`stations.lat/lng`. No ML, no history, no surprises: + +- `local_avg_50km(fuel_type, lat, lng)` +- `national_avg(fuel_type)` +- `cheapest_within(km, fuel_type, lat, lng)` +- `supermarket_avg_local`, `major_avg_local`, gap + +Layer 2 never speaks about the future. It describes the present. + +### Layer 3 — Verdict merger (rule-based gates, no multipliers) + +Single user-facing verdict ∈ {`fill_now`, `wait`, `no_signal`}. The +displayed confidence number is `ridge_confidence` itself, **untouched**. +LLM agreement and volatility status are shown as separate **badges**, not +blended into the number. Honesty over smoothing. + +Gates evaluated in order, first match wins: + +``` +1. direction == 'flat' → no_signal +2. ridge_confidence < 40 → no_signal +3. volatility_regime active → no_signal (badge: volatile) +4. LLM disagrees AND ridge_confidence < 75 → no_signal (badge: conflicting) +5. rising AND ridge_confidence >= 70 → fill_now +6. falling AND ridge_confidence >= 70 → wait +7. otherwise (40 <= conf < 70, no veto from 3 or 4) → dashboard-only +``` + +Why gates, not multipliers: + +- A multiplied confidence number is a black-box blend that the user can't + audit. A 70% that used to be 90% before today's volatility hit looks + identical to a 70% that's been calibrated all along. +- Gates compose cleanly. Each rule has one job and is independently + testable. +- The verdict is binary anyway (notify / don't / silent). Smoothing + confidence under the hood doesn't help that decision — it only obscures it. + +Layer 2 affects **urgency wording only** ("fill up now, *especially* in +your area at 2p above national"). It never changes the verdict. Neither +does Layer 4 or Layer 5 — they can suppress (gate 3, 4) but never flip +the direction. + +### Layer 4 — Daily LLM news overlay (qualitative, news-aware) + +**Single scheduled call at 07:00 UK.** Plus an event-driven refresh when +Layer 5's volatility flag flips ON (with a 4-hour cooldown so the same +event doesn't trigger repeatedly). + +JSON in, JSON out. Calls Claude Haiku with web search enabled, asks for +direction + confidence + cited events with URLs. Stored in a new +`llm_overlays` table. + +Layer 4 is **read-only with respect to the volatility flag**. It writes +its result row; only Layer 5 mutates `volatility_regimes.active`. + +LLM confidence is hard-capped at 75 in code (web-searched LLMs are +systematically overconfident). Calls without `events_cited` are rejected. + +### Layer 5 — Volatility regime detector (intra-week safety net) + +Hourly cron. **Sole owner** of the `volatility_regimes.active` flag. +Reads four signals, OR-combined: + +1. Daily Brent move > 3% close-to-close (FRED `DCOILBRENTEU`, Phase 7). +2. Most recent `llm_overlays.major_impact_event = true` AND at least one + verified URL. +3. `station_prices` daily churn rate > 1.5× its 30-day baseline. +4. A `watched_events` row covering today (manually flagged geopolitical + periods). + +When the flag flips on: +- An event-driven LLM refresh is queued (Layer 4) if last run was > 4h ago. +- **Layer 3's gate 3 fires**: verdict forced to `no_signal` with the + `volatile` badge. +- The reasoning text appended: *"Volatility detected ({trigger}) — this + forecast may be stale within days."* + +When it flips off: +- Verdict returns to whatever the gates produce on the unchanged + `ridge_confidence` (no multiplier to reset — there are none). +- Badge cleared. +- Next morning's 07:00 LLM call still runs (it always runs); no extra + refreshes are queued. + +Layer 5 never changes Layer 1's *direction*. It only suppresses the +verdict via gate 3. + +--- + +## Methodology — Layer 1 + +### Target + +``` +ΔULSP[t+1] = ULSP[t+1] − ULSP[t] +``` + +We model the **change**, not the level. UK pump prices are non-stationary, +so regressing on levels gives spurious R² and useless coefficients. +Differencing makes the series stationary. + +### Features (all stationary) + +| Feature | Notes | +|---|---| +| `Δulsp_lag_0`, `Δulsp_lag_1`, `Δulsp_lag_3` | 1w / 2w / 4w momentum | +| `Δulsd_lag_0` | Diesel cross-signal as a *change* | +| `ulsp[t] − ma8[t]` | **Mean-reversion term** — gap between current price and 8-week MA. Single most useful feature for 1-week-ahead UK pump forecast. | +| `week_of_year_sin`, `week_of_year_cos` | Cyclic seasonality encoding | +| `is_pre_bank_holiday` | Boolean, within 7 days of UK bank holiday | + +The level only enters as the deviation from MA-8 (itself stationary). +That's the only way levels are allowed in. + +**Duty change is NOT a feature.** With one event in 435 weeks, n=1 cannot +fit a meaningful coefficient. Instead, duty-change-adjacent weeks (±4 +weeks of a known change) are handled in the **calibration override** +(see below) — confidence is halved and the regime flag is surfaced in +the reasoning text. A regime can be flagged. A coefficient cannot be +trained from one observation. + +### Model + +Ridge regression. Boring on purpose: + +- 435 weekly observations is too few to beat a well-specified linear model + out-of-sample with gradient boosting or LSTM — those would just fit noise. +- Interpretable coefficients are essential for the honesty layer + (the reasoning text describes what the model used). + +Upgrade to a non-linear model **only** if Phase 3 backtest demonstrates the +linear model is missing real structure. + +### Training and evaluation split + +- Train on weeks 1–305 (~70%). +- Evaluate on weeks 306–435 (~30%) with rolling-origin cross-validation + (single-split would overfit hyperparameters to one window). + +### Confidence calibration + +Two-stage calibration: + +1. **Magnitude binning** — bin predictions by predicted `|magnitude|` and + record actual hit rate per bin. The published `confidence_score` reads + from this lookup, not from the model's raw output. +2. **Regime flag** — flag any forecast week within ±4 weeks of a known + duty change. With only one duty change in 435 weeks, statistical + stratification at n=1 is impossible. Instead: + - For flagged weeks, halve the calibrated confidence manually. + - Surface the flag in the reasoning text: *"Recent duty change — + forecast accuracy is reduced for the next several weeks."* + +This is the only place v1 accepts a hand-tuned guard, and it's there +because the data can't tell us better. + +--- + +## Methodology — Layer 2 + +Pure aggregates. No model. + +```sql +-- Local 50km average +SELECT AVG(price_pence) FROM station_prices_current +JOIN stations ON station_prices_current.station_id = stations.node_id +WHERE fuel_type = ? AND ; + +-- National average +SELECT AVG(price_pence) FROM station_prices_current WHERE fuel_type = ?; + +-- Cheapest within 25km +SELECT stations.*, station_prices_current.price_pence +FROM station_prices_current +JOIN stations ON station_prices_current.station_id = stations.node_id +WHERE fuel_type = ? AND +ORDER BY price_pence ASC LIMIT 5; + +-- Supermarket vs major split, locally +SELECT stations.is_supermarket, AVG(price_pence) +FROM station_prices_current +JOIN stations ON station_prices_current.station_id = stations.node_id +WHERE fuel_type = ? AND +GROUP BY stations.is_supermarket; +``` + +Output is descriptive: "Your area is X p above national average right +now", "Cheapest near you: {station} at {price}", "Supermarkets near you: +{avg} vs majors: {avg}". **Never** predictive language. + +--- + +## Methodology — Layer 3 + +Full gate ordering is in the Architecture section (Layer 3). Summary: + +- Verdict via ordered rule gates, **not** multipliers. +- `ridge_confidence` is displayed verbatim — never multiplied. +- Volatility flag and LLM disagreement act as **suppressors with badges** + (`volatile`, `conflicting`) but never flip direction. +- `direction == 'flat'` always produces `no_signal`. +- LLM disagreement only suppresses the verdict when `ridge_confidence < 75`. + Above 75 the model's call is strong enough to stand even with a news-scan + disagreement (the LLM is hard-capped at 75 confidence anyway, so it + can't out-confidence the ridge model — only flag a tension). + +Local position from Layer 2 modifies urgency wording only: + +- If user's local average is materially above national (>2p), and Layer 1 + says "rising", urgency increased ("fill up now, *especially* in your area"). +- Layer 2 never flips Layer 1's direction. + +--- + +## Methodology — Layer 4 (LLM news overlay) + +Single scheduled call daily at 07:00 UK. Additional event-driven calls +are queued by Layer 5 when the volatility flag flips ON, with a 4-hour +cooldown enforced in code (skip the queue if the most recent +`llm_overlays.ran_at` is within 4 hours). + +**Brent input** (`brent_recent_14_days`) is optional — passed as `null` +until Phase 7 backfills `brent_prices`. Phase 8 cannot ship before +Phase 7 — explicit dependency. + +### Request shape (JSON) + +```json +{ + "input": { + "ulsp_recent_8_weeks": [...], + "brent_recent_14_days": [...], + "current_week_of_year": 18, + "days_to_next_bank_holiday": 5, + "duty_pence": 52.95, + "ridge_model_says": { + "direction": "down", + "confidence": 68, + "magnitude_pence": -0.4 + } + }, + "ask": "Search recent news for oil-supply, OPEC, refinery, shipping, sanctions, geopolitical events affecting UK retail fuel prices over the next 1-2 weeks. Reply ONLY in the schema below." +} +``` + +### Response shape (JSON, enforced) + +```json +{ + "direction": "rising | falling | flat", + "confidence": 0, + "reasoning_short": "1-2 sentences", + "events_cited": [ + {"headline": "...", "source": "...", "url": "...", "impact": "rising|falling|neutral"} + ], + "agrees_with_ridge": true, + "major_impact_event": false +} +``` + +### Code-level guards (not in the prompt) + +1. **Cap `confidence` at 75.** Web-searched LLMs are systematically overconfident. +2. **Reject the response if `events_cited` is empty.** Forces the LLM to + ground its call in something checkable, not vibes. +3. **Verify each `url` in `events_cited` is reachable** before storing. + Catches hallucinated citations. Failed URLs blank the citation but + don't reject the call (newer URLs sometimes 404 briefly). +4. **Layer 4 does NOT mutate `volatility_regimes.active`.** It writes its + row to `llm_overlays` (with `major_impact_event` + verified URLs) and + that's it. Layer 5's hourly cron picks up the new row and decides + whether to flip the flag. + +### How Layer 3 uses it + +- LLM agrees → no gating effect; `agrees` badge shown next to the verdict + ("News scan agrees, citing {event}"). +- LLM disagrees AND `ridge_confidence < 75` → **gate 4 fires**: verdict + forced to `no_signal` with the `conflicting` badge. +- LLM disagrees AND `ridge_confidence >= 75` → no suppression; the + disagreement is shown as a badge but the model's strong call stands. +- LLM neutral / flat → no gating effect. +- Direction is never flipped by the LLM. + +--- + +## Methodology — Layer 5 (volatility regime detector) + +Hourly cron. **Sole owner** of `volatility_regimes.active`. Reads four +signals, OR-combined: + +1. **Brent move** — close-to-close daily Brent move > 3% on FRED + `DCOILBRENTEU`. FRED publishes with a one-day lag (today's value is + yesterday's settle), so the trigger reflects the most recent settled + day. Sufficient for v1 — we don't have a real-time Brent feed. +2. **LLM major-impact flag** — most recent `llm_overlays` row has + `major_impact_event = true` AND at least one verified URL. +3. **Station churn** — *gated until ≥180 days of stable polling.* The + trigger fires when the last-24h % of stations updating price exceeds + 1.5× the 30-day rolling baseline. With only 75 days of uneven polling + (Jan 16 → May 1) the baseline is meaningless — sample-mix variance + would dominate any real shock signal. The trigger is implemented but + disabled in code via a feature flag; flip it on once `station_prices` + has 180+ continuous days. +4. **Manual `watched_events`** — a row covering today. Lets you flag + known geopolitical periods manually (e.g. "Iran tensions Apr–May 2026"). + +When the flag flips on: + +- An event-driven Layer 4 LLM refresh is queued (skipped if the most + recent `llm_overlays.ran_at` is within 4 hours — cooldown). +- **Layer 3's gate 3 fires**: verdict forced to `no_signal` with the + `volatile` badge for as long as the flag stays on. +- Reasoning text appended: *"Volatility detected ({trigger label}) — this + forecast may be stale within days."* + +When it flips off: +- Verdict returns to whatever the gates produce on the unchanged + `ridge_confidence` (no multiplier reset needed — there are no multipliers). +- Badge cleared. +- The next morning's 07:00 LLM call still runs (always does); no extra + refreshes are queued by Layer 5. + +--- + +## Schema deltas + +### Add + +``` +weekly_forecasts + id BIGINT PK + forecast_for DATE — Monday the forecast covers + model_version VARCHAR(32) — links back to backtests row + direction ENUM('rising','falling','flat') + magnitude_pence SMALLINT — predicted Δ × 100, signed + ridge_confidence TINYINT UNSIGNED — 0..100, calibrated from backtested residuals. Displayed verbatim. Layer 3 gates may suppress the verdict but never modify this number. + flagged_duty_change BOOLEAN — true if forecast is within ±4 weeks of a duty change (avoids collision with Layer 5's volatility_regimes) + reasoning TEXT — generated from features actually used + generated_at DATETIME + UNIQUE (forecast_for, model_version) + INDEX (forecast_for, generated_at DESC) + +forecast_outcomes + forecast_for DATE + model_version VARCHAR(32) + predicted_class ENUM('rising','falling','flat') + actual_class ENUM('rising','falling','flat') + correct BOOLEAN + abs_error_pence SMALLINT UNSIGNED + resolved_at DATETIME + PRIMARY KEY (forecast_for, model_version) + +backtests + id BIGINT PK + model_version VARCHAR(32) UNIQUE + features_json JSON — feature spec + train_start DATE + train_end DATE + eval_start DATE + eval_end DATE + directional_accuracy DECIMAL(5,2) + mae_pence DECIMAL(5,2) + calibration_table JSON — {bin_low..bin_high → empirical_hit_rate} + leak_suspected BOOLEAN — secondary smell test: true if directional_accuracy > 75. Primary leak detection is structural (see Backtest section). + ran_at DATETIME + +llm_overlays + id BIGINT PK + ran_at DATETIME + forecast_for_week DATE — which weekly forecast it overlays + direction ENUM('rising','falling','flat') + confidence TINYINT UNSIGNED — capped 75 in code + reasoning TEXT + events_json JSON — cited events with verified URLs + agrees_with_ridge BOOLEAN + major_impact_event BOOLEAN + volatility_flag_on BOOLEAN — was the regime flag on at run time + search_used BOOLEAN + INDEX (forecast_for_week, ran_at) + +volatility_regimes + id BIGINT PK + flipped_on_at DATETIME + flipped_off_at DATETIME NULL + trigger ENUM('brent_move','llm_event','station_churn','manual') + trigger_detail TEXT — e.g. "Brent +4.2% close-to-close" + active BOOLEAN + +watched_events + id BIGINT PK + label VARCHAR(128) + starts_at DATETIME + ends_at DATETIME + notes TEXT +``` + +### Keep + +- `weekly_pump_prices` — already loaded, source of truth +- `stations`, `station_prices_current` — for Layer 2 +- `station_prices` — keep collecting forward, not modelled in v1 + +### Deprecate (delete after Layer 1 ships) + +- `price_predictions` — old LLM/EWMA store, replaced by `weekly_forecasts` + +The current six-signal aggregator (`NationalFuelPredictionService` and +`app/Services/Prediction/Signals/*`) is **fully replaced**, not extended. +Same JSON output keys (`predicted_direction`, `confidence_score`, +`action`, `reasoning`) so the Vue frontend doesn't break — engine swapped, +contract preserved. + +--- + +## Implementation phases (each ships something working) + +| Phase | Scope | Ships | +|---|---|---| +| **1. Backtest harness** | `BacktestRunner` service + `backtests` table. Takes a model class, train/eval split, returns directional accuracy + MAE + calibration curve. **Structural leak detection** built in (per-feature source-timestamp check vs target Monday); accuracy>75% smell test as secondary. | A way to *prove* any future model works before shipping it. | +| **2. Naive baseline** | "Predict next week = this week" implemented as a model class. Run through harness. | A floor: any future model must beat this. | +| **3. v1 ridge model** | Features above (incl. mean-reversion term), trained once, persisted with `model_version`. `WeeklyForecastService` runs it. Backtest must clear the acceptance gate. | First real forecast. Backtested numbers visible. | +| **4. Live wiring** | Replace `NationalFuelPredictionService` internals with a thin adapter delegating to `WeeklyForecastService`. Same API shape, new engine. | Frontend keeps working, predictions now from the new model. | +| **5. Local snapshot** | `LocalSnapshotService` — pure aggregates. Wire into `/api/stations` payload alongside the headline forecast. | "Your area" descriptive cards. | +| **6. Honesty layer** | Reasoning generator describes *what the model used*: lag values, season, holiday flag. Shows backtest accuracy badge. Returns explicit "not enough data" when confidence < 40. Surfaces the duty-change-adjacent flag when set. | The "no BS" framing. | +| **7. Brent backfill + daily refresh** | One FRED call (2018→today, ~2,150 daily rows). Daily refresh cron at **06:30 UK** (must complete before Phase 8's 07:00 LLM call — sequenced so the LLM has fresh Brent context). Used by Phase 9's volatility detector and as a feature option for future model iterations (only added to the ridge model if backtested lift is ≥3 percentage points directional accuracy). | Daily Brent in DB. Foundation for volatility + LLM context. | +| **8. LLM news overlay** | `LlmOverlayService` — single scheduled call at **07:00 UK** (after Brent refresh). Plus event-driven calls when Layer 5 flips the volatility flag on, with 4h cooldown. JSON in / JSON out, web search enabled, results stored in `llm_overlays`. Feeds Layer 3's gate 4 (suppress when LLM disagrees AND ridge_confidence < 75) and the `agrees`/`conflicting` badges. URL-verification + empty-citation rejection enforced in code. **Depends on Phase 7.** | News-aware verdict suppression and badge on top of the calibrated ridge baseline. | +| **9. Volatility regime detector** | `VolatilityRegimeService` — hourly cron, sole owner of `volatility_regimes.active`. OR-combines four triggers: Brent move > 3%, LLM `major_impact_event`, station churn > 1.5× baseline (**gated until ≥180 days of stable polling**), `watched_events` row covering today. Fires Layer 3's gate 3 (verdict → `no_signal` with `volatile` badge) and the event-driven Layer 4 refresh. | The intra-week safety net for oil shocks. | + +--- + +## Backtest acceptance gates (Phase 3 → Phase 4) + +| Backtest result | Action | +|---|---| +| < 60% directional accuracy | Features are wrong. Stay in Phase 3, don't ship. | +| 60–62% | Marginal. One feature iteration, then re-evaluate. | +| **62–68%** | **Ship.** Realistic target for UK weekly pump direction without Brent. | +| 68–75% | Excellent. Ship and watch closely. | +| > 75% | **Stop.** Run the structural leak detector. Almost certainly time leakage (e.g. using `t+1` info accidentally in `t` features). The accuracy threshold is a secondary smell test, not the primary detector. | +| MAE > 1.0p / litre | Features are noisy. Refit before shipping. | +| Target MAE | 0.4–0.7p / litre. | + +### Structural leak detection (primary) + +Built into the backtest harness. For every (training_week, feature_value) +pair, the harness verifies the data source's effective timestamp is +**strictly before** the target Monday. Any feature whose source timestamp +is on or after the target week is treated as leakage and the backtest +fails fast. This is independent of accuracy — it catches leakage even +when it doesn't translate into suspiciously high accuracy. + +The `> 75% accuracy` row is a secondary smell test for leakage modes the +structural check missed (e.g. label leakage via a downstream computed +column). Primary defence is the timestamp check. These numbers are +encoded in the harness as assertions, not aspirations. + +--- + +## Honesty rules — non-negotiables + +1. Backtest accuracy is **published in the UI**. The model wears its track + record on its sleeve. +2. Below 40 confidence, the recommendation is `no_signal` and the reasoning + says "we don't have enough signal to call it" — explicitly. No filler. +3. When duty-change-adjacent weeks affect the forecast, surface the flag + ("forecast may be skewed by recent duty change"). +4. Reasoning text only references features the model actually used — no + narrative invention. If the mean-reversion term drove the call, say so + ("Pump prices are 3.1p above their 8-week average, and prices typically + pull back from that level"). If the seasonality term drove it, say so. +5. `forecast_outcomes` is populated automatically when the next BEIS week + lands. Hit rate over the trailing 13 weeks is shown next to the headline. +6. When the **volatility regime flag** is on, the UI shows the `volatile` + badge and the trigger (e.g. "Brent up 4.2% yesterday — forecast may be + stale within days"). Verdict is suppressed visibly via gate 3, never + silently. +7. The LLM overlay is **shown separately** from the ridge model, never + blended. "Model says down (68%); news scan agrees, citing {event}" — + the `ridge_confidence` number stays calibrated and untouched, while + LLM and volatility status are presented as their own badges. +8. LLM citations with unreachable URLs are **dropped from the displayed + reasoning** but kept in `llm_overlays.events_json` for audit. We never + show a citation we haven't verified. + +--- + +## What gets deleted at the end of Phase 4 + +- `app/Services/Prediction/Signals/*` (whole directory) +- `NationalFuelPredictionService` internals (kept as a thin wrapper, then + renamed when the frontend migration completes) +- `price_predictions` table — replaced by `weekly_forecasts` (ridge) + + `llm_overlays` (news layer) +- `OilPriceService::generatePrediction()`, EWMA/LLM helpers — replaced by + `LlmOverlayService` (Phase 8) which has a different contract +- `OilPriceService::fetchBrentPrices()` — kept and **expanded** in Phase 7 + (backfill mode + daily refresh), not deleted +- `.claude/rules/scoring.md` retired in favour of a fresh + `.claude/rules/forecasting.md` +- `.claude/rules/prediction.md` rewritten to match the new architecture + +--- + +## Open decisions (to confirm before Phase 1) + +- **Forecast cadence** — the *forecast itself* is weekly (matches BEIS + publication). The *confidence and presentation* update daily via Layer 4 + (LLM) and Layer 5 (volatility regime). This split is deliberate — we + refuse to fabricate intra-week movement, but we don't pretend a static + Monday number is reliable on Thursday after a 6% Brent move. +- **Scope** — drop the six-signal aggregator entirely, confirmed. +- **API shape** — keep existing JSON output keys so Vue keeps working, + with the engine swapped under the hood. The original `confidence_score` + field maps to `ridge_confidence` (calibrated, untouched). Add new + fields: `volatility` (`{active, trigger}`), `news_overlay` + (`{direction, agreement, events}`), and `verdict_reason` (which gate + fired, if any). The verdict itself goes in the existing `action` field. +- **Brent** — promoted to Phase 7 (was "optional, conditional"). Needed + for the volatility detector, regardless of whether it's used in the + ridge model. +- **LLM** — Anthropic Claude Haiku with web search. Single scheduled call + at 07:00 UK (after the 06:30 Brent refresh). Plus event-driven refreshes + when Layer 5 flips the volatility flag on, with a 4h cooldown. No fixed + afternoon cron — by 13:00 UK, morning users have already made their + fill-up decisions, so the value is too low to justify the extra noise. + Hard confidence cap 75. Empty-citation rejection. + +--- + +## Changelog (substantive design decisions) + +| When | Change | Why | +|---|---|---| +| 2026-05-01 v1 | Initial spec — three layers, six-signal aggregator removed, ridge model on BEIS weekly data | Replace incoherent `NationalFuelPredictionService` | +| 2026-05-01 v2 | Added Layer 4 (LLM news overlay) and Layer 5 (volatility regime detector). Pump prices can move daily during oil shocks; static weekly forecast must be backed by intra-week safety nets. | Iran/Hormuz-style shocks make a Monday-only confidence number stale by Wednesday | +| 2026-05-01 v3 | **Verdict via rule gates, not multipliers.** `ridge_confidence` displayed verbatim. LLM and volatility presented as badges. `weeks_since_duty_change` removed from features (kept as calibration override only — n=1 can't fit a coefficient). Backtest gate floor lowered 65 → 62 (realistic without Brent). Structural leak detection (per-feature timestamp check) made primary; accuracy>75% demoted to secondary smell test. `weekly_forecasts` PK changed to `(forecast_for, model_version)` to preserve audit on retrain. `forecast_outcomes` made three-class. Layer 5 station-churn trigger gated until ≥180 days of stable polling. | Multipliers obscure calibration. Gates compose cleanly and stay auditable. | + +--- + +## References + +- Alquist, Kilian, Vigfusson (2013) — *Forecasting the Price of Oil* — + the academic basis for "no-change baseline beats most structural models + at <6m horizons" (which is why Phase 2 matters as a hard floor). +- BEIS *Weekly road fuel prices* CSV — the 435-week training set. +- `.claude/rules/scoring.md`, `.claude/rules/prediction.md` — the two + inconsistent rule files this spec replaces.