fuel-price/docs/superpowers/specs/2026-05-01-prediction-rebuild-design.md
Ovidiu U 25cf022964
feat: add prediction rebuild design spec — Layer 1 ridge model, LLM news overlay, volatility regime detector
Documents complete replacement of six-signal aggregator with calibrated
ridge forecaster trained on 435 weeks of BEIS pump prices. Five-layer
architecture: weekly baseline (Layer 1), local snapshot (Layer 2),
rule-gated verdict merger (Layer 3), daily LLM news
2026-05-01 13:23:10 +01:00


Prediction Rebuild — Design Spec

Context

The current prediction service (NationalFuelPredictionService + six signal classes) produces output the user has repeatedly described as "doesn't make sense": headlines that contradict their own reasoning text, weights nobody can put a defensible number on, and confidence values that aren't grounded in any track record. Two earlier docs (.claude/rules/scoring.md, .claude/rules/prediction.md) disagree on the weights of the same signals, which is itself evidence that the design has drifted.

This spec rebuilds the entire prediction stack from scratch around the historical data we actually have, with a model whose confidence values are calibrated against its own backtested track record.

Goals:

  • A "fill up now or wait?" call that is honest about uncertainty.
  • Confidence values calibrated against backtested residuals — "70%" actually means "in 7 of every 10 cases like this, the model called direction right".
  • Simple enough to debug a year from now.
  • Remove the six-signal aggregator entirely.
  • Recognise that pump prices, while measured weekly by BEIS, can move daily during oil shocks (Iran, OPEC surprise cuts, Hormuz disruption). The static weekly forecast must be backed by a daily news/event overlay so we can flag staleness in real time rather than pretend a Monday number is still valid on Thursday after a 6% Brent move.

Inputs (audited 2026-05-01)

  • weekly_pump_prices — 435 weeks, all Mondays, 0 outliers, 1 duty change (Mar 2022, 57.95p → 52.95p), VAT stable at 20%. Use in v1: foundation — trains Layer 1.
  • station_prices_current — ~7,550 stations × e10, ~7,620 × b7_standard. Use in v1: Layer 2 descriptive snapshot.
  • stations — 7,747 stations, 1,989 supermarkets, lat/lng. Use in v1: Layer 2.
  • station_prices — 75 days of changes since 2026-01-16, sample mix uneven per day. Not modelled in v1, but used by the volatility regime detector as a churn indicator (% of stations changing price per day vs a 30-day baseline).
  • brent_prices — 30 days only. Backfilled in Phase 7 (8 years from FRED, single API call); used as a Brent-move volatility trigger and as input for the daily LLM overlay.

The Fuel Finder API has been confirmed empirically to have no historical archive: effective-start-timestamp is a station-level filter on current prices, not a time-window query. Per-station deep history can only accrue forward from the date polling started.


Architecture — five thin layers

Layer 1 — National weekly forecaster (predictive, calibrated)

Trained once weekly on weekly_pump_prices. Output:

  • direction ∈ {rising, falling, flat}
  • magnitude_pence — predicted Δ price next week
  • ridge_confidence (0–100) — calibrated from backtested residuals, not from the model's raw output

This is the quantitative baseline. It updates only when the BEIS Monday publication arrives (so the forecast itself changes weekly), but its displayed confidence (Layer 3) is adjusted in real time by Layers 4 and 5.

direction = flat whenever |magnitude_pence| < FLAT_THRESHOLD. Phase 3 picks FLAT_THRESHOLD from the backtest residual distribution; the starting value is 0.2p / litre.
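
The flat-band mapping can be sketched as follows (Python purely for illustration; the production service is PHP/Laravel):

```python
FLAT_THRESHOLD = 0.2  # pence/litre starting value; Phase 3 re-derives it from backtest residuals

def direction_from_magnitude(magnitude_pence: float,
                             flat_threshold: float = FLAT_THRESHOLD) -> str:
    """Map a predicted weekly price change (pence/litre) to a direction label."""
    if abs(magnitude_pence) < flat_threshold:
        return "flat"
    return "rising" if magnitude_pence > 0 else "falling"
```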

Layer 2 — Local snapshot (descriptive, NOT predictive)

Pure SQL aggregates against station_prices_current + Haversine on stations.lat/lng. No ML, no history, no surprises:

  • local_avg_50km(fuel_type, lat, lng)
  • national_avg(fuel_type)
  • cheapest_within(km, fuel_type, lat, lng)
  • supermarket_avg_local, major_avg_local, gap

Layer 2 never speaks about the future. It describes the present.

Layer 3 — Verdict merger (rule-based gates, no multipliers)

Single user-facing verdict ∈ {fill_now, wait, no_signal}. The displayed confidence number is ridge_confidence itself, untouched. LLM agreement and volatility status are shown as separate badges, not blended into the number. Honesty over smoothing.

Gates evaluated in order, first match wins:

1. direction == 'flat'                                 → no_signal
2. ridge_confidence < 40                               → no_signal
3. volatility_regime active                            → no_signal  (badge: volatile)
4. LLM disagrees AND ridge_confidence < 75             → no_signal  (badge: conflicting)
5. rising  AND ridge_confidence >= 70                  → fill_now
6. falling AND ridge_confidence >= 70                  → wait
7. otherwise (40 <= conf < 70, no veto from 3 or 4)    → dashboard-only

Why gates, not multipliers:

  • A multiplied confidence number is a black-box blend that the user can't audit. A 70% that used to be 90% before today's volatility hit looks identical to a 70% that's been calibrated all along.
  • Gates compose cleanly. Each rule has one job and is independently testable.
  • The verdict is binary anyway (notify / don't / silent). Smoothing confidence under the hood doesn't help that decision — it only obscures it.

Layer 2 affects urgency wording only ("fill up now, especially in your area at 2p above national"). It never changes the verdict. Neither does Layer 4 or Layer 5 — they can suppress (gate 3, 4) but never flip the direction.
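
The ordered gates above reduce to a single first-match-wins function. A Python sketch (the `dashboard_only` return label is our name for gate 7's outcome; production code is PHP):

```python
def merge_verdict(direction: str, ridge_confidence: int,
                  volatility_active: bool, llm_disagrees: bool):
    """Evaluate the Layer 3 gates in order; first match wins.
    Returns (verdict, badge)."""
    if direction == "flat":                                  # gate 1
        return ("no_signal", None)
    if ridge_confidence < 40:                                # gate 2
        return ("no_signal", None)
    if volatility_active:                                    # gate 3
        return ("no_signal", "volatile")
    if llm_disagrees and ridge_confidence < 75:              # gate 4
        return ("no_signal", "conflicting")
    if direction == "rising" and ridge_confidence >= 70:     # gate 5
        return ("fill_now", None)
    if direction == "falling" and ridge_confidence >= 70:    # gate 6
        return ("wait", None)
    return ("dashboard_only", None)                          # gate 7: 40 <= conf < 70
```

Note that direction is never flipped: the suppressor gates (3, 4) only replace the verdict with no_signal plus a badge.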

Layer 4 — Daily LLM news overlay (qualitative, news-aware)

Single scheduled call at 07:00 UK. Plus an event-driven refresh when Layer 5's volatility flag flips ON (with a 4-hour cooldown so the same event doesn't trigger repeatedly).

JSON in, JSON out. Calls Claude Haiku with web search enabled, asks for direction + confidence + cited events with URLs. Stored in a new llm_overlays table.

Layer 4 is read-only with respect to the volatility flag. It writes its result row; only Layer 5 mutates volatility_regimes.active.

LLM confidence is hard-capped at 75 in code (web-searched LLMs are systematically overconfident). Calls without events_cited are rejected.

Layer 5 — Volatility regime detector (intra-week safety net)

Hourly cron. Sole owner of the volatility_regimes.active flag. Reads four signals, OR-combined:

  1. Daily Brent move > 3% close-to-close (FRED DCOILBRENTEU, Phase 7).
  2. Most recent llm_overlays.major_impact_event = true AND at least one verified URL.
  3. station_prices daily churn rate > 1.5× its 30-day baseline.
  4. A watched_events row covering today (manually flagged geopolitical periods).

When the flag flips on:

  • An event-driven LLM refresh is queued (Layer 4) if last run was > 4h ago.
  • Layer 3's gate 3 fires: verdict forced to no_signal with the volatile badge.
  • The reasoning text is appended: "Volatility detected ({trigger}) — this forecast may be stale within days."

When it flips off:

  • Verdict returns to whatever the gates produce on the unchanged ridge_confidence (no multiplier to reset — there are none).
  • Badge cleared.
  • Next morning's 07:00 LLM call still runs (it always runs); no extra refreshes are queued.

Layer 5 never changes Layer 1's direction. It only suppresses the verdict via gate 3.


Methodology — Layer 1

Target

ΔULSP[t+1] = ULSP[t+1] − ULSP[t]

We model the change, not the level. UK pump prices are non-stationary, so regressing on levels gives spurious R² and useless coefficients. Differencing makes the series stationary.

Features (all stationary)

  • Δulsp_lag_0, Δulsp_lag_1, Δulsp_lag_3 — 1-week / 2-week / 4-week momentum
  • Δulsd_lag_0 — diesel cross-signal, as a change
  • ulsp[t] − ma8[t] — mean-reversion term: the gap between the current price and its 8-week moving average. The single most useful feature for a 1-week-ahead UK pump forecast.
  • week_of_year_sin, week_of_year_cos — cyclic seasonality encoding
  • is_pre_bank_holiday — boolean, within 7 days of a UK bank holiday

The level only enters as the deviation from MA-8 (itself stationary). That's the only way levels are allowed in.

Duty change is NOT a feature. With one event in 435 weeks, n=1 cannot fit a meaningful coefficient. Instead, duty-change-adjacent weeks (±4 weeks of a known change) are handled in the calibration override (see below) — confidence is halved and the regime flag is surfaced in the reasoning text. A regime can be flagged. A coefficient cannot be trained from one observation.
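
The feature construction above can be sketched with pandas (column names and the /52 seasonality divisor are illustrative choices, not the production implementation; rows with NaNs from differencing and the rolling window would be dropped before training):

```python
import numpy as np
import pandas as pd

def build_features(ulsp: pd.Series, ulsd: pd.Series,
                   week_of_year: pd.Series,
                   is_pre_bank_holiday: pd.Series) -> pd.DataFrame:
    """Stationary feature matrix; inputs are aligned weekly series."""
    d_ulsp, d_ulsd = ulsp.diff(), ulsd.diff()
    ma8 = ulsp.rolling(8).mean()
    return pd.DataFrame({
        "d_ulsp_lag_0": d_ulsp,            # 1-week momentum
        "d_ulsp_lag_1": d_ulsp.shift(1),   # 2-week momentum
        "d_ulsp_lag_3": d_ulsp.shift(3),   # 4-week momentum
        "d_ulsd_lag_0": d_ulsd,            # diesel cross-signal, as a change
        "ulsp_minus_ma8": ulsp - ma8,      # mean-reversion gap (the only use of the level)
        "woy_sin": np.sin(2 * np.pi * week_of_year / 52.0),
        "woy_cos": np.cos(2 * np.pi * week_of_year / 52.0),
        "pre_bank_holiday": is_pre_bank_holiday.astype(float),
    })
```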

Model

Ridge regression. Boring on purpose:

  • 435 weekly observations is too few to beat a well-specified linear model out-of-sample with gradient boosting or LSTM — those would just fit noise.
  • Interpretable coefficients are essential for the honesty layer (the reasoning text describes what the model used).

Upgrade to a non-linear model only if Phase 3 backtest demonstrates the linear model is missing real structure.

Training and evaluation split

  • Train on weeks 1–305 (~70%).
  • Evaluate on weeks 306–435 (~30%) with rolling-origin cross-validation (a single split would overfit hyperparameters to one window).
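
A minimal rolling-origin evaluation sketch, using a closed-form ridge fit so the idea is self-contained (the alpha=1.0 penalty, the 13-week refit step, and omitting the intercept are simplifying assumptions for illustration):

```python
import numpy as np

def ridge_fit(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Closed-form ridge (no intercept, for brevity): w = (X'X + aI)^-1 X'y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def rolling_origin_eval(X, y, start, step=13, flat_threshold=0.2):
    """Refit on everything before each origin, score the next `step` weeks,
    advance the origin. Returns (directional accuracy, MAE in pence)."""
    def classify(v):
        return "flat" if abs(v) < flat_threshold else ("rising" if v > 0 else "falling")
    hits, errors = [], []
    for origin in range(start, len(y), step):
        w = ridge_fit(X[:origin], y[:origin])
        preds = X[origin:origin + step] @ w
        actual = y[origin:origin + step]
        hits += [classify(p) == classify(a) for p, a in zip(preds, actual)]
        errors += [abs(p - a) for p, a in zip(preds, actual)]
    return sum(hits) / len(hits), float(np.mean(errors))
```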

Confidence calibration

Two-stage calibration:

  1. Magnitude binning — bin predictions by predicted |magnitude| and record actual hit rate per bin. The published confidence_score reads from this lookup, not from the model's raw output.
  2. Regime flag — flag any forecast week within ±4 weeks of a known duty change. With only one duty change in 435 weeks, statistical stratification at n=1 is impossible. Instead:
    • For flagged weeks, halve the calibrated confidence manually.
    • Surface the flag in the reasoning text: "Recent duty change — forecast accuracy is reduced for the next several weeks."

This is the only place v1 accepts a hand-tuned guard, and it's there because the data can't tell us better.
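
The two-stage calibration can be sketched as a binned lookup plus the duty-change halving (the bin edges shown are placeholders; Phase 3 would derive them from the backtest residual distribution):

```python
import math

def build_calibration_table(pred_magnitudes, directional_hits,
                            bin_edges=(0.0, 0.2, 0.5, 1.0, math.inf)):
    """Bin backtest predictions by |magnitude| and record the empirical
    directional hit rate per bin. Published confidence reads from this table."""
    table = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = [hit for mag, hit in zip(pred_magnitudes, directional_hits)
                  if lo <= abs(mag) < hi]
        rate = round(100 * sum(in_bin) / len(in_bin)) if in_bin else None
        table.append({"bin_low": lo, "bin_high": hi, "confidence": rate, "n": len(in_bin)})
    return table

def lookup_confidence(table, magnitude_pence, duty_change_adjacent=False):
    """Published confidence = table lookup; halved inside a duty-change window."""
    for row in table:
        if row["bin_low"] <= abs(magnitude_pence) < row["bin_high"]:
            conf = row["confidence"] if row["confidence"] is not None else 0
            return conf // 2 if duty_change_adjacent else conf
    return 0
```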


Methodology — Layer 2

Pure aggregates. No model.

-- Local 50km average
SELECT AVG(price_pence) FROM station_prices_current
JOIN stations ON station_prices_current.station_id = stations.node_id
WHERE fuel_type = ? AND <Haversine within 50km of (lat, lng)>;

-- National average
SELECT AVG(price_pence) FROM station_prices_current WHERE fuel_type = ?;

-- Cheapest within 25km
SELECT stations.*, station_prices_current.price_pence
FROM station_prices_current
JOIN stations ON station_prices_current.station_id = stations.node_id
WHERE fuel_type = ? AND <Haversine within 25km>
ORDER BY price_pence ASC LIMIT 5;

-- Supermarket vs major split, locally
SELECT stations.is_supermarket, AVG(price_pence)
FROM station_prices_current
JOIN stations ON station_prices_current.station_id = stations.node_id
WHERE fuel_type = ? AND <Haversine within 25km>
GROUP BY stations.is_supermarket;
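
The <Haversine within N km> placeholders stand for a great-circle distance predicate; the same formula inlines into SQL via RADIANS/SIN/COS/ASIN. A Python sketch of the distance itself:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance between two lat/lng points, in kilometres."""
    dlat, dlng = radians(lat2 - lat1), radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```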

Output is descriptive: "Your area is X p above national average right now", "Cheapest near you: {station} at {price}", "Supermarkets near you: {avg} vs majors: {avg}". Never predictive language.


Methodology — Layer 3

Full gate ordering is in the Architecture section (Layer 3). Summary:

  • Verdict via ordered rule gates, not multipliers.
  • ridge_confidence is displayed verbatim — never multiplied.
  • Volatility flag and LLM disagreement act as suppressors with badges (volatile, conflicting) but never flip direction.
  • direction == 'flat' always produces no_signal.
  • LLM disagreement only suppresses the verdict when ridge_confidence < 75. Above 75 the model's call is strong enough to stand even with a news-scan disagreement (the LLM is hard-capped at 75 confidence anyway, so it can't out-confidence the ridge model — only flag a tension).

Local position from Layer 2 modifies urgency wording only:

  • If the user's local average is materially above national (> 2p) and Layer 1 says "rising", urgency is increased ("fill up now, especially in your area").
  • Layer 2 never flips Layer 1's direction.

Methodology — Layer 4 (LLM news overlay)

Single scheduled call daily at 07:00 UK. Additional event-driven calls are queued by Layer 5 when the volatility flag flips ON, with a 4-hour cooldown enforced in code (skip the queue if the most recent llm_overlays.ran_at is within 4 hours).

Brent input (brent_recent_14_days) is optional — passed as null until Phase 7 backfills brent_prices. Phase 8 cannot ship before Phase 7 — explicit dependency.
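
The cooldown check is a one-liner; a sketch (function and parameter names are ours):

```python
from datetime import datetime, timedelta

def should_run_event_refresh(last_ran_at, now, cooldown_hours: int = 4) -> bool:
    """Skip the event-driven Layer 4 call if the most recent overlay
    ran within the cooldown window."""
    return last_ran_at is None or now - last_ran_at >= timedelta(hours=cooldown_hours)
```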

Request shape (JSON)

{
  "input": {
    "ulsp_recent_8_weeks": [...],
    "brent_recent_14_days": [...],
    "current_week_of_year": 18,
    "days_to_next_bank_holiday": 5,
    "duty_pence": 52.95,
    "ridge_model_says": {
      "direction": "down",
      "confidence": 68,
      "magnitude_pence": -0.4
    }
  },
  "ask": "Search recent news for oil-supply, OPEC, refinery, shipping, sanctions, geopolitical events affecting UK retail fuel prices over the next 1-2 weeks. Reply ONLY in the schema below."
}

Response shape (JSON, enforced)

{
  "direction": "rising | falling | flat",
  "confidence": 0,
  "reasoning_short": "1-2 sentences",
  "events_cited": [
    {"headline": "...", "source": "...", "url": "...", "impact": "rising|falling|neutral"}
  ],
  "agrees_with_ridge": true,
  "major_impact_event": false
}

Code-level guards (not in the prompt)

  1. Cap confidence at 75. Web-searched LLMs are systematically overconfident.
  2. Reject the response if events_cited is empty. Forces the LLM to ground its call in something checkable, not vibes.
  3. Verify each url in events_cited is reachable before storing. Catches hallucinated citations. Failed URLs blank the citation but don't reject the call (newer URLs sometimes 404 briefly).
  4. Layer 4 does NOT mutate volatility_regimes.active. It writes its row to llm_overlays (with major_impact_event + verified URLs) and that's it. Layer 5's hourly cron picks up the new row and decides whether to flip the flag.
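
Guards 1–3 can be sketched as a single post-processing step (Python for illustration; `url_is_reachable` stands in for whatever HTTP reachability check production uses):

```python
LLM_CONFIDENCE_CAP = 75

def apply_overlay_guards(resp: dict, url_is_reachable) -> dict:
    """Code-level guards on a Layer 4 response: reject uncited calls,
    cap confidence, blank unreachable citation URLs without rejecting."""
    if not resp.get("events_cited"):
        raise ValueError("overlay rejected: no cited events")
    resp["confidence"] = min(int(resp["confidence"]), LLM_CONFIDENCE_CAP)
    for event in resp["events_cited"]:
        if not url_is_reachable(event.get("url", "")):
            event["url"] = None  # blank, don't reject: fresh URLs sometimes 404 briefly
    return resp
```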

How Layer 3 uses it

  • LLM agrees → no gating effect; agrees badge shown next to the verdict ("News scan agrees, citing {event}").
  • LLM disagrees AND ridge_confidence < 75 → gate 4 fires: verdict forced to no_signal with the conflicting badge.
  • LLM disagrees AND ridge_confidence >= 75 → no suppression; the disagreement is shown as a badge but the model's strong call stands.
  • LLM neutral / flat → no gating effect.
  • Direction is never flipped by the LLM.

Methodology — Layer 5 (volatility regime detector)

Hourly cron. Sole owner of volatility_regimes.active. Reads four signals, OR-combined:

  1. Brent move — close-to-close daily Brent move > 3% on FRED DCOILBRENTEU. FRED publishes with a one-day lag (today's value is yesterday's settle), so the trigger reflects the most recent settled day. Sufficient for v1 — we don't have a real-time Brent feed.
  2. LLM major-impact flag — most recent llm_overlays row has major_impact_event = true AND at least one verified URL.
  3. Station churn — gated until ≥180 days of stable polling. The trigger fires when the last-24h % of stations updating price exceeds 1.5× the 30-day rolling baseline. With only 75 days of uneven polling (Jan 16 → May 1) the baseline is meaningless — sample-mix variance would dominate any real shock signal. The trigger is implemented but disabled in code via a feature flag; flip it on once station_prices has 180+ continuous days.
  4. Manual watched_events — a row covering today. Lets you flag known geopolitical periods manually (e.g. "Iran tensions Apr–May 2026").
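
A minimal sketch of the OR-combination (Python for illustration; treating a > 3% Brent move in either direction as a trigger is our assumption, since a sharp fall also stales the forecast):

```python
def volatility_triggers(brent_move_pct: float, llm_major_event_with_url: bool,
                        churn_ratio: float, churn_enabled: bool,
                        watched_event_today: bool):
    """OR-combine the four Layer 5 signals, in priority order.
    Returns (active, trigger_label) matching the volatility_regimes enum."""
    if abs(brent_move_pct) > 3.0:
        return True, "brent_move"
    if llm_major_event_with_url:
        return True, "llm_event"
    if churn_enabled and churn_ratio > 1.5:     # feature-flagged until 180d baseline
        return True, "station_churn"
    if watched_event_today:
        return True, "manual"
    return False, None
```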

When the flag flips on:

  • An event-driven Layer 4 LLM refresh is queued (skipped if the most recent llm_overlays.ran_at is within 4 hours — cooldown).
  • Layer 3's gate 3 fires: verdict forced to no_signal with the volatile badge for as long as the flag stays on.
  • Reasoning text appended: "Volatility detected ({trigger label}) — this forecast may be stale within days."

When it flips off:

  • Verdict returns to whatever the gates produce on the unchanged ridge_confidence (no multiplier reset needed — there are no multipliers).
  • Badge cleared.
  • The next morning's 07:00 LLM call still runs (always does); no extra refreshes are queued by Layer 5.

Schema deltas

Add

weekly_forecasts
  id                   BIGINT PK
  forecast_for         DATE                — Monday the forecast covers
  model_version        VARCHAR(32)         — links back to backtests row
  direction            ENUM('rising','falling','flat')
  magnitude_pence      SMALLINT            — predicted Δ × 100, signed
  ridge_confidence     TINYINT UNSIGNED    — 0..100, calibrated from backtested residuals. Displayed verbatim. Layer 3 gates may suppress the verdict but never modify this number.
  flagged_duty_change  BOOLEAN             — true if forecast is within ±4 weeks of a duty change (avoids collision with Layer 5's volatility_regimes)
  reasoning            TEXT                — generated from features actually used
  generated_at         DATETIME
  UNIQUE (forecast_for, model_version)
  INDEX  (forecast_for, generated_at DESC)

forecast_outcomes
  forecast_for      DATE
  model_version     VARCHAR(32)
  predicted_class   ENUM('rising','falling','flat')
  actual_class      ENUM('rising','falling','flat')
  correct           BOOLEAN
  abs_error_pence   SMALLINT UNSIGNED
  resolved_at       DATETIME
  PRIMARY KEY (forecast_for, model_version)

backtests
  id                 BIGINT PK
  model_version      VARCHAR(32) UNIQUE
  features_json      JSON                — feature spec
  train_start        DATE
  train_end          DATE
  eval_start         DATE
  eval_end           DATE
  directional_accuracy DECIMAL(5,2)
  mae_pence          DECIMAL(5,2)
  calibration_table  JSON                — {bin_low..bin_high → empirical_hit_rate}
  leak_suspected     BOOLEAN             — secondary smell test: true if directional_accuracy > 75. Primary leak detection is structural (see Backtest section).
  ran_at             DATETIME

llm_overlays
  id                  BIGINT PK
  ran_at              DATETIME
  forecast_for_week   DATE                — which weekly forecast it overlays
  direction           ENUM('rising','falling','flat')
  confidence          TINYINT UNSIGNED    — capped 75 in code
  reasoning           TEXT
  events_json         JSON                — cited events with verified URLs
  agrees_with_ridge   BOOLEAN
  major_impact_event  BOOLEAN
  volatility_flag_on  BOOLEAN             — was the regime flag on at run time
  search_used         BOOLEAN
  INDEX (forecast_for_week, ran_at)

volatility_regimes
  id                  BIGINT PK
  flipped_on_at       DATETIME
  flipped_off_at      DATETIME NULL
  trigger             ENUM('brent_move','llm_event','station_churn','manual')
  trigger_detail      TEXT                — e.g. "Brent +4.2% close-to-close"
  active              BOOLEAN

watched_events
  id                  BIGINT PK
  label               VARCHAR(128)
  starts_at           DATETIME
  ends_at             DATETIME
  notes               TEXT

Keep

  • weekly_pump_prices — already loaded, source of truth
  • stations, station_prices_current — for Layer 2
  • station_prices — keep collecting forward, not modelled in v1

Deprecate (delete after Layer 1 ships)

  • price_predictions — old LLM/EWMA store, replaced by weekly_forecasts

The current six-signal aggregator (NationalFuelPredictionService and app/Services/Prediction/Signals/*) is fully replaced, not extended. Same JSON output keys (predicted_direction, confidence_score, action, reasoning) so the Vue frontend doesn't break — engine swapped, contract preserved.


Implementation phases (each ships something working)

1. Backtest harness — BacktestRunner service + backtests table. Takes a model class and a train/eval split; returns directional accuracy, MAE, and a calibration curve. Structural leak detection built in (per-feature source-timestamp check vs the target Monday); the accuracy > 75% smell test is secondary. Ships: a way to prove any future model works before shipping it.
2. Naive baseline — "Predict next week = this week" implemented as a model class and run through the harness. Ships: a floor any future model must beat.
3. v1 ridge model — features above (incl. the mean-reversion term), trained once, persisted with model_version. WeeklyForecastService runs it; the backtest must clear the acceptance gate. Ships: the first real forecast, with backtested numbers visible.
4. Live wiring — replace NationalFuelPredictionService internals with a thin adapter delegating to WeeklyForecastService. Same API shape, new engine. Ships: frontend keeps working, predictions now come from the new model.
5. Local snapshot — LocalSnapshotService, pure aggregates, wired into the /api/stations payload alongside the headline forecast. Ships: "your area" descriptive cards.
6. Honesty layer — reasoning generator describes what the model used: lag values, season, holiday flag. Shows the backtest accuracy badge, returns an explicit "not enough data" when confidence < 40, and surfaces the duty-change-adjacent flag when set. Ships: the "no BS" framing.
7. Brent backfill + daily refresh — one FRED call (2018 → today, ~2,150 daily rows), then a daily refresh cron at 06:30 UK (must complete before Phase 8's 07:00 LLM call, so the LLM has fresh Brent context). Used by Phase 9's volatility detector and as a feature option for future model iterations (only added to the ridge model if backtested lift is ≥3 percentage points of directional accuracy). Ships: daily Brent in the DB — the foundation for volatility detection and LLM context.
8. LLM news overlay — LlmOverlayService: a single scheduled call at 07:00 UK (after the Brent refresh), plus event-driven calls when Layer 5 flips the volatility flag on, with a 4h cooldown. JSON in / JSON out, web search enabled, results stored in llm_overlays. Feeds Layer 3's gate 4 (suppress when the LLM disagrees AND ridge_confidence < 75) and the agrees/conflicting badges; URL verification and empty-citation rejection enforced in code. Depends on Phase 7. Ships: news-aware verdict suppression and badges on top of the calibrated ridge baseline.
9. Volatility regime detector — VolatilityRegimeService: hourly cron, sole owner of volatility_regimes.active. OR-combines four triggers: Brent move > 3%, LLM major_impact_event, station churn > 1.5× baseline (gated until ≥180 days of stable polling), and a watched_events row covering today. Fires Layer 3's gate 3 (verdict → no_signal with the volatile badge) and the event-driven Layer 4 refresh. Ships: the intra-week safety net for oil shocks.

Backtest acceptance gates (Phase 3 → Phase 4)

  • < 60% directional accuracy — features are wrong. Stay in Phase 3; don't ship.
  • 60–62% — marginal. One feature iteration, then re-evaluate.
  • 62–68% — ship. Realistic target for UK weekly pump direction without Brent.
  • 68–75% — excellent. Ship and watch closely.
  • > 75% — stop and run the structural leak detector: almost certainly time leakage (e.g. t+1 info accidentally used in t features). This accuracy threshold is a secondary smell test, not the primary detector.
  • MAE > 1.0p/litre — features are noisy; refit before shipping. Target MAE: 0.4–0.7p/litre.

Structural leak detection (primary)

Built into the backtest harness. For every (training_week, feature_value) pair, the harness verifies the data source's effective timestamp is strictly before the target Monday. Any feature whose source timestamp is on or after the target week is treated as leakage and the backtest fails fast. This is independent of accuracy — it catches leakage even when it doesn't translate into suspiciously high accuracy.

The > 75% accuracy row is a secondary smell test for leakage modes the structural check missed (e.g. label leakage via a downstream computed column). Primary defence is the timestamp check. These numbers are encoded in the harness as assertions, not aspirations.
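
The structural check reduces to a per-feature timestamp assertion; a sketch (names are illustrative):

```python
from datetime import date

def assert_no_leakage(feature_timestamps: dict, target_monday: date) -> None:
    """Structural leak check: every feature's source timestamp must be
    strictly before the Monday being forecast. Fails fast on first violation."""
    for name, ts in feature_timestamps.items():
        if ts >= target_monday:
            raise AssertionError(
                f"leakage: feature {name!r} sourced at {ts}, "
                f"on/after target week {target_monday}")
```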


Honesty rules — non-negotiables

  1. Backtest accuracy is published in the UI. The model wears its track record on its sleeve.
  2. Below 40 confidence, the recommendation is no_signal and the reasoning says "we don't have enough signal to call it" — explicitly. No filler.
  3. When duty-change-adjacent weeks affect the forecast, surface the flag ("forecast may be skewed by recent duty change").
  4. Reasoning text only references features the model actually used — no narrative invention. If the mean-reversion term drove the call, say so ("Pump prices are 3.1p above their 8-week average, and prices typically pull back from that level"). If the seasonality term drove it, say so.
  5. forecast_outcomes is populated automatically when the next BEIS week lands. Hit rate over the trailing 13 weeks is shown next to the headline.
  6. When the volatility regime flag is on, the UI shows the volatile badge and the trigger (e.g. "Brent up 4.2% yesterday — forecast may be stale within days"). Verdict is suppressed visibly via gate 3, never silently.
  7. The LLM overlay is shown separately from the ridge model, never blended. "Model says down (68%); news scan agrees, citing {event}" — the ridge_confidence number stays calibrated and untouched, while LLM and volatility status are presented as their own badges.
  8. LLM citations with unreachable URLs are dropped from the displayed reasoning but kept in llm_overlays.events_json for audit. We never show a citation we haven't verified.

What gets deleted at the end of Phase 4

  • app/Services/Prediction/Signals/* (whole directory)
  • NationalFuelPredictionService internals (kept as a thin wrapper, then renamed when the frontend migration completes)
  • price_predictions table — replaced by weekly_forecasts (ridge) + llm_overlays (news layer)
  • OilPriceService::generatePrediction(), EWMA/LLM helpers — replaced by LlmOverlayService (Phase 8) which has a different contract
  • OilPriceService::fetchBrentPrices() — kept and expanded in Phase 7 (backfill mode + daily refresh), not deleted
  • .claude/rules/scoring.md retired in favour of a fresh .claude/rules/forecasting.md
  • .claude/rules/prediction.md rewritten to match the new architecture

Open decisions (to confirm before Phase 1)

  • Forecast cadence — the forecast itself is weekly (matches BEIS publication). The confidence and presentation update daily via Layer 4 (LLM) and Layer 5 (volatility regime). This split is deliberate — we refuse to fabricate intra-week movement, but we don't pretend a static Monday number is reliable on Thursday after a 6% Brent move.
  • Scope — drop the six-signal aggregator entirely, confirmed.
  • API shape — keep existing JSON output keys so Vue keeps working, with the engine swapped under the hood. The original confidence_score field maps to ridge_confidence (calibrated, untouched). Add new fields: volatility ({active, trigger}), news_overlay ({direction, agreement, events}), and verdict_reason (which gate fired, if any). The verdict itself goes in the existing action field.
  • Brent — promoted to Phase 7 (was "optional, conditional"). Needed for the volatility detector, regardless of whether it's used in the ridge model.
  • LLM — Anthropic Claude Haiku with web search. Single scheduled call at 07:00 UK (after the 06:30 Brent refresh). Plus event-driven refreshes when Layer 5 flips the volatility flag on, with a 4h cooldown. No fixed afternoon cron — by 13:00 UK, morning users have already made their fill-up decisions, so the value is too low to justify the extra noise. Hard confidence cap 75. Empty-citation rejection.

Changelog (substantive design decisions)

  • 2026-05-01, v1 — Initial spec: three layers, six-signal aggregator removed, ridge model on BEIS weekly data. Why: replace the incoherent NationalFuelPredictionService.
  • 2026-05-01, v2 — Added Layer 4 (LLM news overlay) and Layer 5 (volatility regime detector): pump prices can move daily during oil shocks, so the static weekly forecast must be backed by intra-week safety nets. Why: Iran/Hormuz-style shocks make a Monday-only confidence number stale by Wednesday.
  • 2026-05-01, v3 — Verdict via rule gates, not multipliers; ridge_confidence displayed verbatim; LLM and volatility presented as badges. weeks_since_duty_change removed from the features (kept as a calibration override only — n=1 can't fit a coefficient). Backtest gate floor lowered 65 → 62 (realistic without Brent). Structural leak detection (per-feature timestamp check) made primary; the accuracy > 75% check demoted to a secondary smell test. weekly_forecasts PK changed to (forecast_for, model_version) to preserve audit history on retrain; forecast_outcomes made three-class; Layer 5 station-churn trigger gated until ≥180 days of stable polling. Why: multipliers obscure calibration; gates compose cleanly and stay auditable.

References

  • Alquist, Kilian, Vigfusson (2013) — Forecasting the Price of Oil — the academic basis for "no-change baseline beats most structural models at <6m horizons" (which is why Phase 2 matters as a hard floor).
  • BEIS Weekly road fuel prices CSV — the 435-week training set.
  • .claude/rules/scoring.md, .claude/rules/prediction.md — the two inconsistent rule files this spec replaces.