Documents complete replacement of six-signal aggregator with calibrated ridge forecaster trained on 435 weeks of BEIS pump prices. Five-layer architecture: weekly baseline (Layer 1), local snapshot (Layer 2), rule-gated verdict merger (Layer 3), daily LLM news
29 KiB
Prediction Rebuild — Design Spec
Context
The current prediction service (NationalFuelPredictionService + six signal
classes) produces output the user has repeatedly described as "doesn't make
sense": headlines that contradict their own reasoning text, weights that
nobody can defend a number on, and confidence values that aren't grounded in
any track record. Two earlier docs (.claude/rules/scoring.md, .claude/rules/prediction.md)
disagree on the weights of the same signals, which is itself evidence that
the design has drifted.
This spec replaces the entire prediction stack from scratch around the historical data we actually have, with a model whose confidence values are calibrated against its own backtested track record.
Goals:
- A "fill up now or wait?" call honest about uncertainty.
- Confidence values calibrated against backtested residuals — "70%" actually means "in 7 of every 10 cases like this, the model called direction right".
- Simple enough to debug a year from now.
- Remove the six-signal aggregator entirely.
- Recognise that pump prices, while measured weekly by BEIS, can move daily during oil shocks (Iran, OPEC surprise cuts, Hormuz disruption). The static weekly forecast must be backed by a daily news/event overlay so we can flag staleness in real time rather than pretend a Monday number is still valid on Thursday after a 6% Brent move.
Inputs (audited 2026-05-01)
| Source | Status | Use in v1 |
|---|---|---|
weekly_pump_prices |
435 weeks, all Mondays, 0 outliers, 1 duty change (Mar 2022, 57.95p → 52.95p), VAT stable at 20% | Foundation — train Layer 1 |
station_prices_current |
~7,550 stations × e10, ~7,620 × b7_standard | Layer 2 — descriptive snapshot |
stations |
7,747 stations, 1,989 supermarkets, lat/lng | Layer 2 |
station_prices |
75 days of changes since 2026-01-16, sample mix uneven per day | Not modelled in v1, but used by the volatility regime detector as a churn indicator (% stations changing price / day vs 30-day baseline). |
brent_prices |
30 days only | Backfilled in Phase 7 (8 years from FRED, single API call). Used as a Brent-move volatility trigger and as fuel for the daily LLM overlay. |
The Fuel Finder API has been confirmed empirically to have no historical
archive — effective-start-timestamp is a station-level filter on current
prices, not a time-window query. Per-station deep history can only accrue
forward from the date polling started.
Architecture — five thin layers
Layer 1 — National weekly forecaster (predictive, calibrated)
Trained once weekly on weekly_pump_prices. Output:
direction ∈ {rising, falling, flat}magnitude_pence— predicted Δ price next weekridge_confidence(0–100) — calibrated from backtested residuals, not from the model's raw output
This is the quantitative baseline. It updates only when the BEIS Monday publication arrives (so the forecast itself changes weekly), but its displayed confidence (Layer 3) is adjusted in real time by Layers 4 and 5.
direction = flat whenever |magnitude_pence| < FLAT_THRESHOLD. Phase 3
picks FLAT_THRESHOLD from the backtest residual distribution; the
starting value is 0.2p / litre.
Layer 2 — Local snapshot (descriptive, NOT predictive)
Pure SQL aggregates against station_prices_current + Haversine on
stations.lat/lng. No ML, no history, no surprises:
local_avg_50km(fuel_type, lat, lng)national_avg(fuel_type)cheapest_within(km, fuel_type, lat, lng)supermarket_avg_local,major_avg_local, gap
Layer 2 never speaks about the future. It describes the present.
Layer 3 — Verdict merger (rule-based gates, no multipliers)
Single user-facing verdict ∈ {fill_now, wait, no_signal}. The
displayed confidence number is ridge_confidence itself, untouched.
LLM agreement and volatility status are shown as separate badges, not
blended into the number. Honesty over smoothing.
Gates evaluated in order, first match wins:
1. direction == 'flat' → no_signal
2. ridge_confidence < 40 → no_signal
3. volatility_regime active → no_signal (badge: volatile)
4. LLM disagrees AND ridge_confidence < 75 → no_signal (badge: conflicting)
5. rising AND ridge_confidence >= 70 → fill_now
6. falling AND ridge_confidence >= 70 → wait
7. otherwise (40 <= conf < 70, no veto from 3 or 4) → dashboard-only
Why gates, not multipliers:
- A multiplied confidence number is a black-box blend that the user can't audit. A 70% that used to be 90% before today's volatility hit looks identical to a 70% that's been calibrated all along.
- Gates compose cleanly. Each rule has one job and is independently testable.
- The verdict is binary anyway (notify / don't / silent). Smoothing confidence under the hood doesn't help that decision — it only obscures it.
Layer 2 affects urgency wording only ("fill up now, especially in your area at 2p above national"). It never changes the verdict. Neither does Layer 4 or Layer 5 — they can suppress (gate 3, 4) but never flip the direction.
Layer 4 — Daily LLM news overlay (qualitative, news-aware)
Single scheduled call at 07:00 UK. Plus an event-driven refresh when Layer 5's volatility flag flips ON (with a 4-hour cooldown so the same event doesn't trigger repeatedly).
JSON in, JSON out. Calls Claude Haiku with web search enabled, asks for
direction + confidence + cited events with URLs. Stored in a new
llm_overlays table.
Layer 4 is read-only with respect to the volatility flag. It writes
its result row; only Layer 5 mutates volatility_regimes.active.
LLM confidence is hard-capped at 75 in code (web-searched LLMs are
systematically overconfident). Calls without events_cited are rejected.
Layer 5 — Volatility regime detector (intra-week safety net)
Hourly cron. Sole owner of the volatility_regimes.active flag.
Reads four signals, OR-combined:
- Daily Brent move > 3% close-to-close (FRED
DCOILBRENTEU, Phase 7). - Most recent
llm_overlays.major_impact_event = trueAND at least one verified URL. station_pricesdaily churn rate > 1.5× its 30-day baseline.- A
watched_eventsrow covering today (manually flagged geopolitical periods).
When the flag flips on:
- An event-driven LLM refresh is queued (Layer 4) if last run was > 4h ago.
- Layer 3's gate 3 fires: verdict forced to
no_signalwith thevolatilebadge. - The reasoning text appended: "Volatility detected ({trigger}) — this forecast may be stale within days."
When it flips off:
- Verdict returns to whatever the gates produce on the unchanged
ridge_confidence(no multiplier to reset — there are none). - Badge cleared.
- Next morning's 07:00 LLM call still runs (it always runs); no extra refreshes are queued.
Layer 5 never changes Layer 1's direction. It only suppresses the verdict via gate 3.
Methodology — Layer 1
Target
ΔULSP[t+1] = ULSP[t+1] − ULSP[t]
We model the change, not the level. UK pump prices are non-stationary, so regressing on levels gives spurious R² and useless coefficients. Differencing makes the series stationary.
Features (all stationary)
| Feature | Notes |
|---|---|
Δulsp_lag_0, Δulsp_lag_1, Δulsp_lag_3 |
1w / 2w / 4w momentum |
Δulsd_lag_0 |
Diesel cross-signal as a change |
ulsp[t] − ma8[t] |
Mean-reversion term — gap between current price and 8-week MA. Single most useful feature for 1-week-ahead UK pump forecast. |
week_of_year_sin, week_of_year_cos |
Cyclic seasonality encoding |
is_pre_bank_holiday |
Boolean, within 7 days of UK bank holiday |
The level only enters as the deviation from MA-8 (itself stationary). That's the only way levels are allowed in.
Duty change is NOT a feature. With one event in 435 weeks, n=1 cannot fit a meaningful coefficient. Instead, duty-change-adjacent weeks (±4 weeks of a known change) are handled in the calibration override (see below) — confidence is halved and the regime flag is surfaced in the reasoning text. A regime can be flagged. A coefficient cannot be trained from one observation.
Model
Ridge regression. Boring on purpose:
- 435 weekly observations is too few to beat a well-specified linear model out-of-sample with gradient boosting or LSTM — those would just fit noise.
- Interpretable coefficients are essential for the honesty layer (the reasoning text describes what the model used).
Upgrade to a non-linear model only if Phase 3 backtest demonstrates the linear model is missing real structure.
Training and evaluation split
- Train on weeks 1–305 (~70%).
- Evaluate on weeks 306–435 (~30%) with rolling-origin cross-validation (single-split would overfit hyperparameters to one window).
Confidence calibration
Two-stage calibration:
- Magnitude binning — bin predictions by predicted
|magnitude|and record actual hit rate per bin. The publishedconfidence_scorereads from this lookup, not from the model's raw output. - Regime flag — flag any forecast week within ±4 weeks of a known
duty change. With only one duty change in 435 weeks, statistical
stratification at n=1 is impossible. Instead:
- For flagged weeks, halve the calibrated confidence manually.
- Surface the flag in the reasoning text: "Recent duty change — forecast accuracy is reduced for the next several weeks."
This is the only place v1 accepts a hand-tuned guard, and it's there because the data can't tell us better.
Methodology — Layer 2
Pure aggregates. No model.
-- Local 50km average
SELECT AVG(price_pence) FROM station_prices_current
JOIN stations ON station_prices_current.station_id = stations.node_id
WHERE fuel_type = ? AND <Haversine within 50km of (lat, lng)>;
-- National average
SELECT AVG(price_pence) FROM station_prices_current WHERE fuel_type = ?;
-- Cheapest within 25km
SELECT stations.*, station_prices_current.price_pence
FROM station_prices_current
JOIN stations ON station_prices_current.station_id = stations.node_id
WHERE fuel_type = ? AND <Haversine within 25km>
ORDER BY price_pence ASC LIMIT 5;
-- Supermarket vs major split, locally
SELECT stations.is_supermarket, AVG(price_pence)
FROM station_prices_current
JOIN stations ON station_prices_current.station_id = stations.node_id
WHERE fuel_type = ? AND <Haversine within 25km>
GROUP BY stations.is_supermarket;
Output is descriptive: "Your area is X p above national average right now", "Cheapest near you: {station} at {price}", "Supermarkets near you: {avg} vs majors: {avg}". Never predictive language.
Methodology — Layer 3
Full gate ordering is in the Architecture section (Layer 3). Summary:
- Verdict via ordered rule gates, not multipliers.
ridge_confidenceis displayed verbatim — never multiplied.- Volatility flag and LLM disagreement act as suppressors with badges
(
volatile,conflicting) but never flip direction. direction == 'flat'always producesno_signal.- LLM disagreement only suppresses the verdict when
ridge_confidence < 75. Above 75 the model's call is strong enough to stand even with a news-scan disagreement (the LLM is hard-capped at 75 confidence anyway, so it can't out-confidence the ridge model — only flag a tension).
Local position from Layer 2 modifies urgency wording only:
- If user's local average is materially above national (>2p), and Layer 1 says "rising", urgency increased ("fill up now, especially in your area").
- Layer 2 never flips Layer 1's direction.
Methodology — Layer 4 (LLM news overlay)
Single scheduled call daily at 07:00 UK. Additional event-driven calls
are queued by Layer 5 when the volatility flag flips ON, with a 4-hour
cooldown enforced in code (skip the queue if the most recent
llm_overlays.ran_at is within 4 hours).
Brent input (brent_recent_14_days) is optional — passed as null
until Phase 7 backfills brent_prices. Phase 8 cannot ship before
Phase 7 — explicit dependency.
Request shape (JSON)
{
"input": {
"ulsp_recent_8_weeks": [...],
"brent_recent_14_days": [...],
"current_week_of_year": 18,
"days_to_next_bank_holiday": 5,
"duty_pence": 52.95,
"ridge_model_says": {
"direction": "down",
"confidence": 68,
"magnitude_pence": -0.4
}
},
"ask": "Search recent news for oil-supply, OPEC, refinery, shipping, sanctions, geopolitical events affecting UK retail fuel prices over the next 1-2 weeks. Reply ONLY in the schema below."
}
Response shape (JSON, enforced)
{
"direction": "rising | falling | flat",
"confidence": 0,
"reasoning_short": "1-2 sentences",
"events_cited": [
{"headline": "...", "source": "...", "url": "...", "impact": "rising|falling|neutral"}
],
"agrees_with_ridge": true,
"major_impact_event": false
}
Code-level guards (not in the prompt)
- Cap
confidenceat 75. Web-searched LLMs are systematically overconfident. - Reject the response if
events_citedis empty. Forces the LLM to ground its call in something checkable, not vibes. - Verify each
urlinevents_citedis reachable before storing. Catches hallucinated citations. Failed URLs blank the citation but don't reject the call (newer URLs sometimes 404 briefly). - Layer 4 does NOT mutate
volatility_regimes.active. It writes its row tollm_overlays(withmajor_impact_event+ verified URLs) and that's it. Layer 5's hourly cron picks up the new row and decides whether to flip the flag.
How Layer 3 uses it
- LLM agrees → no gating effect;
agreesbadge shown next to the verdict ("News scan agrees, citing {event}"). - LLM disagrees AND
ridge_confidence < 75→ gate 4 fires: verdict forced tono_signalwith theconflictingbadge. - LLM disagrees AND
ridge_confidence >= 75→ no suppression; the disagreement is shown as a badge but the model's strong call stands. - LLM neutral / flat → no gating effect.
- Direction is never flipped by the LLM.
Methodology — Layer 5 (volatility regime detector)
Hourly cron. Sole owner of volatility_regimes.active. Reads four
signals, OR-combined:
- Brent move — close-to-close daily Brent move > 3% on FRED
DCOILBRENTEU. FRED publishes with a one-day lag (today's value is yesterday's settle), so the trigger reflects the most recent settled day. Sufficient for v1 — we don't have a real-time Brent feed. - LLM major-impact flag — most recent
llm_overlaysrow hasmajor_impact_event = trueAND at least one verified URL. - Station churn — gated until ≥180 days of stable polling. The
trigger fires when the last-24h % of stations updating price exceeds
1.5× the 30-day rolling baseline. With only 75 days of uneven polling
(Jan 16 → May 1) the baseline is meaningless — sample-mix variance
would dominate any real shock signal. The trigger is implemented but
disabled in code via a feature flag; flip it on once
station_priceshas 180+ continuous days. - Manual
watched_events— a row covering today. Lets you flag known geopolitical periods manually (e.g. "Iran tensions Apr–May 2026").
When the flag flips on:
- An event-driven Layer 4 LLM refresh is queued (skipped if the most
recent
llm_overlays.ran_atis within 4 hours — cooldown). - Layer 3's gate 3 fires: verdict forced to
no_signalwith thevolatilebadge for as long as the flag stays on. - Reasoning text appended: "Volatility detected ({trigger label}) — this forecast may be stale within days."
When it flips off:
- Verdict returns to whatever the gates produce on the unchanged
ridge_confidence(no multiplier reset needed — there are no multipliers). - Badge cleared.
- The next morning's 07:00 LLM call still runs (always does); no extra refreshes are queued by Layer 5.
Schema deltas
Add
weekly_forecasts
id BIGINT PK
forecast_for DATE — Monday the forecast covers
model_version VARCHAR(32) — links back to backtests row
direction ENUM('rising','falling','flat')
magnitude_pence SMALLINT — predicted Δ × 100, signed
ridge_confidence TINYINT UNSIGNED — 0..100, calibrated from backtested residuals. Displayed verbatim. Layer 3 gates may suppress the verdict but never modify this number.
flagged_duty_change BOOLEAN — true if forecast is within ±4 weeks of a duty change (avoids collision with Layer 5's volatility_regimes)
reasoning TEXT — generated from features actually used
generated_at DATETIME
UNIQUE (forecast_for, model_version)
INDEX (forecast_for, generated_at DESC)
forecast_outcomes
forecast_for DATE
model_version VARCHAR(32)
predicted_class ENUM('rising','falling','flat')
actual_class ENUM('rising','falling','flat')
correct BOOLEAN
abs_error_pence SMALLINT UNSIGNED
resolved_at DATETIME
PRIMARY KEY (forecast_for, model_version)
backtests
id BIGINT PK
model_version VARCHAR(32) UNIQUE
features_json JSON — feature spec
train_start DATE
train_end DATE
eval_start DATE
eval_end DATE
directional_accuracy DECIMAL(5,2)
mae_pence DECIMAL(5,2)
calibration_table JSON — {bin_low..bin_high → empirical_hit_rate}
leak_suspected BOOLEAN — secondary smell test: true if directional_accuracy > 75. Primary leak detection is structural (see Backtest section).
ran_at DATETIME
llm_overlays
id BIGINT PK
ran_at DATETIME
forecast_for_week DATE — which weekly forecast it overlays
direction ENUM('rising','falling','flat')
confidence TINYINT UNSIGNED — capped 75 in code
reasoning TEXT
events_json JSON — cited events with verified URLs
agrees_with_ridge BOOLEAN
major_impact_event BOOLEAN
volatility_flag_on BOOLEAN — was the regime flag on at run time
search_used BOOLEAN
INDEX (forecast_for_week, ran_at)
volatility_regimes
id BIGINT PK
flipped_on_at DATETIME
flipped_off_at DATETIME NULL
trigger ENUM('brent_move','llm_event','station_churn','manual')
trigger_detail TEXT — e.g. "Brent +4.2% close-to-close"
active BOOLEAN
watched_events
id BIGINT PK
label VARCHAR(128)
starts_at DATETIME
ends_at DATETIME
notes TEXT
Keep
weekly_pump_prices— already loaded, source of truthstations,station_prices_current— for Layer 2station_prices— keep collecting forward, not modelled in v1
Deprecate (delete after Layer 1 ships)
price_predictions— old LLM/EWMA store, replaced byweekly_forecasts
The current six-signal aggregator (NationalFuelPredictionService and
app/Services/Prediction/Signals/*) is fully replaced, not extended.
Same JSON output keys (predicted_direction, confidence_score,
action, reasoning) so the Vue frontend doesn't break — engine swapped,
contract preserved.
Implementation phases (each ships something working)
| Phase | Scope | Ships |
|---|---|---|
| 1. Backtest harness | BacktestRunner service + backtests table. Takes a model class, train/eval split, returns directional accuracy + MAE + calibration curve. Structural leak detection built in (per-feature source-timestamp check vs target Monday); accuracy>75% smell test as secondary. |
A way to prove any future model works before shipping it. |
| 2. Naive baseline | "Predict next week = this week" implemented as a model class. Run through harness. | A floor: any future model must beat this. |
| 3. v1 ridge model | Features above (incl. mean-reversion term), trained once, persisted with model_version. WeeklyForecastService runs it. Backtest must clear the acceptance gate. |
First real forecast. Backtested numbers visible. |
| 4. Live wiring | Replace NationalFuelPredictionService internals with a thin adapter delegating to WeeklyForecastService. Same API shape, new engine. |
Frontend keeps working, predictions now from the new model. |
| 5. Local snapshot | LocalSnapshotService — pure aggregates. Wire into /api/stations payload alongside the headline forecast. |
"Your area" descriptive cards. |
| 6. Honesty layer | Reasoning generator describes what the model used: lag values, season, holiday flag. Shows backtest accuracy badge. Returns explicit "not enough data" when confidence < 40. Surfaces the duty-change-adjacent flag when set. | The "no BS" framing. |
| 7. Brent backfill + daily refresh | One FRED call (2018→today, ~2,150 daily rows). Daily refresh cron at 06:30 UK (must complete before Phase 8's 07:00 LLM call — sequenced so the LLM has fresh Brent context). Used by Phase 9's volatility detector and as a feature option for future model iterations (only added to the ridge model if backtested lift is ≥3 percentage points directional accuracy). | Daily Brent in DB. Foundation for volatility + LLM context. |
| 8. LLM news overlay | LlmOverlayService — single scheduled call at 07:00 UK (after Brent refresh). Plus event-driven calls when Layer 5 flips the volatility flag on, with 4h cooldown. JSON in / JSON out, web search enabled, results stored in llm_overlays. Feeds Layer 3's gate 4 (suppress when LLM disagrees AND ridge_confidence < 75) and the agrees/conflicting badges. URL-verification + empty-citation rejection enforced in code. Depends on Phase 7. |
News-aware verdict suppression and badge on top of the calibrated ridge baseline. |
| 9. Volatility regime detector | VolatilityRegimeService — hourly cron, sole owner of volatility_regimes.active. OR-combines four triggers: Brent move > 3%, LLM major_impact_event, station churn > 1.5× baseline (gated until ≥180 days of stable polling), watched_events row covering today. Fires Layer 3's gate 3 (verdict → no_signal with volatile badge) and the event-driven Layer 4 refresh. |
The intra-week safety net for oil shocks. |
Backtest acceptance gates (Phase 3 → Phase 4)
| Backtest result | Action |
|---|---|
| < 60% directional accuracy | Features are wrong. Stay in Phase 3, don't ship. |
| 60–62% | Marginal. One feature iteration, then re-evaluate. |
| 62–68% | Ship. Realistic target for UK weekly pump direction without Brent. |
| 68–75% | Excellent. Ship and watch closely. |
| > 75% | Stop. Run the structural leak detector. Almost certainly time leakage (e.g. using t+1 info accidentally in t features). The accuracy threshold is a secondary smell test, not the primary detector. |
| MAE > 1.0p / litre | Features are noisy. Refit before shipping. |
| Target MAE | 0.4–0.7p / litre. |
Structural leak detection (primary)
Built into the backtest harness. For every (training_week, feature_value) pair, the harness verifies the data source's effective timestamp is strictly before the target Monday. Any feature whose source timestamp is on or after the target week is treated as leakage and the backtest fails fast. This is independent of accuracy — it catches leakage even when it doesn't translate into suspiciously high accuracy.
The > 75% accuracy row is a secondary smell test for leakage modes the
structural check missed (e.g. label leakage via a downstream computed
column). Primary defence is the timestamp check. These numbers are
encoded in the harness as assertions, not aspirations.
Honesty rules — non-negotiables
- Backtest accuracy is published in the UI. The model wears its track record on its sleeve.
- Below 40 confidence, the recommendation is
no_signaland the reasoning says "we don't have enough signal to call it" — explicitly. No filler. - When duty-change-adjacent weeks affect the forecast, surface the flag ("forecast may be skewed by recent duty change").
- Reasoning text only references features the model actually used — no narrative invention. If the mean-reversion term drove the call, say so ("Pump prices are 3.1p above their 8-week average, and prices typically pull back from that level"). If the seasonality term drove it, say so.
forecast_outcomesis populated automatically when the next BEIS week lands. Hit rate over the trailing 13 weeks is shown next to the headline.- When the volatility regime flag is on, the UI shows the
volatilebadge and the trigger (e.g. "Brent up 4.2% yesterday — forecast may be stale within days"). Verdict is suppressed visibly via gate 3, never silently. - The LLM overlay is shown separately from the ridge model, never
blended. "Model says down (68%); news scan agrees, citing {event}" —
the
ridge_confidencenumber stays calibrated and untouched, while LLM and volatility status are presented as their own badges. - LLM citations with unreachable URLs are dropped from the displayed
reasoning but kept in
llm_overlays.events_jsonfor audit. We never show a citation we haven't verified.
What gets deleted at the end of Phase 4
app/Services/Prediction/Signals/*(whole directory)NationalFuelPredictionServiceinternals (kept as a thin wrapper, then renamed when the frontend migration completes)price_predictionstable — replaced byweekly_forecasts(ridge) +llm_overlays(news layer)OilPriceService::generatePrediction(), EWMA/LLM helpers — replaced byLlmOverlayService(Phase 8) which has a different contractOilPriceService::fetchBrentPrices()— kept and expanded in Phase 7 (backfill mode + daily refresh), not deleted.claude/rules/scoring.mdretired in favour of a fresh.claude/rules/forecasting.md.claude/rules/prediction.mdrewritten to match the new architecture
Open decisions (to confirm before Phase 1)
- Forecast cadence — the forecast itself is weekly (matches BEIS publication). The confidence and presentation update daily via Layer 4 (LLM) and Layer 5 (volatility regime). This split is deliberate — we refuse to fabricate intra-week movement, but we don't pretend a static Monday number is reliable on Thursday after a 6% Brent move.
- Scope — drop the six-signal aggregator entirely, confirmed.
- API shape — keep existing JSON output keys so Vue keeps working,
with the engine swapped under the hood. The original
confidence_scorefield maps toridge_confidence(calibrated, untouched). Add new fields:volatility({active, trigger}),news_overlay({direction, agreement, events}), andverdict_reason(which gate fired, if any). The verdict itself goes in the existingactionfield. - Brent — promoted to Phase 7 (was "optional, conditional"). Needed for the volatility detector, regardless of whether it's used in the ridge model.
- LLM — Anthropic Claude Haiku with web search. Single scheduled call at 07:00 UK (after the 06:30 Brent refresh). Plus event-driven refreshes when Layer 5 flips the volatility flag on, with a 4h cooldown. No fixed afternoon cron — by 13:00 UK, morning users have already made their fill-up decisions, so the value is too low to justify the extra noise. Hard confidence cap 75. Empty-citation rejection.
Changelog (substantive design decisions)
| When | Change | Why |
|---|---|---|
| 2026-05-01 v1 | Initial spec — three layers, six-signal aggregator removed, ridge model on BEIS weekly data | Replace incoherent NationalFuelPredictionService |
| 2026-05-01 v2 | Added Layer 4 (LLM news overlay) and Layer 5 (volatility regime detector). Pump prices can move daily during oil shocks; static weekly forecast must be backed by intra-week safety nets. | Iran/Hormuz-style shocks make a Monday-only confidence number stale by Wednesday |
| 2026-05-01 v3 | Verdict via rule gates, not multipliers. ridge_confidence displayed verbatim. LLM and volatility presented as badges. weeks_since_duty_change removed from features (kept as calibration override only — n=1 can't fit a coefficient). Backtest gate floor lowered 65 → 62 (realistic without Brent). Structural leak detection (per-feature timestamp check) made primary; accuracy>75% demoted to secondary smell test. weekly_forecasts PK changed to (forecast_for, model_version) to preserve audit on retrain. forecast_outcomes made three-class. Layer 5 station-churn trigger gated until ≥180 days of stable polling. |
Multipliers obscure calibration. Gates compose cleanly and stay auditable. |
References
- Alquist, Kilian, Vigfusson (2013) — Forecasting the Price of Oil — the academic basis for "no-change baseline beats most structural models at <6m horizons" (which is why Phase 2 matters as a hard floor).
- BEIS Weekly road fuel prices CSV — the 435-week training set.
.claude/rules/scoring.md,.claude/rules/prediction.md— the two inconsistent rule files this spec replaces.