feat: add prediction rebuild design spec — Layer 1 ridge model, LLM news overlay, volatility regime detector
Documents complete replacement of six-signal aggregator with calibrated ridge forecaster trained on 435 weeks of BEIS pump prices. Five-layer architecture: weekly baseline (Layer 1), local snapshot (Layer 2), rule-gated verdict merger (Layer 3), daily LLM news
docs/superpowers/specs/2026-05-01-prediction-rebuild-design.md
# Prediction Rebuild — Design Spec
## Context

The current prediction service (`NationalFuelPredictionService` + six signal
classes) produces output the user has repeatedly described as "doesn't make
sense": headlines that contradict their own reasoning text, weights that
nobody can defend a number on, and confidence values that aren't grounded in
any track record. Two earlier docs (`.claude/rules/scoring.md`, `.claude/rules/prediction.md`)
disagree on the weights of the same signals, which is itself evidence that
the design has drifted.

This spec replaces the entire prediction stack from scratch around the
historical data we actually have, with a model whose confidence values are
calibrated against its own backtested track record.

Goals:
- A "fill up now or wait?" call honest about uncertainty.
- Confidence values calibrated against backtested residuals — "70%" actually
  means "in 7 of every 10 cases like this, the model called direction right".
- Simple enough to debug a year from now.
- Remove the six-signal aggregator entirely.
- Recognise that pump prices, while *measured* weekly by BEIS, can *move* daily
  during oil shocks (Iran, OPEC surprise cuts, Hormuz disruption). The static
  weekly forecast must be backed by a daily news/event overlay so we can flag
  staleness in real time rather than pretend a Monday number is still valid on
  Thursday after a 6% Brent move.

---
## Inputs (audited 2026-05-01)

| Source | Status | Use in v1 |
|---|---|---|
| `weekly_pump_prices` | 435 weeks, all Mondays, 0 outliers, 1 duty change (Mar 2022, 57.95p → 52.95p), VAT stable at 20% | **Foundation** — train Layer 1 |
| `station_prices_current` | ~7,550 stations × e10, ~7,620 × b7_standard | **Layer 2** — descriptive snapshot |
| `stations` | 7,747 stations, 1,989 supermarkets, lat/lng | Layer 2 |
| `station_prices` | 75 days of changes since 2026-01-16, sample mix uneven per day | Not modelled in v1, but **used by the volatility regime detector** as a churn indicator (% stations changing price / day vs 30-day baseline). |
| `brent_prices` | 30 days only | **Backfilled in Phase 7** (8 years from FRED, single API call). Used as a Brent-move volatility trigger and as fuel for the daily LLM overlay. |

The Fuel Finder API has been confirmed empirically to have **no historical
archive** — `effective-start-timestamp` is a station-level filter on current
prices, not a time-window query. Per-station deep history can only accrue
forward from the date polling started.

---
## Architecture — five thin layers

### Layer 1 — National weekly forecaster (predictive, calibrated)

Trained once weekly on `weekly_pump_prices`. Output:

- `direction ∈ {rising, falling, flat}`
- `magnitude_pence` — predicted Δ price next week
- `ridge_confidence` (0–100) — calibrated from backtested residuals, not
  from the model's raw output

This is the **quantitative baseline**. It updates only when the BEIS Monday
publication arrives (so the *forecast itself* changes weekly), but the
*displayed verdict* (Layer 3) can be suppressed in real time by Layers 4 and 5.
The confidence number itself is never adjusted.

`direction = flat` whenever `|magnitude_pence| < FLAT_THRESHOLD`. Phase 3
picks `FLAT_THRESHOLD` from the backtest residual distribution; the
starting value is **0.2p / litre**.
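The flat-threshold rule above can be sketched in a few lines. This is a minimal illustration, assuming the magnitude is expressed in pence per litre; the function name `classify_direction` is hypothetical, and the 0.2p default is only the starting value named in this spec.

```python
# Starting value from the spec; Phase 3 is expected to retune it.
FLAT_THRESHOLD_PENCE = 0.2


def classify_direction(magnitude_pence: float,
                       flat_threshold: float = FLAT_THRESHOLD_PENCE) -> str:
    """Map a predicted weekly change (pence per litre) to a direction label."""
    # Anything strictly inside the threshold band is treated as noise.
    if abs(magnitude_pence) < flat_threshold:
        return "flat"
    return "rising" if magnitude_pence > 0 else "falling"
```

Because the comparison is strict, a predicted move of exactly the threshold counts as a real move rather than `flat`.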
### Layer 2 — Local snapshot (descriptive, NOT predictive)

Pure SQL aggregates against `station_prices_current` + Haversine on
`stations.lat/lng`. No ML, no history, no surprises:

- `local_avg_50km(fuel_type, lat, lng)`
- `national_avg(fuel_type)`
- `cheapest_within(km, fuel_type, lat, lng)`
- `supermarket_avg_local`, `major_avg_local`, gap

Layer 2 never speaks about the future. It describes the present.
### Layer 3 — Verdict merger (rule-based gates, no multipliers)

Single user-facing verdict ∈ {`fill_now`, `wait`, `no_signal`}. The
displayed confidence number is `ridge_confidence` itself, **untouched**.
LLM agreement and volatility status are shown as separate **badges**, not
blended into the number. Honesty over smoothing.

Gates evaluated in order, first match wins:

```
1. direction == 'flat'                          → no_signal
2. ridge_confidence < 40                        → no_signal
3. volatility_regime active                     → no_signal (badge: volatile)
4. LLM disagrees AND ridge_confidence < 75      → no_signal (badge: conflicting)
5. rising AND ridge_confidence >= 70            → fill_now
6. falling AND ridge_confidence >= 70           → wait
7. otherwise (40 <= conf < 70, no veto from 3 or 4) → dashboard-only
```

Why gates, not multipliers:

- A multiplied confidence number is a black-box blend that the user can't
  audit. A 70% that used to be 90% before today's volatility hit looks
  identical to a 70% that's been calibrated all along.
- Gates compose cleanly. Each rule has one job and is independently
  testable.
- The verdict is discrete anyway (notify / don't / silent). Smoothing
  confidence under the hood doesn't help that decision — it only obscures it.

Layer 2 affects **urgency wording only** ("fill up now, *especially* in
your area at 2p above national"). It never changes the verdict. Neither
Layer 4 nor Layer 5 changes it either: they can suppress (gates 3 and 4)
but never flip the direction.
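As a rough illustration of "first match wins", the ordered gates could be written as a single function with early returns. This is a sketch, not the service's actual code: the function name `evaluate_gates`, the `llm_disagrees` boolean (assumed to be derived from `agrees_with_ridge`), and the `dashboard_only` spelling of the gate 7 outcome are all assumptions.

```python
from typing import Optional, Tuple


def evaluate_gates(direction: str,
                   ridge_confidence: int,
                   volatility_active: bool,
                   llm_disagrees: bool) -> Tuple[str, Optional[str]]:
    """Return (verdict, suppression_badge); the first matching gate wins."""
    if direction == "flat":                                  # gate 1
        return "no_signal", None
    if ridge_confidence < 40:                                # gate 2
        return "no_signal", None
    if volatility_active:                                    # gate 3
        return "no_signal", "volatile"
    if llm_disagrees and ridge_confidence < 75:              # gate 4
        return "no_signal", "conflicting"
    if direction == "rising" and ridge_confidence >= 70:     # gate 5
        return "fill_now", None
    if direction == "falling" and ridge_confidence >= 70:    # gate 6
        return "wait", None
    # Gate 7: confidence in [40, 70) with no veto (hypothetical label).
    return "dashboard_only", None
```

Note that `ridge_confidence` is only read, never modified, which is the point of using gates. Informational badges for cases that do not suppress the verdict are left out of this sketch.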
### Layer 4 — Daily LLM news overlay (qualitative, news-aware)

**Single scheduled call at 07:00 UK.** Plus an event-driven refresh when
Layer 5's volatility flag flips ON (with a 4-hour cooldown so the same
event doesn't trigger repeatedly).

JSON in, JSON out. Calls Claude Haiku with web search enabled, asks for
direction + confidence + cited events with URLs. Stored in a new
`llm_overlays` table.

Layer 4 is **read-only with respect to the volatility flag**. It writes
its result row; only Layer 5 mutates `volatility_regimes.active`.

LLM confidence is hard-capped at 75 in code (web-searched LLMs are
systematically overconfident). Calls without `events_cited` are rejected.

### Layer 5 — Volatility regime detector (intra-week safety net)

Hourly cron. **Sole owner** of the `volatility_regimes.active` flag.
Reads four signals, OR-combined:

1. Daily Brent move > 3% close-to-close (FRED `DCOILBRENTEU`, Phase 7).
2. Most recent `llm_overlays.major_impact_event = true` AND at least one
   verified URL.
3. `station_prices` daily churn rate > 1.5× its 30-day baseline.
4. A `watched_events` row covering today (manually flagged geopolitical
   periods).

When the flag flips on:
- An event-driven LLM refresh is queued (Layer 4) if last run was > 4h ago.
- **Layer 3's gate 3 fires**: verdict forced to `no_signal` with the
  `volatile` badge.
- The reasoning text appended: *"Volatility detected ({trigger}) — this
  forecast may be stale within days."*

When it flips off:
- Verdict returns to whatever the gates produce on the unchanged
  `ridge_confidence` (no multiplier to reset — there are none).
- Badge cleared.
- Next morning's 07:00 LLM call still runs (it always runs); no extra
  refreshes are queued.

Layer 5 never changes Layer 1's *direction*. It only suppresses the
verdict via gate 3.

---
## Methodology — Layer 1

### Target

```
ΔULSP[t+1] = ULSP[t+1] − ULSP[t]
```

We model the **change**, not the level. UK pump prices are non-stationary,
so regressing on levels gives spurious R² and useless coefficients.
Differencing makes the series stationary.

### Features (all stationary)

| Feature | Notes |
|---|---|
| `Δulsp_lag_0`, `Δulsp_lag_1`, `Δulsp_lag_3` | 1w / 2w / 4w momentum |
| `Δulsd_lag_0` | Diesel cross-signal as a *change* |
| `ulsp[t] − ma8[t]` | **Mean-reversion term** — gap between current price and 8-week MA. Single most useful feature for 1-week-ahead UK pump forecast. |
| `week_of_year_sin`, `week_of_year_cos` | Cyclic seasonality encoding |
| `is_pre_bank_holiday` | Boolean, within 7 days of UK bank holiday |

The level only enters as the deviation from MA-8 (itself stationary).
That's the only way levels are allowed in.

**Duty change is NOT a feature.** With one event in 435 weeks, n=1 cannot
fit a meaningful coefficient. Instead, duty-change-adjacent weeks (±4
weeks of a known change) are handled in the **calibration override**
(see below) — confidence is halved and the regime flag is surfaced in
the reasoning text. A regime can be flagged. A coefficient cannot be
trained from one observation.
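To make the feature table concrete, here is a minimal sketch of building one feature row for the latest week `t`. It assumes weekly prices arrive as plain lists ordered oldest to newest, that `Δulsp_lag_k` means the weekly change observed `k` weeks back, and that seasonality uses a 52-week period; the function name `build_features` and the `ulsp_minus_ma8` key are hypothetical.

```python
import math
from typing import Dict, List


def build_features(ulsp: List[float], ulsd: List[float],
                   week_of_year: int,
                   is_pre_bank_holiday: bool) -> Dict[str, float]:
    """Build one feature row for week t (the last element of each list)."""
    if len(ulsp) < 8 or len(ulsd) < 2:
        raise ValueError("need >= 8 petrol weeks and >= 2 diesel weeks")

    def delta(series: List[float], lag: int) -> float:
        # Weekly change observed `lag` weeks back: s[t-lag] - s[t-lag-1].
        return series[-1 - lag] - series[-2 - lag]

    ma8 = sum(ulsp[-8:]) / 8.0                       # 8-week moving average
    angle = 2.0 * math.pi * week_of_year / 52.0      # assumed 52-week cycle
    return {
        "Δulsp_lag_0": delta(ulsp, 0),
        "Δulsp_lag_1": delta(ulsp, 1),
        "Δulsp_lag_3": delta(ulsp, 3),
        "Δulsd_lag_0": delta(ulsd, 0),
        "ulsp_minus_ma8": ulsp[-1] - ma8,            # mean-reversion term
        "week_of_year_sin": math.sin(angle),
        "week_of_year_cos": math.cos(angle),
        "is_pre_bank_holiday": 1.0 if is_pre_bank_holiday else 0.0,
    }
```

Every value uses only data up to and including week `t`, so the row can be paired with the target `ΔULSP[t+1]` without looking ahead.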
### Model

Ridge regression. Boring on purpose:

- 435 weekly observations is too few to beat a well-specified linear model
  out-of-sample with gradient boosting or LSTM — those would just fit noise.
- Interpretable coefficients are essential for the honesty layer
  (the reasoning text describes what the model used).

Upgrade to a non-linear model **only** if Phase 3 backtest demonstrates the
linear model is missing real structure.
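For readers who want to see how little machinery ridge regression needs, the closed-form fit can be sketched with the standard library alone. This is an illustration under assumptions, not the project's implementation: the intercept is left unpenalised by centring the data, `alpha` is a placeholder for whatever penalty Phase 3 selects, and the helper names are hypothetical. A real build would more likely use an established linear-algebra library.

```python
from typing import List, Tuple


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                    # back-substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(rows: List[List[float]], y: List[float],
              alpha: float) -> Tuple[List[float], float]:
    """Return (weights, intercept) minimising ||y - Xw - b||^2 + alpha*||w||^2."""
    n, p = len(rows), len(rows[0])
    x_mean = [sum(r[j] for r in rows) / n for j in range(p)]
    y_mean = sum(y) / n
    xc = [[r[j] - x_mean[j] for j in range(p)] for r in rows]  # centred X
    yc = [v - y_mean for v in y]                               # centred y
    # Normal equations with the ridge penalty on the diagonal.
    gram = [[sum(xc[i][j] * xc[i][k] for i in range(n))
             + (alpha if j == k else 0.0)
             for k in range(p)] for j in range(p)]
    rhs = [sum(xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    w = solve_linear(gram, rhs)
    intercept = y_mean - sum(w[j] * x_mean[j] for j in range(p))
    return w, intercept
```

With `alpha = 0` this reduces to ordinary least squares; a positive `alpha` shrinks the weights towards zero, which is the behaviour that keeps a 435-row fit from chasing noise.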
### Training and evaluation split

- Train on weeks 1–305 (~70%).
- Evaluate on weeks 306–435 (~30%) with rolling-origin cross-validation
  (single-split would overfit hyperparameters to one window).
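A rolling-origin evaluation over weeks 306–435 might be generated as below. The spec does not say whether the training window expands or slides, so this sketch assumes an expanding window; the function name `rolling_origin_splits` and the `step` parameter are hypothetical.

```python
from typing import Iterator, List, Tuple


def rolling_origin_splits(n_weeks: int, first_eval_week: int,
                          step: int = 1) -> Iterator[Tuple[List[int], int]]:
    """Yield (train_weeks, eval_week) pairs with an expanding training window.

    Weeks are numbered from 1. Each fold trains on every week strictly
    before the evaluation week, so no future data reaches the model.
    """
    for eval_week in range(first_eval_week, n_weeks + 1, step):
        yield list(range(1, eval_week)), eval_week
```

With `n_weeks=435` and `first_eval_week=306`, the first fold trains on weeks 1–305 and predicts week 306, matching the split described above.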
### Confidence calibration

Two-stage calibration:

1. **Magnitude binning** — bin predictions by predicted `|magnitude|` and
   record actual hit rate per bin. The published `confidence_score` reads
   from this lookup, not from the model's raw output.
2. **Regime flag** — flag any forecast week within ±4 weeks of a known
   duty change. With only one duty change in 435 weeks, statistical
   stratification at n=1 is impossible. Instead:
   - For flagged weeks, halve the calibrated confidence manually.
   - Surface the flag in the reasoning text: *"Recent duty change —
     forecast accuracy is reduced for the next several weeks."*

This is the only place v1 accepts a hand-tuned guard, and it's there
because the data can't tell us better.
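The two calibration stages could look roughly like this. It is a sketch under assumptions: the bin edges are invented for illustration (the spec does not fix them), empty bins fall back to 0 as a conservative choice, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical bin edges in pence; the spec leaves these to the backtest.
BIN_EDGES = (0.2, 0.5, 1.0, 2.0)


def bin_index(abs_magnitude: float,
              edges: Sequence[float] = BIN_EDGES) -> int:
    """Index of the bin holding abs_magnitude (0 = below the first edge)."""
    return sum(1 for edge in edges if abs_magnitude >= edge)


def build_calibration(history: List[Tuple[float, bool]],
                      edges: Sequence[float] = BIN_EDGES) -> List[float]:
    """Hit rate (0..100) per bin from (predicted_magnitude, was_correct)."""
    hits = [0] * (len(edges) + 1)
    totals = [0] * (len(edges) + 1)
    for magnitude, correct in history:
        i = bin_index(abs(magnitude), edges)
        totals[i] += 1
        hits[i] += int(correct)
    # Assumption: an empty bin reports 0 so it can never clear a gate.
    return [100.0 * h / t if t else 0.0 for h, t in zip(hits, totals)]


def calibrated_confidence(magnitude: float, table: List[float],
                          flagged_duty_change: bool = False,
                          edges: Sequence[float] = BIN_EDGES) -> int:
    """Stage 1: look up the bin hit rate. Stage 2: halve it if flagged."""
    confidence = table[bin_index(abs(magnitude), edges)]
    if flagged_duty_change:
        confidence /= 2.0
    return round(confidence)
```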

---

## Methodology — Layer 2

Pure aggregates. No model.

```sql
-- Local 50km average
SELECT AVG(price_pence) FROM station_prices_current
JOIN stations ON station_prices_current.station_id = stations.node_id
WHERE fuel_type = ? AND <Haversine within 50km of (lat, lng)>;

-- National average
SELECT AVG(price_pence) FROM station_prices_current WHERE fuel_type = ?;

-- Cheapest within 25km
SELECT stations.*, station_prices_current.price_pence
FROM station_prices_current
JOIN stations ON station_prices_current.station_id = stations.node_id
WHERE fuel_type = ? AND <Haversine within 25km>
ORDER BY price_pence ASC LIMIT 5;

-- Supermarket vs major split, locally
SELECT stations.is_supermarket, AVG(price_pence)
FROM station_prices_current
JOIN stations ON station_prices_current.station_id = stations.node_id
WHERE fuel_type = ? AND <Haversine within 25km>
GROUP BY stations.is_supermarket;
```

Output is descriptive: "Your area is X p above national average right
now", "Cheapest near you: {station} at {price}", "Supermarkets near you:
{avg} vs majors: {avg}". **Never** predictive language.
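The `<Haversine within N km>` placeholders in the queries above stand for a great-circle distance filter. As a reference for what that filter computes, here is a small sketch of the standard haversine formula; the dictionary keys `lat` and `lng` mirror the `stations` columns, while the function names and the 6371 km mean Earth radius are choices made for this example.

```python
import math
from typing import Dict, List

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1: float, lng1: float,
                 lat2: float, lng2: float) -> float:
    """Great-circle distance in kilometres between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_km(stations: List[Dict], lat: float, lng: float,
              radius_km: float) -> List[Dict]:
    """Keep stations whose distance from (lat, lng) is within radius_km."""
    return [s for s in stations
            if haversine_km(lat, lng, s["lat"], s["lng"]) <= radius_km]
```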

---

## Methodology — Layer 3

Full gate ordering is in the Architecture section (Layer 3). Summary:

- Verdict via ordered rule gates, **not** multipliers.
- `ridge_confidence` is displayed verbatim — never multiplied.
- Volatility flag and LLM disagreement act as **suppressors with badges**
  (`volatile`, `conflicting`) but never flip direction.
- `direction == 'flat'` always produces `no_signal`.
- LLM disagreement only suppresses the verdict when `ridge_confidence < 75`.
  At or above 75 the model's call is strong enough to stand even with a news-scan
  disagreement (the LLM is hard-capped at 75 confidence anyway, so it
  can't out-confidence the ridge model — only flag a tension).

Local position from Layer 2 modifies urgency wording only:

- If user's local average is materially above national (>2p), and Layer 1
  says "rising", urgency increased ("fill up now, *especially* in your area").
- Layer 2 never flips Layer 1's direction.

---
## Methodology — Layer 4 (LLM news overlay)

Single scheduled call daily at 07:00 UK. Additional event-driven calls
are queued by Layer 5 when the volatility flag flips ON, with a 4-hour
cooldown enforced in code (skip the queue if the most recent
`llm_overlays.ran_at` is within 4 hours).

**Brent input** (`brent_recent_14_days`) is optional — passed as `null`
until Phase 7 backfills `brent_prices`. Phase 8 cannot ship before
Phase 7 — explicit dependency.
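The 4-hour cooldown on event-driven refreshes described above reduces to one timestamp comparison. A minimal sketch follows, assuming `last_ran_at` is the most recent `llm_overlays.ran_at` (or `None` if no overlay has run yet) and that exactly four hours elapsed counts as outside the cooldown; the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

COOLDOWN = timedelta(hours=4)


def should_queue_refresh(last_ran_at: Optional[datetime], now: datetime,
                         cooldown: timedelta = COOLDOWN) -> bool:
    """Queue an event-driven refresh unless an overlay ran within the cooldown."""
    if last_ran_at is None:
        return True  # nothing has run yet, so there is nothing to cool down from
    return now - last_ran_at >= cooldown
```

The scheduled 07:00 call is not subject to this check; only the event-driven path queued by Layer 5 is.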
### Request shape (JSON)

```json
{
  "input": {
    "ulsp_recent_8_weeks": [...],
    "brent_recent_14_days": [...],
    "current_week_of_year": 18,
    "days_to_next_bank_holiday": 5,
    "duty_pence": 52.95,
    "ridge_model_says": {
      "direction": "falling",
      "confidence": 68,
      "magnitude_pence": -0.4
    }
  },
  "ask": "Search recent news for oil-supply, OPEC, refinery, shipping, sanctions, geopolitical events affecting UK retail fuel prices over the next 1-2 weeks. Reply ONLY in the schema below."
}
```
### Response shape (JSON, enforced)

```json
{
  "direction": "rising | falling | flat",
  "confidence": 0,
  "reasoning_short": "1-2 sentences",
  "events_cited": [
    {"headline": "...", "source": "...", "url": "...", "impact": "rising|falling|neutral"}
  ],
  "agrees_with_ridge": true,
  "major_impact_event": false
}
```
### Code-level guards (not in the prompt)

1. **Cap `confidence` at 75.** Web-searched LLMs are systematically overconfident.
2. **Reject the response if `events_cited` is empty.** Forces the LLM to
   ground its call in something checkable, not vibes.
3. **Verify each `url` in `events_cited` is reachable** before storing.
   Catches hallucinated citations. Failed URLs blank the citation but
   don't reject the call (newer URLs sometimes 404 briefly).
4. **Layer 4 does NOT mutate `volatility_regimes.active`.** It writes its
   row to `llm_overlays` (with `major_impact_event` + verified URLs) and
   that's it. Layer 5's hourly cron picks up the new row and decides
   whether to flip the flag.
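Guards 1 to 3 can be sketched as one post-processing step applied to the parsed LLM response. This is an illustration, not the `LlmOverlayService` contract: the reachability check is injected as a callable so the sketch needs no network access, and the `verified` key added to each event is an assumption made for this example.

```python
from typing import Callable, Dict, List, Optional

LLM_CONFIDENCE_CAP = 75


def apply_guards(response: Dict,
                 url_is_reachable: Callable[[str], bool]) -> Optional[Dict]:
    """Return a sanitised copy of the response, or None if it is rejected."""
    events: List[Dict] = response.get("events_cited") or []
    if not events:
        return None  # guard 2: nothing checkable was cited

    cleaned = dict(response)
    # Guard 1: cap whatever confidence the LLM reported.
    cleaned["confidence"] = min(int(response.get("confidence", 0)),
                                LLM_CONFIDENCE_CAP)

    checked = []
    for event in events:
        event = dict(event)  # copy so the caller's data is not mutated
        ok = bool(event.get("url")) and url_is_reachable(event["url"])
        event["verified"] = ok  # hypothetical field, not in the spec's schema
        if not ok:
            event["url"] = None  # guard 3: blank the citation, keep the call
        checked.append(event)
    cleaned["events_cited"] = checked
    return cleaned
```

Guard 4 needs no code here: this function has no access to `volatility_regimes`, so it cannot mutate the flag.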
### How Layer 3 uses it

- LLM agrees → no gating effect; `agrees` badge shown next to the verdict
  ("News scan agrees, citing {event}").
- LLM disagrees AND `ridge_confidence < 75` → **gate 4 fires**: verdict
  forced to `no_signal` with the `conflicting` badge.
- LLM disagrees AND `ridge_confidence >= 75` → no suppression; the
  disagreement is shown as a badge but the model's strong call stands.
- LLM neutral / flat → no gating effect.
- Direction is never flipped by the LLM.

---
## Methodology — Layer 5 (volatility regime detector)

Hourly cron. **Sole owner** of `volatility_regimes.active`. Reads four
signals, OR-combined:

1. **Brent move** — close-to-close daily Brent move > 3% on FRED
   `DCOILBRENTEU`. FRED publishes with a one-day lag (today's value is
   yesterday's settle), so the trigger reflects the most recent settled
   day. Sufficient for v1 — we don't have a real-time Brent feed.
2. **LLM major-impact flag** — most recent `llm_overlays` row has
   `major_impact_event = true` AND at least one verified URL.
3. **Station churn** — *gated until ≥180 days of stable polling.* The
   trigger fires when the last-24h % of stations updating price exceeds
   1.5× the 30-day rolling baseline. With only 75 days of uneven polling
   (Jan 16 → May 1) the baseline is meaningless — sample-mix variance
   would dominate any real shock signal. The trigger is implemented but
   disabled in code via a feature flag; flip it on once `station_prices`
   has 180+ continuous days.
4. **Manual `watched_events`** — a row covering today. Lets you flag
   known geopolitical periods manually (e.g. "Iran tensions Apr–May 2026").

When the flag flips on:

- An event-driven Layer 4 LLM refresh is queued (skipped if the most
  recent `llm_overlays.ran_at` is within 4 hours — cooldown).
- **Layer 3's gate 3 fires**: verdict forced to `no_signal` with the
  `volatile` badge for as long as the flag stays on.
- Reasoning text appended: *"Volatility detected ({trigger label}) — this
  forecast may be stale within days."*

When it flips off:
- Verdict returns to whatever the gates produce on the unchanged
  `ridge_confidence` (no multiplier reset needed — there are no multipliers).
- Badge cleared.
- The next morning's 07:00 LLM call still runs (always does); no extra
  refreshes are queued by Layer 5.
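The OR-combination of the four triggers might be sketched as a function that returns every trigger currently firing; the flag is active when the list is non-empty. Assumptions are labelled in the code: the Brent check treats moves in either direction as volatile, the churn feature flag is passed in as `churn_enabled`, and all names other than the trigger labels from the `volatility_regimes.trigger` enum are hypothetical.

```python
from datetime import datetime
from typing import List, Tuple

BRENT_MOVE_THRESHOLD_PCT = 3.0
CHURN_MULTIPLIER = 1.5


def brent_move_pct(previous_close: float, latest_close: float) -> float:
    """Close-to-close percentage move between two settled days."""
    return 100.0 * (latest_close - previous_close) / previous_close


def active_triggers(previous_close: float, latest_close: float,
                    llm_major_event: bool, llm_verified_urls: int,
                    churn_today: float, churn_baseline: float,
                    churn_enabled: bool,
                    watched_windows: List[Tuple[datetime, datetime]],
                    now: datetime) -> List[str]:
    """Return the labels of all triggers firing right now (OR-combined)."""
    triggers = []
    # Assumption: a sharp fall is as destabilising as a sharp rise.
    if abs(brent_move_pct(previous_close, latest_close)) > BRENT_MOVE_THRESHOLD_PCT:
        triggers.append("brent_move")
    if llm_major_event and llm_verified_urls >= 1:
        triggers.append("llm_event")
    # Gated by a feature flag until enough polling history exists.
    if (churn_enabled and churn_baseline > 0
            and churn_today > CHURN_MULTIPLIER * churn_baseline):
        triggers.append("station_churn")
    if any(start <= now <= end for start, end in watched_windows):
        triggers.append("manual")
    return triggers
```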

---

## Schema deltas

### Add

```
weekly_forecasts
  id                    BIGINT PK
  forecast_for          DATE              — Monday the forecast covers
  model_version         VARCHAR(32)       — links back to backtests row
  direction             ENUM('rising','falling','flat')
  magnitude_pence       SMALLINT          — predicted Δ × 100, signed
  ridge_confidence      TINYINT UNSIGNED  — 0..100, calibrated from backtested residuals. Displayed verbatim. Layer 3 gates may suppress the verdict but never modify this number.
  flagged_duty_change   BOOLEAN           — true if forecast is within ±4 weeks of a duty change (avoids collision with Layer 5's volatility_regimes)
  reasoning             TEXT              — generated from features actually used
  generated_at          DATETIME
  UNIQUE (forecast_for, model_version)
  INDEX (forecast_for, generated_at DESC)

forecast_outcomes
  forecast_for          DATE
  model_version         VARCHAR(32)
  predicted_class       ENUM('rising','falling','flat')
  actual_class          ENUM('rising','falling','flat')
  correct               BOOLEAN
  abs_error_pence       SMALLINT UNSIGNED
  resolved_at           DATETIME
  PRIMARY KEY (forecast_for, model_version)

backtests
  id                    BIGINT PK
  model_version         VARCHAR(32) UNIQUE
  features_json         JSON              — feature spec
  train_start           DATE
  train_end             DATE
  eval_start            DATE
  eval_end              DATE
  directional_accuracy  DECIMAL(5,2)
  mae_pence             DECIMAL(5,2)
  calibration_table     JSON              — {bin_low..bin_high → empirical_hit_rate}
  leak_suspected        BOOLEAN           — secondary smell test: true if directional_accuracy > 75. Primary leak detection is structural (see Backtest section).
  ran_at                DATETIME

llm_overlays
  id                    BIGINT PK
  ran_at                DATETIME
  forecast_for_week     DATE              — which weekly forecast it overlays
  direction             ENUM('rising','falling','flat')
  confidence            TINYINT UNSIGNED  — capped 75 in code
  reasoning             TEXT
  events_json           JSON              — cited events with verified URLs
  agrees_with_ridge     BOOLEAN
  major_impact_event    BOOLEAN
  volatility_flag_on    BOOLEAN           — was the regime flag on at run time
  search_used           BOOLEAN
  INDEX (forecast_for_week, ran_at)

volatility_regimes
  id                    BIGINT PK
  flipped_on_at         DATETIME
  flipped_off_at        DATETIME NULL
  trigger               ENUM('brent_move','llm_event','station_churn','manual')
  trigger_detail        TEXT              — e.g. "Brent +4.2% close-to-close"
  active                BOOLEAN

watched_events
  id                    BIGINT PK
  label                 VARCHAR(128)
  starts_at             DATETIME
  ends_at               DATETIME
  notes                 TEXT
```
### Keep

- `weekly_pump_prices` — already loaded, source of truth
- `stations`, `station_prices_current` — for Layer 2
- `station_prices` — keep collecting forward, not modelled in v1

### Deprecate (delete after Layer 1 ships)

- `price_predictions` — old LLM/EWMA store, replaced by `weekly_forecasts`

The current six-signal aggregator (`NationalFuelPredictionService` and
`app/Services/Prediction/Signals/*`) is **fully replaced**, not extended.
Same JSON output keys (`predicted_direction`, `confidence_score`,
`action`, `reasoning`) so the Vue frontend doesn't break — engine swapped,
contract preserved.

---
## Implementation phases (each ships something working)

| Phase | Scope | Ships |
|---|---|---|
| **1. Backtest harness** | `BacktestRunner` service + `backtests` table. Takes a model class, train/eval split, returns directional accuracy + MAE + calibration curve. **Structural leak detection** built in (per-feature source-timestamp check vs target Monday); accuracy>75% smell test as secondary. | A way to *prove* any future model works before shipping it. |
| **2. Naive baseline** | "Predict next week = this week" implemented as a model class. Run through harness. | A floor: any future model must beat this. |
| **3. v1 ridge model** | Features above (incl. mean-reversion term), trained once, persisted with `model_version`. `WeeklyForecastService` runs it. Backtest must clear the acceptance gate. | First real forecast. Backtested numbers visible. |
| **4. Live wiring** | Replace `NationalFuelPredictionService` internals with a thin adapter delegating to `WeeklyForecastService`. Same API shape, new engine. | Frontend keeps working, predictions now from the new model. |
| **5. Local snapshot** | `LocalSnapshotService` — pure aggregates. Wire into `/api/stations` payload alongside the headline forecast. | "Your area" descriptive cards. |
| **6. Honesty layer** | Reasoning generator describes *what the model used*: lag values, season, holiday flag. Shows backtest accuracy badge. Returns explicit "not enough data" when confidence < 40. Surfaces the duty-change-adjacent flag when set. | The "no BS" framing. |
| **7. Brent backfill + daily refresh** | One FRED call (2018→today, ~2,150 daily rows). Daily refresh cron at **06:30 UK** (must complete before Phase 8's 07:00 LLM call — sequenced so the LLM has fresh Brent context). Used by Phase 9's volatility detector and as a feature option for future model iterations (only added to the ridge model if backtested lift is ≥3 percentage points directional accuracy). | Daily Brent in DB. Foundation for volatility + LLM context. |
| **8. LLM news overlay** | `LlmOverlayService` — single scheduled call at **07:00 UK** (after Brent refresh). Plus event-driven calls when Layer 5 flips the volatility flag on, with 4h cooldown. JSON in / JSON out, web search enabled, results stored in `llm_overlays`. Feeds Layer 3's gate 4 (suppress when LLM disagrees AND ridge_confidence < 75) and the `agrees`/`conflicting` badges. URL-verification + empty-citation rejection enforced in code. **Depends on Phase 7.** | News-aware verdict suppression and badge on top of the calibrated ridge baseline. |
| **9. Volatility regime detector** | `VolatilityRegimeService` — hourly cron, sole owner of `volatility_regimes.active`. OR-combines four triggers: Brent move > 3%, LLM `major_impact_event`, station churn > 1.5× baseline (**gated until ≥180 days of stable polling**), `watched_events` row covering today. Fires Layer 3's gate 3 (verdict → `no_signal` with `volatile` badge) and the event-driven Layer 4 refresh. | The intra-week safety net for oil shocks. |

---
## Backtest acceptance gates (Phase 3 → Phase 4)

| Backtest result | Action |
|---|---|
| < 60% directional accuracy | Features are wrong. Stay in Phase 3, don't ship. |
| 60–62% | Marginal. One feature iteration, then re-evaluate. |
| **62–68%** | **Ship.** Realistic target for UK weekly pump direction without Brent. |
| 68–75% | Excellent. Ship and watch closely. |
| > 75% | **Stop.** Run the structural leak detector. Almost certainly time leakage (e.g. using `t+1` info accidentally in `t` features). The accuracy threshold is a secondary smell test, not the primary detector. |
| MAE > 1.0p / litre | Features are noisy. Refit before shipping. |
| Target MAE | 0.4–0.7p / litre. |
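One way to encode the table above as a harness decision is sketched below. Several details are assumptions rather than part of the spec: the lower bound of each accuracy band is treated as inclusive (the table lists 62 and 68 in two rows each), the leakage check takes priority over everything else, the MAE check is applied before the ship decision, and the returned labels are hypothetical.

```python
def acceptance_action(directional_accuracy: float, mae_pence: float) -> str:
    """Map backtest metrics (accuracy in %, MAE in pence) to a decision."""
    if directional_accuracy > 75.0:
        return "stop_and_check_leakage"   # suspiciously good: assume a leak
    if directional_accuracy < 60.0:
        return "do_not_ship"              # features are wrong
    if mae_pence > 1.0:
        return "refit"                    # direction is fine but features are noisy
    if directional_accuracy < 62.0:
        return "iterate_once"             # marginal band
    if directional_accuracy < 68.0:
        return "ship"
    return "ship_and_watch"
```

Returning a label rather than a boolean keeps the reason for a blocked release visible in the `backtests` audit trail.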
### Structural leak detection (primary)

Built into the backtest harness. For every (training_week, feature_value)
pair, the harness verifies the data source's effective timestamp is
**strictly before** the target Monday. Any feature whose source timestamp
is on or after the target week is treated as leakage and the backtest
fails fast. This is independent of accuracy — it catches leakage even
when it doesn't translate into suspiciously high accuracy.

The `> 75% accuracy` row is a secondary smell test for leakage modes the
structural check missed (e.g. label leakage via a downstream computed
column). Primary defence is the timestamp check. These numbers are
encoded in the harness as assertions, not aspirations.
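The timestamp check described above can be expressed as a fail-fast assertion. This sketch assumes each feature value carries the date of its underlying data source; the `LeakageError` class and the `assert_no_leakage` name are hypothetical.

```python
from datetime import date
from typing import Dict


class LeakageError(Exception):
    """Raised when a feature's source date is not strictly before the target."""


def assert_no_leakage(target_monday: date,
                      feature_sources: Dict[str, date]) -> None:
    """Fail fast if any feature was sourced on or after the target Monday."""
    for name, source_date in feature_sources.items():
        if source_date >= target_monday:
            raise LeakageError(
                f"{name}: source {source_date} is not before target {target_monday}"
            )
```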

---

## Honesty rules — non-negotiables

1. Backtest accuracy is **published in the UI**. The model wears its track
   record on its sleeve.
2. Below 40 confidence, the recommendation is `no_signal` and the reasoning
   says "we don't have enough signal to call it" — explicitly. No filler.
3. When duty-change-adjacent weeks affect the forecast, surface the flag
   ("forecast may be skewed by recent duty change").
4. Reasoning text only references features the model actually used — no
   narrative invention. If the mean-reversion term drove the call, say so
   ("Pump prices are 3.1p above their 8-week average, and prices typically
   pull back from that level"). If the seasonality term drove it, say so.
5. `forecast_outcomes` is populated automatically when the next BEIS week
   lands. Hit rate over the trailing 13 weeks is shown next to the headline.
6. When the **volatility regime flag** is on, the UI shows the `volatile`
   badge and the trigger (e.g. "Brent up 4.2% yesterday — forecast may be
   stale within days"). Verdict is suppressed visibly via gate 3, never
   silently.
7. The LLM overlay is **shown separately** from the ridge model, never
   blended. "Model says down (68%); news scan agrees, citing {event}" —
   the `ridge_confidence` number stays calibrated and untouched, while
   LLM and volatility status are presented as their own badges.
8. LLM citations with unreachable URLs are **dropped from the displayed
   reasoning** but kept in `llm_overlays.events_json` for audit. We never
   show a citation we haven't verified.

---
## What gets deleted at the end of Phase 4

- `app/Services/Prediction/Signals/*` (whole directory)
- `NationalFuelPredictionService` internals (kept as a thin wrapper, then
  renamed when the frontend migration completes)
- `price_predictions` table — replaced by `weekly_forecasts` (ridge) +
  `llm_overlays` (news layer)
- `OilPriceService::generatePrediction()`, EWMA/LLM helpers — replaced by
  `LlmOverlayService` (Phase 8) which has a different contract
- `OilPriceService::fetchBrentPrices()` — kept and **expanded** in Phase 7
  (backfill mode + daily refresh), not deleted
- `.claude/rules/scoring.md` retired in favour of a fresh
  `.claude/rules/forecasting.md`
- `.claude/rules/prediction.md` rewritten to match the new architecture

---
## Open decisions (to confirm before Phase 1)

- **Forecast cadence** — the *forecast itself* is weekly (matches BEIS
  publication). The *verdict and badges* update intra-week via Layer 4
  (LLM) and Layer 5 (volatility regime). This split is deliberate — we
  refuse to fabricate intra-week movement, but we don't pretend a static
  Monday number is reliable on Thursday after a 6% Brent move.
- **Scope** — drop the six-signal aggregator entirely, confirmed.
- **API shape** — keep existing JSON output keys so Vue keeps working,
  with the engine swapped under the hood. The original `confidence_score`
  field maps to `ridge_confidence` (calibrated, untouched). Add new
  fields: `volatility` (`{active, trigger}`), `news_overlay`
  (`{direction, agreement, events}`), and `verdict_reason` (which gate
  fired, if any). The verdict itself goes in the existing `action` field.
- **Brent** — promoted to Phase 7 (was "optional, conditional"). Needed
  for the volatility detector, regardless of whether it's used in the
  ridge model.
- **LLM** — Anthropic Claude Haiku with web search. Single scheduled call
  at 07:00 UK (after the 06:30 Brent refresh). Plus event-driven refreshes
  when Layer 5 flips the volatility flag on, with a 4h cooldown. No fixed
  afternoon cron — by 13:00 UK, morning users have already made their
  fill-up decisions, so the value is too low to justify the extra noise.
  Hard confidence cap 75. Empty-citation rejection.

---
## Changelog (substantive design decisions)

| When | Change | Why |
|---|---|---|
| 2026-05-01 v1 | Initial spec — three layers, six-signal aggregator removed, ridge model on BEIS weekly data | Replace incoherent `NationalFuelPredictionService` |
| 2026-05-01 v2 | Added Layer 4 (LLM news overlay) and Layer 5 (volatility regime detector). Pump prices can move daily during oil shocks; static weekly forecast must be backed by intra-week safety nets. | Iran/Hormuz-style shocks make a Monday-only confidence number stale by Wednesday |
| 2026-05-01 v3 | **Verdict via rule gates, not multipliers.** `ridge_confidence` displayed verbatim. LLM and volatility presented as badges. `weeks_since_duty_change` removed from features (kept as calibration override only — n=1 can't fit a coefficient). Backtest gate floor lowered 65 → 62 (realistic without Brent). Structural leak detection (per-feature timestamp check) made primary; accuracy>75% demoted to secondary smell test. `weekly_forecasts` PK changed to `(forecast_for, model_version)` to preserve audit on retrain. `forecast_outcomes` made three-class. Layer 5 station-churn trigger gated until ≥180 days of stable polling. | Multipliers obscure calibration. Gates compose cleanly and stay auditable. |

---
## References

- Alquist, Kilian, Vigfusson (2013) — *Forecasting the Price of Oil* —
  the academic basis for "no-change baseline beats most structural models
  at <6m horizons" (which is why Phase 2 matters as a hard floor).
- BEIS *Weekly road fuel prices* CSV — the 435-week training set.
- `.claude/rules/scoring.md`, `.claude/rules/prediction.md` — the two
  inconsistent rule files this spec replaces.