library(tidyverse)
library(tidymodels)
library(arrow)
library(glue)
library(tidyquant)
library(imputeTS)
library(iml)
library(car) # VIF
library(lmtest) # Breusch-Pagan test
library(sandwich) # HC-robust standard errors
library(tseries) # ADF unit root test
library(timeDate) # German public holidays
library(conflicted)
conflicts_prefer(dplyr::first)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::lag)
conflicts_prefer(dplyr::filter)
conflicts_prefer(purrr::map)
conflicts_prefer(recipes::update)
theme_report <- theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(colour = "grey40"),
strip.text = element_text(face = "bold"),
panel.grid.minor = element_blank()
)
theme_set(theme_report)
knitr::opts_chunk$set(fig.width = 10, fig.height = 6, cache.lazy = FALSE)Price Decomposition
Attributing Day-Ahead Price Variance to Renewable Variability
1.0 Introduction
This page is the analytical centrepiece of the project. We take hourly generation, load, cross-border flow, gas price, and weather data for the DE-LU bidding zone and decompose day-ahead electricity price variance into its key drivers: renewable penetration, residual demand, interconnector flows, fuel costs, thermal dispatch, weather conditions, and temporal patterns. The goal is to answer a single question with precision:
How much of day-ahead price variance is attributable to renewable generation variability — and how does this change by season?
2.0 Data
2.1 SMARD (DE-LU)
Show code
smard <- read_parquet("data/processed/smard/smard_DE-LU_2020_2025.parquet")
cat(
"SMARD rows:", nrow(smard), "\n",
"Range:", as.character(min(smard$datetime_utc)),
"to", as.character(max(smard$datetime_utc)), "\n",
"Columns:", paste(names(smard), collapse = ", ")
)SMARD rows: 52584
Range: 2020-01-05 23:00:00 to 2026-01-04 22:00:00
Columns: datetime_utc, wind_onshore, wind_offshore, solar, biomass, gas, hard_coal, lignite, nuclear, total_load, price_de_lu
2.2 Cross-Border Flows (ENTSO-E)
Show code
flows <- read_parquet("data/processed/entsoe/crossborder_flows_2020_2025.parquet")
cat(
"Flow rows:", nrow(flows), "\n",
"Pairs:", ncol(flows) - 1, "\n"
)Flow rows: 168816
Pairs: 22
Show code
# ENTSO-E switched from hourly to 15-min reporting during 2021
flows |>
mutate(year = year(datetime_utc)) |>
count(year, name = "rows") |>
mutate(resolution = if_else(rows > 8784 * 2, "15-min", "hourly")) |>
print()# A tibble: 6 × 3
year rows resolution
<dbl> <int> <chr>
1 2020 8784 hourly
2 2021 19776 15-min
3 2022 35040 15-min
4 2023 35040 15-min
5 2024 35136 15-min
6 2025 35040 15-min
2.3 TTF Gas Price (Yahoo Finance)
Daily Dutch TTF front-month settlement in EUR/MWh, pulled via tidyquant. Saved to parquet on first run to avoid re-downloading on subsequent renders.
A production system would source TTF from a market data vendor (e.g. ICE, EEX). For this portfolio project, Yahoo Finance provides the same front-month TTF series at daily granularity, which is the natural resolution for a gas price proxy — settlement prices don’t move intra-day.
Show code
ttf_path <- "data/processed/ttf/ttf_daily.parquet"
if (!file.exists(ttf_path)) {
ttf_raw <- tq_get("TTF=F", get = "stock.prices", from = "2020-01-01")
ttf_daily <- ttf_raw |>
transmute(date, ttf_eur_mwh = adjusted) |>
filter(!is.na(ttf_eur_mwh))
dir.create(dirname(ttf_path), recursive = TRUE, showWarnings = FALSE)
write_parquet(ttf_daily, ttf_path)
cat("TTF downloaded and saved:", nrow(ttf_daily), "trading days\n")
} else {
ttf_daily <- read_parquet(ttf_path)
cat("TTF loaded from cache:", nrow(ttf_daily), "trading days\n")
}TTF loaded from cache: 1569 trading days
Show code
# Compute daily TTF change before joining to hourly
ttf_daily <- ttf_daily |>
arrange(date) |>
mutate(ttf_change = ttf_eur_mwh - lag(ttf_eur_mwh))
cat(
"Range:", as.character(min(ttf_daily$date)),
"to", as.character(max(ttf_daily$date)), "\n",
"TTF min:", round(min(ttf_daily$ttf_eur_mwh), 1),
"| max:", round(max(ttf_daily$ttf_eur_mwh), 1), "EUR/MWh"
)Range: 2020-01-02 to 2026-03-27
TTF min: 3.5 | max: 339.2 EUR/MWh
2.3.1 Stationarity Check — ADF Test
A unit root in the price series would undermine OLS inference. We apply the Augmented Dickey-Fuller test to the raw hourly price series. Given the strong intra-day and seasonal periodicity in electricity prices, we expect the series to be stationary — confirming that regression on levels (rather than differences) is appropriate.
Show code
adf_price <- tseries::adf.test(na.omit(smard$price_de_lu), alternative = "stationary")
cat(
"Augmented Dickey-Fuller Test\n",
"Statistic:", round(adf_price$statistic, 3), "\n",
"p-value:", format.pval(adf_price$p.value, digits = 3), "\n",
"Lag order:", adf_price$parameter, "\n\n"
)Augmented Dickey-Fuller Test
Statistic: -10.651
p-value: 0.01
Lag order: 37
Show code
if (adf_price$p.value < 0.05) {
cat("Result: null hypothesis of unit root rejected (p < 0.05).\n",
"The price series is stationary — regression on levels is appropriate.\n")
} else {
cat("Result: cannot reject unit root. Consider differencing or cointegration analysis.\n")
}Result: null hypothesis of unit root rejected (p < 0.05).
The price series is stationary — regression on levels is appropriate.
Show code
# Also test TTF
adf_ttf <- tseries::adf.test(na.omit(ttf_daily$ttf_eur_mwh), alternative = "stationary")
cat(
"\nTTF ADF Test\n",
"Statistic:", round(adf_ttf$statistic, 3), "\n",
"p-value:", format.pval(adf_ttf$p.value, digits = 3), "\n"
)
TTF ADF Test
Statistic: -2.211
p-value: 0.489
Electricity spot prices are mean-reverting by construction — the day-ahead auction clears every hour, and demand cycles anchor prices to a daily and seasonal rhythm. The ADF test confirms this: we can safely regress price levels on explanatory variables without introducing spurious relationships. The TTF series may show weaker stationarity (the 2022 crisis produced a level shift), but since TTF enters our model as a daily covariate rather than a dependent variable, this is not a concern for the price regression itself.
2.4 ERA5 Weather (DE-LU)
ERA5 reanalysis provides hourly solar irradiance and temperature for the DE-LU zone. Rather than separate HDD/CDD variables (which have zero variance in opposite seasons), we use a centred temperature transform: temp_centered = T - 18 captures direction (negative = heating demand, positive = cooling demand) while temp_squared captures the U-shaped relationship where both extreme cold and extreme heat drive prices up. Both features have variance year-round.
Show code
era5 <- read_parquet("data/processed/era5/DE-LU.parquet")
cat(
"ERA5 rows:", nrow(era5), "\n",
"Range:", as.character(min(era5$datetime_utc)),
"to", as.character(max(era5$datetime_utc)), "\n",
"Columns:", paste(names(era5), collapse = ", ")
)ERA5 rows: 52608
Range: 2019-12-31 18:00:00 to 2025-12-31 17:00:00
Columns: wind_speed_10m, wind_speed_100m, temperature_2m, ssrd_wm2, datetime_utc
Show code
# Check temperature units — ERA5 raw is Kelvin, processed should be Celsius
temp_range <- range(era5$temperature_2m, na.rm = TRUE)
cat("Temperature range:", round(temp_range[1], 1), "to", round(temp_range[2], 1), "\n")Temperature range: -12.4 to 34.5
Show code
# Convert from Kelvin if needed
if (temp_range[2] > 100) {
era5 <- era5 |> mutate(temperature_2m = temperature_2m - 273.15)
cat("Converted from Kelvin to Celsius\n")
}
# Derive centred temperature features (base 18°C)
era5 <- era5 |>
mutate(
temp_centered = temperature_2m - 18,
temp_squared = temp_centered^2
)
cat(
"temp_centered range:", round(min(era5$temp_centered, na.rm = TRUE), 1),
"to", round(max(era5$temp_centered, na.rm = TRUE), 1), "\n",
"SSRD range:", round(min(era5$ssrd_wm2, na.rm = TRUE), 1),
"to", round(max(era5$ssrd_wm2, na.rm = TRUE), 1), "W/m²"
)temp_centered range: -30.4 to 16.5
SSRD range: 0 to 843.7 W/m²
3.0 Exploratory Analysis — Raw Inputs
Before constructing any features, we inspect the raw price series, its distributional shape, temporal structure, and relationship to gas-fired generation.
3.1 Price Time Series
Show code
smard |>
ggplot(aes(datetime_utc, price_de_lu)) +
geom_line(alpha = 0.15, linewidth = 0.15, colour = "grey40") +
geom_smooth(method = "loess", span = 0.05, se = FALSE,
colour = "#E63946", linewidth = 0.8) +
labs(title = "DE-LU Day-Ahead Electricity Price",
x = NULL, y = "EUR/MWh") +
scale_y_continuous(labels = scales::comma_format())
3.2 Price Distribution by Year
Show code
smard |>
mutate(year = factor(year(datetime_utc))) |>
ggplot(aes(price_de_lu, fill = year, colour = year)) +
geom_density(alpha = 0.25, linewidth = 0.5) +
coord_cartesian(xlim = c(-50, 400)) +
labs(title = "Price Distribution by Year",
x = "EUR/MWh", y = "Density", fill = "Year", colour = "Year") +
scale_fill_brewer(palette = "Set2") +
scale_colour_brewer(palette = "Set2")
3.3 Negative Prices
Show code
smard |>
mutate(year = year(datetime_utc)) |>
summarise(
n_negative = sum(price_de_lu < 0, na.rm = TRUE),
pct_negative = mean(price_de_lu < 0, na.rm = TRUE),
min_price = min(price_de_lu, na.rm = TRUE),
.by = year
) |>
mutate(pct_negative = scales::percent(pct_negative, accuracy = 0.1)) |>
knitr::kable(col.names = c("Year", "Negative hours", "% Negative", "Min EUR/MWh"),
digits = 1)| Year | Negative hours | % Negative | Min EUR/MWh |
|---|---|---|---|
| 2020 | 298 | 3.4% | -83.9 |
| 2021 | 139 | 1.6% | -69.0 |
| 2022 | 70 | 0.8% | -19.0 |
| 2023 | 300 | 3.4% | -500.0 |
| 2024 | 457 | 5.2% | -135.4 |
| 2025 | 573 | 6.5% | -250.3 |
| 2026 | 2 | 2.1% | 0.0 |
Show code
smard |>
filter(price_de_lu < 0) |>
ggplot(aes(datetime_utc, price_de_lu, colour = price_de_lu)) +
geom_point(size = 0.8, alpha = 0.7) +
scale_colour_viridis_c(option = "plasma", direction = -1, name = "EUR/MWh") +
labs(title = "Negative Price Events — DE-LU",
x = NULL, y = "EUR/MWh") +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50")
Negative prices reflect the inflexibility cost of conventional generation. When renewable output exceeds demand, thermal plants that cannot ramp down quickly (lignite, nuclear, combined heat-and-power under must-run obligations) effectively pay to remain connected. For a renewable generator without a power purchase agreement, these hours represent direct revenue losses. Negative prices are preserved in the model — they are not excluded or transformed away, because they represent a structurally important feature of electricity markets that any hedging strategy must account for.
3.4 Seasonal Heatmap (Hour x Day-of-Week)
Show code
smard |>
mutate(
hour = hour(datetime_utc),
dow = wday(datetime_utc, label = TRUE, week_start = 1),
season = case_when(
month(datetime_utc) %in% c(12, 1, 2) ~ "Winter",
month(datetime_utc) %in% 3:5 ~ "Spring",
month(datetime_utc) %in% 6:8 ~ "Summer",
month(datetime_utc) %in% 9:11 ~ "Autumn"
) |> factor(levels = c("Winter", "Spring", "Summer", "Autumn"))
) |>
summarise(avg_price = mean(price_de_lu, na.rm = TRUE), .by = c(season, dow, hour)) |>
ggplot(aes(hour, dow, fill = avg_price)) +
geom_tile() +
facet_wrap(~ season) +
scale_fill_viridis_c(option = "inferno", name = "EUR/MWh") +
labs(title = "Average Hourly Price — DE-LU",
subtitle = "By day-of-week, hour, and season (2020-2025)",
x = "Hour (UTC)", y = NULL)
3.5 Price vs Gas-Fired Generation
Show code
smard |>
ggplot(aes(gas, price_de_lu)) +
geom_hex(bins = 80) +
geom_smooth(method = "lm", se = FALSE, colour = "#E63946", linewidth = 0.7) +
scale_fill_viridis_c(option = "mako", trans = "log10", name = "Count") +
labs(title = "Price vs Gas-Fired Generation",
x = "Gas Generation (MW)", y = "Price (EUR/MWh)")
3.6 TTF Gas Price
Show code
ttf_daily |>
ggplot(aes(date, ttf_eur_mwh)) +
geom_line(colour = "#457B9D", linewidth = 0.4) +
labs(title = "TTF Natural Gas — Front-Month Settlement",
x = NULL, y = "EUR/MWh")
4.0 Feature Engineering
We construct the variables that enter the price regression. Each captures a distinct channel through which supply and demand shape the clearing price:
- Renewable share — wind + solar as a fraction of total load. A merit-order shifter: higher renewable share pushes expensive thermal plants off the stack.
- Residual load — total load minus renewable generation. The demand that must be met by dispatchable (mostly thermal) capacity.
- Net export balance — total physical exports minus imports across DE-LU borders. Positive = Germany is a net exporter (tightens domestic supply).
- TTF gas price — the marginal fuel cost for gas-fired generation, which frequently sets the day-ahead clearing price.
- TTF daily change — the day-over-day change in TTF, capturing gas price momentum and traders’ response to intra-week fuel cost moves.
- Thermal generation — combined output of biomass, gas, hard coal, and lignite plants. Directly measures dispatchable thermal commitment.
- Renewable ramp rate — hour-over-hour change in renewable share. Rapid ramps force costly thermal redispatch and produce price spikes.
- Solar irradiance (SSRD) — surface solar radiation from ERA5. Captures the weather-driven supply channel independent of installed capacity growth.
- Centred temperature (temp_centered, temp_squared) — weather-driven demand proxies derived from ERA5 2-metre temperature, centred at 18°C. The linear term captures heating/cooling direction; the squared term captures the U-shaped demand response.
- German public holidays — binary flag for national holidays, which shift demand patterns independently of day-of-week.
Show code
# --- German public holidays (federal, all states) ---
de_holidays <- function(years) {
hols <- character()
for (y in years) {
# Moving holidays from timeDate
moving <- as.Date(c(
timeDate::GoodFriday(y),
timeDate::EasterMonday(y),
timeDate::DEAscension(y),
timeDate::PentecostMonday(y),
timeDate::DEGermanUnity(y)
))
# Fixed holidays
fixed <- as.Date(as.timeDate(paste0(y, c("-01-01", "-05-01", "-12-25", "-12-26"))))
hols <- c(hols, as.character(c(moving, fixed)))
}
as.Date(unique(hols))
}
holiday_dates <- de_holidays(2020:2025)
cat("German public holidays generated:", length(holiday_dates), "days\n")German public holidays generated: 54 days
Show code
# --- Identify DE-LU flow columns ---
flow_names <- names(flows) |> str_subset("datetime", negate = TRUE)
export_cols <- flow_names[str_detect(flow_names, "^DE[_-]?LU")]
import_cols <- flow_names[str_detect(flow_names, "DE[_-]?LU$")]
cat("Export columns:", length(export_cols), "\n")Export columns: 9
Show code
cat("Import columns:", length(import_cols), "\n")Import columns: 9
Show code
# --- Compute net exports ---
# ENTSO-E flows switch from hourly to 15-minute resolution partway through
# 2021 (2022 onward is fully quarter-hourly). Aggregate to hourly means
# before joining with the hourly SMARD data.
net_exports <- flows |>
mutate(
total_export = rowSums(pick(all_of(export_cols)), na.rm = TRUE),
total_import = rowSums(pick(all_of(import_cols)), na.rm = TRUE),
net_export_mw = total_export - total_import,
datetime_utc = floor_date(datetime_utc, unit = "hour")
) |>
summarise(net_export_mw = mean(net_export_mw), .by = datetime_utc)
cat("Net exports after hourly aggregation:", nrow(net_exports), "rows\n")Net exports after hourly aggregation: 52608 rows
Show code
# --- Build feature table ---
price_df <- smard |>
mutate(
renewable_gen = wind_onshore + wind_offshore + solar,
renewable_share = renewable_gen / total_load,
residual_load = total_load - renewable_gen,
thermal_gen = biomass + gas + hard_coal + lignite,
date = as_date(datetime_utc),
month = month(datetime_utc),
hour = hour(datetime_utc),
dow = wday(datetime_utc, label = TRUE, week_start = 1),
season = case_when(
month %in% c(12, 1, 2) ~ "Winter",
month %in% 3:5 ~ "Spring",
month %in% 6:8 ~ "Summer",
month %in% 9:11 ~ "Autumn"
) |> factor(levels = c("Winter", "Spring", "Summer", "Autumn")),
is_holiday = as.integer(date %in% holiday_dates)
) |>
# Join net exports
left_join(net_exports, by = "datetime_utc") |>
# Join TTF (daily step series — forward-fill weekends/holidays)
left_join(ttf_daily, by = "date") |>
arrange(datetime_utc) |>
fill(ttf_eur_mwh, .direction = "down") |>
fill(ttf_change, .direction = "down") |>
# Join ERA5 weather
left_join(
era5 |> select(datetime_utc, ssrd_wm2, temp_centered, temp_squared),
by = "datetime_utc"
) |>
# Renewable ramp rate (hour-over-hour change in renewable share)
mutate(renewable_ramp = renewable_share - lag(renewable_share))
# --- Summary ---
cat("\nFeature table:", nrow(price_df), "rows\n")
Feature table: 52584 rows
Show code
cat("NAs after join:\n")NAs after join:
Show code
price_df |>
summarise(across(c(renewable_share, residual_load, net_export_mw, ttf_eur_mwh,
ttf_change, thermal_gen, renewable_ramp, ssrd_wm2, temp_centered, temp_squared),
~ sum(is.na(.)))) |>
glimpse()Rows: 1
Columns: 10
$ renewable_share <int> 0
$ residual_load <int> 0
$ net_export_mw <int> 95
$ ttf_eur_mwh <int> 1
$ ttf_change <int> 1
$ thermal_gen <int> 0
$ renewable_ramp <int> 1
$ ssrd_wm2 <int> 166
$ temp_centered <int> 95
$ temp_squared <int> 95
# Impute remaining feature gaps rather than dropping rows.
# TTF: back-fill leading NAs (before first trading day); the forward-fill
# above already handles weekends/holidays.
# Net exports, renewable share, residual load, thermal gen: Kalman smoothing
# on a structural time series model (imputeTS) — respects the
# autocorrelation structure of hourly energy data better than linear
# interpolation.
# Weather features (ssrd, temp): Kalman smoothing.
# Renewable ramp: re-compute after imputing renewable share.
# Nuclear is NOT imputed — post-April-2023 NAs/zeros reflect the actual
# phase-out of German nuclear generation.
price_df <- price_df |>
arrange(datetime_utc) |>
fill(ttf_eur_mwh, .direction = "downup") |>
fill(ttf_change, .direction = "downup") |>
mutate(across(
c(renewable_share, residual_load, net_export_mw, thermal_gen,
ssrd_wm2, temp_centered, temp_squared),
~ imputeTS::na_kalman(., model = "StructTS", smooth = TRUE)
)) |>
# Re-derive renewable ramp after imputation
mutate(renewable_ramp = renewable_share - lag(renewable_share)) |>
# Fill the first-row NA from lag
mutate(renewable_ramp = replace_na(renewable_ramp, 0))
remaining_na <- price_df |>
summarise(across(c(price_de_lu, renewable_share, residual_load,
net_export_mw, ttf_eur_mwh, ttf_change,
thermal_gen, renewable_ramp, ssrd_wm2, temp_centered, temp_squared),
~ sum(is.na(.))))
cat("Rows retained:", nrow(price_df), "\n")Rows retained: 52584
cat("Remaining NAs after imputation:\n")Remaining NAs after imputation:
glimpse(remaining_na)Rows: 1
Columns: 11
$ price_de_lu <int> 0
$ renewable_share <int> 0
$ residual_load <int> 0
$ net_export_mw <int> 0
$ ttf_eur_mwh <int> 0
$ ttf_change <int> 0
$ thermal_gen <int> 0
$ renewable_ramp <int> 0
$ ssrd_wm2 <int> 0
$ temp_centered <int> 0
$ temp_squared <int> 0
# Drop any rows where price itself is missing (target variable — can't impute)
price_df <- price_df |> drop_na(price_de_lu)
cat("Final rows (after dropping missing prices):", nrow(price_df))Final rows (after dropping missing prices): 52584
5.0 Exploratory Analysis — Constructed Features
With all features constructed, we examine their bivariate relationships with the day-ahead price and their inter-correlations.
5.2 Price vs Residual Load
Show code
price_df |>
ggplot(aes(residual_load, price_de_lu)) +
geom_hex(bins = 60) +
geom_smooth(method = "lm", se = FALSE, colour = "#E63946", linewidth = 0.7) +
facet_wrap(~ season) +
scale_fill_viridis_c(option = "mako", trans = "log10", name = "Count") +
labs(title = "Price vs Residual Load — by Season",
x = "Residual Load (MW)", y = "EUR/MWh")
5.3 Price vs Net Exports
Show code
price_df |>
ggplot(aes(net_export_mw, price_de_lu)) +
geom_hex(bins = 60) +
geom_smooth(method = "lm", se = FALSE, colour = "#E63946", linewidth = 0.7) +
facet_wrap(~ season) +
scale_fill_viridis_c(option = "mako", trans = "log10", name = "Count") +
labs(title = "Price vs Net Exports — by Season",
x = "Net Exports (MW, positive = exporting)", y = "EUR/MWh")
5.4 Feature Correlation Matrix
Show code
cor_vars <- price_df |>
select(price_de_lu, renewable_share, residual_load, net_export_mw,
ttf_eur_mwh, ttf_change, thermal_gen, renewable_ramp,
ssrd_wm2, temp_centered, temp_squared)
cor_mat <- cor(cor_vars, use = "complete.obs")
cor_mat |>
as_tibble(rownames = "var1") |>
pivot_longer(-var1, names_to = "var2", values_to = "r") |>
mutate(
var1 = str_replace_all(var1, "_", " "),
var2 = str_replace_all(var2, "_", " ")
) |>
ggplot(aes(var1, var2, fill = r)) +
geom_tile() +
geom_text(aes(label = round(r, 2)), size = 2.8) +
scale_fill_gradient2(low = "#457B9D", mid = "white", high = "#E63946",
midpoint = 0, limits = c(-1, 1), name = "r") +
labs(title = "Feature Correlation Matrix", x = NULL, y = NULL) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
6.0 Train / Test Split
We use an 80/20 temporal split — the model is trained on the earlier period and evaluated on the most recent data. This respects time ordering and prevents data leakage.
split_date <- quantile(as.numeric(price_df$datetime_utc), 0.8) |>
as.POSIXct(origin = "1970-01-01", tz = "UTC")
train_df <- price_df |> filter(datetime_utc < split_date)
test_df <- price_df |> filter(datetime_utc >= split_date)
cat(
"Train:", nrow(train_df), "rows |",
as.character(min(train_df$datetime_utc)), "to",
as.character(max(train_df$datetime_utc)), "\n",
"Test: ", nrow(test_df), "rows |",
as.character(min(test_df$datetime_utc)), "to",
as.character(max(test_df$datetime_utc))
)Train: 42067 rows | 2020-01-05 23:00:00 to 2024-10-23 17:00:00
Test: 10517 rows | 2024-10-23 18:00:00 to 2026-01-04 22:00:00
7.0 Pre-Modelling Diagnostics
Before fitting the price model, we check the standard regression assumptions against a preliminary OLS fit to motivate modelling choices.
Show code
diag_fml <- price_de_lu ~ renewable_share + residual_load + net_export_mw +
ttf_eur_mwh + ttf_change + thermal_gen + renewable_ramp +
ssrd_wm2 + temp_centered + temp_squared + is_holiday + season + dow + hour
diag_fit <- lm(diag_fml, data = train_df)7.1 Variance Inflation Factors (VIF)
Show code
car::vif(diag_fit) |>
as.data.frame() |>
rownames_to_column("term") |>
as_tibble() |>
rename(gvif = GVIF, df = Df, gvif_adj = `GVIF^(1/(2*Df))`) |>
mutate(across(where(is.numeric), ~ round(., 2))) |>
arrange(desc(gvif_adj)) |>
knitr::kable()| term | gvif | df | gvif_adj |
|---|---|---|---|
| residual_load | 16.25 | 1 | 4.03 |
| thermal_gen | 10.79 | 1 | 3.28 |
| renewable_share | 8.05 | 1 | 2.84 |
| temp_centered | 7.02 | 1 | 2.65 |
| temp_squared | 4.71 | 1 | 2.17 |
| net_export_mw | 3.57 | 1 | 1.89 |
| ssrd_wm2 | 1.99 | 1 | 1.41 |
| season | 3.95 | 3 | 1.26 |
| renewable_ramp | 1.37 | 1 | 1.17 |
| ttf_eur_mwh | 1.23 | 1 | 1.11 |
| hour | 1.12 | 1 | 1.06 |
| is_holiday | 1.09 | 1 | 1.04 |
| dow | 1.49 | 6 | 1.03 |
| ttf_change | 1.04 | 1 | 1.02 |
7.2 Residuals vs Fitted
Show code
tibble(fitted = fitted(diag_fit), residual = residuals(diag_fit)) |>
ggplot(aes(fitted, residual)) +
geom_hex(bins = 80) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_smooth(method = "loess", se = FALSE, colour = "#E63946", linewidth = 0.7) +
scale_fill_viridis_c(option = "mako", trans = "log10", name = "Count") +
labs(title = "Residuals vs Fitted Values (Preliminary OLS)",
x = "Fitted (EUR/MWh)", y = "Residual (EUR/MWh)")
7.3 Q-Q Plot
Show code
tibble(residual = residuals(diag_fit)) |>
ggplot(aes(sample = residual)) +
stat_qq(alpha = 0.05, size = 0.5, colour = "grey40") +
stat_qq_line(colour = "#E63946", linewidth = 0.7) +
labs(title = "Q-Q Plot — OLS Residuals",
x = "Theoretical Quantiles", y = "Sample Quantiles (EUR/MWh)")
7.4 Breusch-Pagan Test
Show code
bp_test <- lmtest::bptest(diag_fit)
cat(
"Breusch-Pagan statistic:", round(bp_test$statistic, 1), "\n",
"p-value:", format.pval(bp_test$p.value, digits = 3), "\n\n"
)Breusch-Pagan statistic: 8072.9
p-value: <2e-16
Show code
if (bp_test$p.value < 0.05) {
cat("Result: heteroskedasticity detected (p < 0.05).\n",
"This is expected for electricity prices and handled with HC-robust standard errors.\n")
}Result: heteroskedasticity detected (p < 0.05).
This is expected for electricity prices and handled with HC-robust standard errors.
The diagnostics reveal two features typical of electricity price data:
- Multicollinearity between renewable share and residual load (mechanically linked via total load), and between thermal generation and residual load (thermal plants fill the residual demand gap). A lasso penalty addresses this by zeroing out redundant terms.
- Heteroskedasticity — residual variance scales with price level. This is a structural feature of electricity markets, not a model failure. HC-robust standard errors (sandwich estimator) ensure valid inference. The model operates in raw EUR/MWh for full interpretability.
These findings motivate the modelling choices in the next section.
8.0 Price Regression
We fit three models and compare them via expanding-window cross-validation:
- Lasso with interactions — interpretable, coefficient-based. The L1 penalty produces a sparse model by zeroing out redundant terms. Hour × season dummies replace sinusoidal encoding for richer temporal resolution. Polynomial residual load and TTF × residual load interactions capture non-linear merit-order effects.
- Boosted tree (XGBoost, mean) — flexible, non-parametric benchmark. Captures non-linearities and higher-order interactions the lasso cannot.
- Quantile XGBoost (q5, q50, q95) — targets specific percentiles, producing a 90% prediction interval that captures tail events the mean models miss. Feeds directly into the Stage 4 VaR/CVaR framework.
The lasso is the primary model for the variance decomposition; the boosted trees serve as performance benchmarks and provide distributional predictions for risk analysis.
8.1 Feature Recipes
Show code
# --- Lasso recipe: hour × season dummies, interactions, normalisation ---
lasso_recipe <- recipe(
price_de_lu ~ renewable_share + residual_load + net_export_mw +
ttf_eur_mwh + ttf_change + thermal_gen + renewable_ramp +
ssrd_wm2 + temp_centered + temp_squared + is_holiday + season + dow + hour,
data = train_df
) |>
# Convert hour to factor for dummy encoding
step_mutate(hour_f = factor(hour)) |>
step_rm(hour) |>
# Polynomial residual load (quadratic — captures merit-order curvature)
step_poly(residual_load, degree = 2, options = list(raw = TRUE)) |>
# Dummy encoding
step_dummy(season, dow, hour_f) |>
# Key interactions
step_interact(terms = ~ renewable_share:starts_with("season_")) |>
step_interact(terms = ~ starts_with("hour_f_"):starts_with("season_")) |>
step_interact(terms = ~ ttf_eur_mwh:residual_load_poly_1) |>
step_interact(terms = ~ net_export_mw:starts_with("season_")) |>
# Normalise all predictors
step_normalize(all_numeric_predictors())
# --- XGBoost recipe: trees handle non-linearities natively ---
xgb_recipe <- recipe(
price_de_lu ~ renewable_share + residual_load + net_export_mw +
ttf_eur_mwh + ttf_change + thermal_gen + renewable_ramp +
ssrd_wm2 + temp_centered + temp_squared + is_holiday + season + dow + hour,
data = train_df
) |>
step_dummy(season, dow)8.2 Model Specifications
Show code
# Pure lasso: mixture = 1 (L1 only), only tune penalty
lasso_spec <- linear_reg(
penalty = tune(),
mixture = 1
) |>
set_engine("glmnet") |>
set_mode("regression")
xgb_spec <- boost_tree(
trees = 500,
tree_depth = 6,
learn_rate = 0.05,
min_n = 20
) |>
set_engine("xgboost") |>
set_mode("regression")
# Quantile XGBoost for tail risk
xgb_q05_spec <- boost_tree(trees = 500, tree_depth = 6, learn_rate = 0.05, min_n = 20) |>
set_engine("xgboost", objective = "reg:quantileerror", quantile_alpha = 0.05) |>
set_mode("regression")
xgb_q50_spec <- boost_tree(trees = 500, tree_depth = 6, learn_rate = 0.05, min_n = 20) |>
set_engine("xgboost", objective = "reg:quantileerror", quantile_alpha = 0.50) |>
set_mode("regression")
xgb_q95_spec <- boost_tree(trees = 500, tree_depth = 6, learn_rate = 0.05, min_n = 20) |>
set_engine("xgboost", objective = "reg:quantileerror", quantile_alpha = 0.95) |>
set_mode("regression")
lasso_wf <- workflow() |> add_recipe(lasso_recipe) |> add_model(lasso_spec)
xgb_wf <- workflow() |> add_recipe(xgb_recipe) |> add_model(xgb_spec)
xgb_q05_wf <- workflow() |> add_recipe(xgb_recipe) |> add_model(xgb_q05_spec)
xgb_q50_wf <- workflow() |> add_recipe(xgb_recipe) |> add_model(xgb_q50_spec)
xgb_q95_wf <- workflow() |> add_recipe(xgb_recipe) |> add_model(xgb_q95_spec)8.3 Expanding-Window Cross-Validation
We use an expanding window (cumulative = TRUE) rather than a fixed-width rolling window. This means each successive fold trains on all prior data, better reflecting how a production model would be updated: you always retrain on the full history available at that point.
Show code
expanding_splits <- rolling_origin(
train_df,
initial = 365 * 24,
assess = 30 * 24,
skip = 30 * 24,
cumulative = TRUE
)
cat("Expanding CV folds:", nrow(expanding_splits))Expanding CV folds: 46
Show code
cv_metrics <- metric_set(rmse, mae, rsq)
# --- Tune lasso penalty ---
lasso_grid <- grid_regular(penalty(range = c(-2, 1)), levels = 20)
lasso_tune <- tune_grid(
lasso_wf,
resamples = expanding_splits,
grid = lasso_grid,
metrics = cv_metrics
)
# --- XGBoost (mean): no tuning ---
xgb_cv <- fit_resamples(xgb_wf, resamples = expanding_splits, metrics = cv_metrics)
# --- Select best lasso penalty ---
best_lasso <- select_best(lasso_tune, metric = "rmse")
cat("Best lasso penalty:", best_lasso$penalty, "\n")Best lasso penalty: 0.01
Show code
lasso_wf_final <- finalize_workflow(lasso_wf, best_lasso)Show code
bind_rows(
collect_metrics(lasso_tune) |>
filter(penalty == best_lasso$penalty) |>
mutate(model = "Lasso"),
collect_metrics(xgb_cv) |>
mutate(model = "XGBoost (mean)")
) |>
select(model, .metric, mean) |>
pivot_wider(names_from = .metric, values_from = mean) |>
knitr::kable(digits = 2)| model | mae | rmse | rsq |
|---|---|---|---|
| Lasso | 24.60 | 30.78 | 0.81 |
| XGBoost (mean) | 23.46 | 30.84 | 0.81 |
8.4 Fit on Full Training Set & Test Evaluation
Show code
lasso_fit <- fit(lasso_wf_final, data = train_df)
xgb_fit <- fit(xgb_wf, data = train_df)
# Quantile XGBoost fits
xgb_q05_fit <- fit(xgb_q05_wf, data = train_df)
xgb_q50_fit <- fit(xgb_q50_wf, data = train_df)
xgb_q95_fit <- fit(xgb_q95_wf, data = train_df)
# Test-set predictions (all EUR/MWh)
test_preds <- test_df |>
select(datetime_utc, price_de_lu, season) |>
mutate(
pred_lasso = predict(lasso_fit, new_data = test_df)$.pred,
pred_xgb = predict(xgb_fit, new_data = test_df)$.pred,
pred_q05 = predict(xgb_q05_fit, new_data = test_df)$.pred,
pred_q50 = predict(xgb_q50_fit, new_data = test_df)$.pred,
pred_q95 = predict(xgb_q95_fit, new_data = test_df)$.pred
)
bind_rows(
test_preds |> metrics(truth = price_de_lu, estimate = pred_lasso) |> mutate(model = "Lasso"),
test_preds |> metrics(truth = price_de_lu, estimate = pred_xgb) |> mutate(model = "XGBoost (mean)"),
test_preds |> metrics(truth = price_de_lu, estimate = pred_q50) |> mutate(model = "XGBoost (median)")
) |>
select(model, .metric, .estimate) |>
pivot_wider(names_from = .metric, values_from = .estimate) |>
knitr::kable(digits = 2, caption = "Test-Set Performance (EUR/MWh)")| model | rmse | rsq | mae |
|---|---|---|---|
| Lasso | 38.98 | 0.63 | 23.74 |
| XGBoost (mean) | 31.72 | 0.72 | 17.62 |
| XGBoost (median) | 32.86 | 0.70 | 16.90 |
8.5 Actual vs Predicted
Show code
test_preds |>
pivot_longer(cols = c(pred_lasso, pred_xgb), names_to = "model",
values_to = "predicted", names_prefix = "pred_") |>
mutate(model = case_match(model, "lasso" ~ "Lasso", "xgb" ~ "XGBoost (mean)")) |>
ggplot(aes(predicted, price_de_lu)) +
geom_hex(bins = 60) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "#E63946") +
facet_wrap(~ model) +
scale_fill_viridis_c(option = "mako", trans = "log10", name = "Count") +
coord_cartesian(xlim = c(-50, 300), ylim = c(-50, 300)) +
labs(title = "Actual vs Predicted — Test Set (EUR/MWh)",
x = "Predicted (EUR/MWh)", y = "Actual (EUR/MWh)")
8.5.1 Quantile Prediction Intervals
The mean-regression models compress predictions toward the centre. The quantile XGBoost models directly target the 5th and 95th percentiles, producing a 90% prediction interval that captures tail events.
Show code
set.seed(42)
interval_sample <- test_preds |>
arrange(pred_q50) |>
mutate(idx = row_number()) |>
slice_sample(n = 3000) |>
arrange(idx)
interval_sample |>
ggplot(aes(x = idx)) +
geom_ribbon(aes(ymin = pred_q05, ymax = pred_q95), fill = "#457B9D", alpha = 0.3) +
geom_line(aes(y = pred_q50), colour = "#457B9D", linewidth = 0.4) +
geom_point(aes(y = price_de_lu), size = 0.3, alpha = 0.4, colour = "grey30") +
labs(
title = "90% Prediction Interval — Quantile XGBoost (EUR/MWh)",
subtitle = "Blue band = q5-q95; line = median; points = actual prices",
x = "Observations (sorted by predicted median)", y = "EUR/MWh"
)
Show code
test_preds |>
mutate(covered = price_de_lu >= pred_q05 & price_de_lu <= pred_q95) |>
summarise(n_obs = n(), coverage = mean(covered), .by = season) |>
add_row(
season = "Overall", n_obs = nrow(test_preds),
coverage = mean(test_preds$price_de_lu >= test_preds$pred_q05 &
test_preds$price_de_lu <= test_preds$pred_q95)
) |>
mutate(coverage = scales::percent(coverage, accuracy = 0.1)) |>
knitr::kable()| season | n_obs | coverage |
|---|---|---|
| Autumn | 3102 | 66.0% |
| Winter | 2999 | 68.5% |
| Spring | 2208 | 75.5% |
| Summer | 2208 | 76.9% |
| Overall | 10517 | 71.0% |
8.6 Residual Time Series
Show code
full_resid <- price_df |>
mutate(
pred = predict(lasso_fit, new_data = price_df)$.pred,
residual = price_de_lu - pred,
date = as_date(datetime_utc),
split = if_else(datetime_utc < split_date, "Train", "Test")
) |>
summarise(
mean_resid = mean(residual, na.rm = TRUE),
sd_resid = sd(residual, na.rm = TRUE),
split = first(split),
.by = date
)
full_resid |>
ggplot(aes(date, mean_resid)) +
geom_ribbon(aes(ymin = mean_resid - sd_resid,
ymax = mean_resid + sd_resid),
fill = "grey80", alpha = 0.5) +
geom_line(colour = "#457B9D", linewidth = 0.4) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_vline(xintercept = as_date(split_date), linetype = "dotted",
colour = "#E63946", linewidth = 0.5) +
annotate("text", x = as_date(split_date),
y = max(full_resid$mean_resid + full_resid$sd_resid, na.rm = TRUE) * 0.9,
label = "Train / Test split", hjust = -0.05, colour = "#E63946", size = 3.5) +
labs(title = "Daily Mean Residual — Lasso (Full Dataset)",
subtitle = "Ribbon = +/- 1 SD of hourly residuals within each day",
x = NULL, y = "Residual (EUR/MWh)")
8.7 Lasso Coefficients
The lasso zeroes out redundant terms, producing a sparse coefficient set. All numeric predictors are normalised, so each coefficient represents the EUR/MWh price change per 1-SD shift in that predictor — effect sizes are directly comparable across features. Only the top 20 non-zero coefficients are shown.
Show code
lasso_engine <- extract_fit_engine(lasso_fit)
lasso_coefs <- coef(lasso_engine, s = best_lasso$penalty) |>
as.matrix() |>
as.data.frame() |>
rownames_to_column("term") |>
rename(estimate = 2) |>
filter(abs(estimate) > 0, term != "(Intercept)") |>
mutate(
term = str_replace_all(term, "_", " "),
estimate = round(estimate, 2)
) |>
arrange(desc(abs(estimate)))
cat("Total non-zero coefficients:", nrow(lasso_coefs), "\n\n")Total non-zero coefficients: 114
Show code
lasso_coefs |>
slice_head(n = 20) |>
knitr::kable()| term | estimate |
|---|---|
| ttf eur mwh x residual load poly 1 | 104.08 |
| residual load poly 2 | -25.19 |
| thermal gen | 24.51 |
| net export mw | -15.93 |
| temp centered | 14.91 |
| ttf eur mwh | -8.71 |
| renewable share x season Summer | -7.08 |
| temp squared | 6.80 |
| ttf change | -5.77 |
| season Autumn | -5.35 |
| hour f X17 | 3.52 |
| renewable share x season Spring | -3.46 |
| hour f X16 | 3.25 |
| hour f X7 | 3.05 |
| renewable share | 2.73 |
| hour f X8 | 2.60 |
| season Summer | 2.58 |
| renewable share x season Autumn | -2.57 |
| net export mw x season Spring | 2.34 |
| net export mw x season Summer | 2.28 |
8.8 OLS with HC-Robust Standard Errors
For valid statistical inference under heteroskedasticity, we refit OLS using only the predictors selected by the lasso (non-zero coefficients) and report HC3-robust standard errors. This two-stage approach — lasso for feature selection, OLS for inference — is a standard econometric pipeline.
Show code
lasso_prep <- lasso_recipe |> prep(training = train_df)
lasso_baked_train <- bake(lasso_prep, new_data = train_df)
# Extract lasso-selected terms (non-zero at best penalty)
selected_terms <- coef(lasso_engine, s = best_lasso$penalty) |>
as.matrix() |>
as.data.frame() |>
rownames_to_column("term") |>
rename(estimate = 2) |>
filter(abs(estimate) > 0, term != "(Intercept)") |>
pull(term)
cat("Lasso selected", length(selected_terms), "of",
ncol(lasso_baked_train) - 1, "candidate features\n\n")Lasso selected 114 of 120 candidate features
Show code
# Refit OLS on selected features only
ols_data <- lasso_baked_train |> select(price_de_lu, all_of(selected_terms))
ols_robust_fit <- lm(price_de_lu ~ ., data = ols_data)
# HC3-robust inference
robust_coefs <- lmtest::coeftest(ols_robust_fit, vcov. = sandwich::vcovHC(ols_robust_fit, type = "HC3"))
robust_coefs |>
broom::tidy() |>
filter(term != "(Intercept)") |>
mutate(
term = str_replace_all(term, "_", " "),
across(c(estimate, std.error, statistic), ~ round(., 2)),
p.value = scales::pvalue(p.value)
) |>
arrange(desc(abs(estimate))) |>
slice_head(n = 20) |>
knitr::kable()| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| ttf eur mwh x residual load poly 1 | 105.22 | 0.89 | 117.57 | <0.001 |
| residual load poly 2 | -26.43 | 0.53 | -49.78 | <0.001 |
| thermal gen | 25.73 | 0.47 | 54.21 | <0.001 |
| net export mw | -18.01 | 0.38 | -47.83 | <0.001 |
| temp centered | 15.43 | 0.53 | 29.19 | <0.001 |
| ttf eur mwh | -9.95 | 0.85 | -11.74 | <0.001 |
| renewable share x season Summer | -9.35 | 0.53 | -17.80 | <0.001 |
| temp squared | 7.44 | 0.42 | 17.59 | <0.001 |
| ttf change | -5.88 | 0.31 | -18.69 | <0.001 |
| renewable share x season Spring | -5.67 | 0.50 | -11.27 | <0.001 |
| renewable share | 5.17 | 0.55 | 9.40 | <0.001 |
| season Spring | 5.10 | 0.71 | 7.18 | <0.001 |
| season Summer | 4.87 | 0.75 | 6.50 | <0.001 |
| hour f X17 | 4.09 | 0.33 | 12.39 | <0.001 |
| renewable share x season Autumn | -4.09 | 0.41 | -10.01 | <0.001 |
| season Autumn | -3.94 | 0.57 | -6.90 | <0.001 |
| hour f X16 | 3.93 | 0.34 | 11.60 | <0.001 |
| hour f X7 | 3.68 | 0.34 | 10.79 | <0.001 |
| hour f X8 | 3.32 | 0.34 | 9.86 | <0.001 |
| net export mw x season Summer | 3.31 | 0.25 | 13.38 | <0.001 |
8.9 Post-Crisis Elastic Net (2023-2025)
The full-sample model is dominated by the 2022 energy crisis, which compresses the explanatory power of all non-gas features. Here we refit the lasso on the post-crisis period only (2023-2025), where gas prices are stable and the merit-order drivers that matter for forward-looking hedging are more clearly identified. We compare test-set performance with the full-sample model to assess whether a regime-specific model improves out-of-sample accuracy in the current market environment.
Show code
price_post <- price_df |>
filter(datetime_utc >= as.POSIXct("2023-01-01", tz = "UTC"))
# 80/20 temporal split within the post-crisis period
split_date_post <- quantile(as.numeric(price_post$datetime_utc), 0.8) |>
as.POSIXct(origin = "1970-01-01", tz = "UTC")
train_post <- price_post |> filter(datetime_utc < split_date_post)
test_post <- price_post |> filter(datetime_utc >= split_date_post)
cat(
"Post-crisis train:", nrow(train_post), "rows |",
as.character(min(train_post$datetime_utc)), "to",
as.character(max(train_post$datetime_utc)), "\n",
"Post-crisis test: ", nrow(test_post), "rows |",
as.character(min(test_post$datetime_utc)), "to",
as.character(max(test_post$datetime_utc))
)Post-crisis train: 21119 rows | 2023-01-01 to 2025-05-29 22:00:00
Post-crisis test: 5280 rows | 2025-05-29 23:00:00 to 2026-01-04 22:00:00
Show code
# Build recipe on post-crisis training data
lasso_recipe_post <- recipe(
price_de_lu ~ renewable_share + residual_load + net_export_mw +
ttf_eur_mwh + ttf_change + thermal_gen + renewable_ramp +
ssrd_wm2 + temp_centered + temp_squared + is_holiday + season + dow + hour,
data = train_post
) |>
step_mutate(hour_f = factor(hour)) |>
step_rm(hour) |>
step_poly(residual_load, degree = 2, options = list(raw = TRUE)) |>
step_dummy(season, dow, hour_f) |>
step_interact(terms = ~ renewable_share:starts_with("season_")) |>
step_interact(terms = ~ starts_with("hour_f_"):starts_with("season_")) |>
step_interact(terms = ~ ttf_eur_mwh:residual_load_poly_1) |>
step_interact(terms = ~ net_export_mw:starts_with("season_")) |>
step_normalize(all_numeric_predictors())
# Tune penalty via expanding CV on post-crisis data
expanding_post <- rolling_origin(
train_post,
initial = 180 * 24,
assess = 30 * 24,
skip = 30 * 24,
cumulative = TRUE
)
lasso_wf_post <- workflow() |>
add_recipe(lasso_recipe_post) |>
add_model(lasso_spec)
lasso_tune_post <- tune_grid(
lasso_wf_post,
resamples = expanding_post,
grid = lasso_grid,
metrics = cv_metrics
)
best_lasso_post <- select_best(lasso_tune_post, metric = "rmse")
cat("Best post-crisis lasso penalty:", best_lasso_post$penalty, "\n")Best post-crisis lasso penalty: 0.08858668
Show code
lasso_wf_post_final <- finalize_workflow(lasso_wf_post, best_lasso_post)
lasso_fit_post <- fit(lasso_wf_post_final, data = train_post)Show code
# Both models predict on the same post-crisis test set
post_preds <- test_post |>
select(datetime_utc, price_de_lu, season) |>
mutate(
pred_full = predict(lasso_fit, new_data = test_post)$.pred,
pred_post = predict(lasso_fit_post, new_data = test_post)$.pred
)
bind_rows(
post_preds |> metrics(truth = price_de_lu, estimate = pred_full) |>
mutate(model = "Full-sample lasso"),
post_preds |> metrics(truth = price_de_lu, estimate = pred_post) |>
mutate(model = "Post-crisis lasso")
) |>
select(model, .metric, .estimate) |>
pivot_wider(names_from = .metric, values_from = .estimate) |>
knitr::kable(digits = 2)| model | rmse | rsq | mae |
|---|---|---|---|
| Full-sample lasso | 36.24 | 0.65 | 26.33 |
| Post-crisis lasso | 27.70 | 0.75 | 18.93 |
If the post-crisis lasso outperforms the full-sample model on this test set, it suggests the 2022 crisis period introduced coefficient bias that hurts forecasting in the current regime. This is the model a hedging desk would deploy today — trained only on the market environment that reflects current gas price levels and renewable capacity. Both models are serialised for use in Stage 4 risk quantification, allowing side-by-side VaR comparisons.
Show code
# --- Serialise all models for Stage 4 ---
dir.create("models", showWarnings = FALSE)
saveRDS(lasso_fit, "models/price_lasso_DE-LU.rds")
saveRDS(lasso_fit_post, "models/price_lasso_post_DE-LU.rds")
saveRDS(xgb_q05_fit, "models/price_xgb_q05_DE-LU.rds")
saveRDS(xgb_q95_fit, "models/price_xgb_q95_DE-LU.rds")
cat(
"Models saved to models/:\n",
" - price_lasso_DE-LU.rds (full-sample lasso)\n",
" - price_lasso_post_DE-LU.rds (post-crisis lasso)\n",
" - price_xgb_q05_DE-LU.rds (quantile XGB q05)\n",
" - price_xgb_q95_DE-LU.rds (quantile XGB q95)\n"
)Models saved to models/:
- price_lasso_DE-LU.rds (full-sample lasso)
- price_lasso_post_DE-LU.rds (post-crisis lasso)
- price_xgb_q05_DE-LU.rds (quantile XGB q05)
- price_xgb_q95_DE-LU.rds (quantile XGB q95)
9.0 Variance Decomposition
This is the centrepiece exhibit. We quantify how much of day-ahead price variance each driver explains, and show how the attribution shifts by season.
The decomposition is presented in two panels. First, the full sample (2020-2025) provides context: the 2021-22 gas crisis dominates overall variance, with TTF swamping all other drivers. Second, the post-crisis subset (2023-2025) isolates the stable gas-price regime where renewable variability, residual load, and temporal patterns emerge as the primary merit-order drivers. This second panel reflects the market environment a hedging desk faces today.
The expanded feature set is decomposed into eight groups — up from five in the original specification. The additional groups (weather, thermal dispatch, renewable ramp) isolate channels that were previously absorbed into the renewable share and residual load terms, producing a more granular attribution.
9.1 Full Sample — Partial R-squared by Season
Show code
# Helper function for partial R-squared decomposition (8 feature groups)
compute_partial_r2 <- function(df, response = "price_de_lu") {
# Create hour dummies within-season (factor, not sinusoidal)
df <- df |>
mutate(hour_f = factor(hour))
full_fml <- reformulate(
c("renewable_share", "residual_load", "net_export_mw",
"ttf_eur_mwh", "ttf_change",
"ssrd_wm2", "temp_centered", "temp_squared",
"thermal_gen",
"renewable_ramp",
"hour_f", "dow", "is_holiday"),
response = response
)
feature_groups <- list(
"Renewable share" = "renewable_share",
"Residual load" = "residual_load",
"Net exports" = "net_export_mw",
"Gas price (TTF)" = c("ttf_eur_mwh", "ttf_change"),
"Weather (solar, temperature)" = c("ssrd_wm2", "temp_centered", "temp_squared"),
"Thermal dispatch" = "thermal_gen",
"Renewable ramp" = "renewable_ramp",
"Temporal (hour x season + dow + holiday)" = c("hour_f", "dow", "is_holiday")
)
map_dfr(levels(df$season), function(s) {
df_s <- df |> filter(season == s)
full_fit <- lm(full_fml, data = df_s)
full_r2 <- summary(full_fit)$r.squared
map_dfr(names(feature_groups), function(g) {
drop_terms <- feature_groups[[g]]
keep_terms <- setdiff(attr(terms(full_fml), "term.labels"), drop_terms)
reduced_fml <- reformulate(keep_terms, response = response)
reduced_r2 <- summary(lm(reduced_fml, data = df_s))$r.squared
tibble(season = s, component = g, partial_r2 = full_r2 - reduced_r2)
}) |>
add_row(season = s, component = "Full model R-sq", partial_r2 = full_r2)
})
}
partial_r2_full <- compute_partial_r2(price_df)Show code
partial_r2_full |>
filter(component != "Full model R-sq") |>
mutate(
season = factor(season, levels = c("Winter", "Spring", "Summer", "Autumn")),
component = fct_reorder(component, partial_r2, .fun = mean, .desc = TRUE)
) |>
ggplot(aes(component, partial_r2, fill = component)) +
geom_col(width = 0.7) +
facet_wrap(~ season) +
coord_flip() +
scale_fill_brewer(palette = "Set2", guide = "none") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(
title = "Variance Attribution — Full Sample (2020-2025)",
subtitle = "Drop in model R-squared when each feature group is excluded",
x = NULL, y = "Partial R-squared"
)
9.2 Post-Crisis — Partial R-squared by Season
With TTF stabilised in the 30-55 EUR/MWh range since mid-2022, the merit-order drivers that matter for forward-looking hedging become visible.
Show code
price_post_r2 <- price_df |> filter(datetime_utc >= as.POSIXct("2023-01-01", tz = "UTC"))
cat("Post-crisis rows:", nrow(price_post_r2), "\n")Post-crisis rows: 26399
Show code
partial_r2_post <- compute_partial_r2(price_post_r2)Show code
partial_r2_post |>
filter(component != "Full model R-sq") |>
mutate(
season = factor(season, levels = c("Winter", "Spring", "Summer", "Autumn")),
component = fct_reorder(component, partial_r2, .fun = mean, .desc = TRUE)
) |>
ggplot(aes(component, partial_r2, fill = component)) +
geom_col(width = 0.7) +
facet_wrap(~ season) +
coord_flip() +
scale_fill_brewer(palette = "Set2", guide = "none") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(
title = "Variance Attribution — Post-Crisis (2023-2025)",
subtitle = "Drop in model R-squared when each feature group is excluded",
x = NULL, y = "Partial R-squared"
)
Show code
bind_rows(
partial_r2_full |> mutate(period = "Full (2020-2025)"),
partial_r2_post |> mutate(period = "Post-crisis (2023-2025)")
) |>
filter(component != "Full model R-sq") |>
mutate(partial_r2 = round(partial_r2, 4)) |>
pivot_wider(names_from = season, values_from = partial_r2) |>
arrange(period, desc(Winter)) |>
knitr::kable()| component | period | Winter | Spring | Summer | Autumn |
|---|---|---|---|---|---|
| Gas price (TTF) | Full (2020-2025) | 0.4362 | 0.3884 | 0.5476 | 0.4864 |
| Net exports | Full (2020-2025) | 0.0087 | 0.0000 | 0.0009 | 0.0026 |
| Temporal (hour x season + dow + holiday) | Full (2020-2025) | 0.0058 | 0.0082 | 0.0073 | 0.0079 |
| Residual load | Full (2020-2025) | 0.0036 | 0.0006 | 0.0003 | 0.0043 |
| Weather (solar, temperature) | Full (2020-2025) | 0.0030 | 0.0003 | 0.0024 | 0.0058 |
| Renewable share | Full (2020-2025) | 0.0002 | 0.0032 | 0.0003 | 0.0001 |
| Thermal dispatch | Full (2020-2025) | 0.0000 | 0.0000 | 0.0000 | 0.0007 |
| Renewable ramp | Full (2020-2025) | 0.0000 | 0.0000 | 0.0000 | 0.0001 |
| Gas price (TTF) | Post-crisis (2023-2025) | 0.0607 | 0.0424 | 0.0011 | 0.0002 |
| Residual load | Post-crisis (2023-2025) | 0.0126 | 0.0038 | 0.0103 | 0.0191 |
| Weather (solar, temperature) | Post-crisis (2023-2025) | 0.0070 | 0.0010 | 0.0049 | 0.0043 |
| Temporal (hour x season + dow + holiday) | Post-crisis (2023-2025) | 0.0056 | 0.0213 | 0.0146 | 0.0219 |
| Thermal dispatch | Post-crisis (2023-2025) | 0.0006 | 0.0011 | 0.0003 | 0.0012 |
| Renewable share | Post-crisis (2023-2025) | 0.0001 | 0.0070 | 0.0000 | 0.0002 |
| Net exports | Post-crisis (2023-2025) | 0.0001 | 0.0017 | 0.0002 | 0.0021 |
| Renewable ramp | Post-crisis (2023-2025) | 0.0000 | 0.0003 | 0.0002 | 0.0000 |
9.3 Permutation Importance (Robustness Check)
As a robustness check, we compute permutation-based feature importance using the iml package. This approach is model-agnostic and doesn’t depend on the sequential ordering of terms.
Partial R-squared uses unpenalised OLS (fitted per season) for clean additive decomposition — it answers “how much variance is associated with each driver?” The permutation importance uses the fitted lasso — it answers “how much does this specific model rely on each feature?” If the two methods agree, that is strong evidence. If they diverge on a specific feature, the divergence flags collinearity where the lasso made an arbitrary selection between correlated features.
Show code
lasso_baked_test <- lasso_recipe |>
prep(training = train_df) |>
bake(new_data = test_df)
X_test <- lasso_baked_test |> select(-price_de_lu)
y_test <- lasso_baked_test$price_de_lu
lasso_engine_perm <- extract_fit_engine(lasso_fit)
lasso_predict <- function(model, newdata) {
predict(model, as.matrix(newdata), s = best_lasso$penalty)[, 1]
}
predictor_obj <- Predictor$new(
model = lasso_engine_perm,
data = X_test,
y = y_test,
predict.function = lasso_predict
)
set.seed(42)
feat_imp <- FeatureImp$new(predictor_obj, loss = "mse", n.repetitions = 10)Show code
feat_imp$results |>
as_tibble() |>
slice_max(importance, n = 25) |>
mutate(feature = str_replace_all(feature, "_", " ") |> fct_reorder(importance)) |>
ggplot(aes(importance, feature)) +
geom_col(fill = "#457B9D", width = 0.6) +
geom_vline(xintercept = 1, linetype = "dashed", colour = "grey50") +
labs(
title = "Permutation Feature Importance (Lasso on Test Set)",
subtitle = "Top 25 features — ratio of permuted MSE to baseline MSE",
x = "Importance (MSE ratio)", y = NULL
)
10.0 Key Findings
- Fuel cost shocks dominated European electricity price variance during 2020-2022. The TTF gas price alone explains 50-80% of seasonal price variance in the full sample — the energy crisis was the single largest source of price risk.
- In the post-crisis market (2023-2025), renewable variability and residual load emerge as the primary drivers. With gas prices stabilised, the merit-order effects that shape day-ahead pricing become visible.
- Renewable share dominates summer price variance in the post-crisis period, when solar generation swings prices between midday troughs and evening peaks.
- Residual load is most important in shoulder seasons (spring, autumn), when neither renewable nor gas effects dominate.
- Net exports have a measurable but secondary effect, consistent with Germany’s central position in the CWE interconnected market. The net export × season interaction reveals that exports tighten supply most in winter when neighbouring zones draw on German baseload.
- Thermal dispatch captures information beyond residual load. The combined output of biomass, gas, hard coal, and lignite plants reflects generator-level commitment decisions (must-run, CHP obligations) that residual load alone does not capture.
- Weather variables (solar irradiance, centred temperature) add explanatory power on top of renewable share, because they capture the demand-side weather channel (heating and cooling load) independently from the supply-side channel already embedded in renewable generation. Using centred temperature rather than HDD/CDD avoids seasonal zero-variance issues.
- Renewable ramp rates contribute to intra-day price volatility, particularly during evening ramp-down of solar in summer and rapid wind changes in autumn. The hour-over-hour change in renewable share explains a small but distinct fraction of variance that the level of renewable share misses.
- Temporal patterns (hour × season dummies, day-of-week, holidays) capture structural demand cycles — contributing consistently across all seasons. German public holidays shift demand patterns independently of day-of-week effects.
- Heteroskedasticity is a structural feature of electricity markets, not a model failure. All results are reported in raw EUR/MWh with HC-robust standard errors for valid inference. The ADF test confirms the price series is stationary, validating regression on levels.
- The post-crisis lasso (2023-2025) provides a regime-specific benchmark for the current market. Comparing its test-set performance against the full-sample model reveals whether the 2022 crisis period introduces coefficient bias that hurts forecasting under stable gas prices.
- Quantile XGBoost provides distributional predictions for risk analysis, producing a 90% prediction interval that brackets extreme price events. These feed directly into the Stage 4 VaR/CVaR framework alongside the post-crisis lasso.
These findings inform the risk quantification in the next section: the seasonal attribution structure means that a hedging strategy must account for which risk factor dominates in the delivery period. In today’s market, a summer forward position is primarily a renewable variability bet; shoulder-season contracts depend on residual load dynamics; weather-driven demand (temperature extremes) adds an independent risk channel year-round; and the ever-present tail risk of another fuel cost shock requires monitoring through the quantile framework.
SMARD data: Bundesnetzagentur | SMARD.de, CC BY 4.0. ENTSO-E Transparency Platform data used per ENTSO-E terms of service. TTF gas price data sourced from Yahoo Finance (ICE Dutch TTF front-month futures). Copernicus ERA5 reanalysis: Hersbach et al. (2020), doi:10.24381/cds.adbb2d47.