Weather → Generation

Cross-zone weather profiles and DE-LU capacity factor models

1.0 Overview

This page has two parts. First, we compare ERA5 reanalysis weather profiles across all available bidding zones — wind speed distributions, solar irradiance gradients, and seasonal patterns — to build intuition for why renewable generation (and its price impact) differs across Central Western Europe and the Nordics.

Second, we fit capacity-factor models for DE-LU, where SMARD provides the most granular generation breakdown by technology. These models link ERA5 weather variables to observed wind onshore, wind offshore, and solar PV output, and feed directly into the price decomposition (Stage 2) and risk simulation (Stage 3).

Modelling approach. For each technology we:

Join hourly ERA5 weather to observed SMARD generation and compute capacity factors using BNetzA installed-capacity milestones.
Fit a tidymodels workflow — polynomial weather terms plus month/hour fixed effects (with month × irradiance interactions for solar) — on a temporal 80/20 train/test split.
Evaluate with RMSE, MAE, and R² on the held-out test period.

2.0 Data loading

Show code

# ── Load all available ERA5 zone parquets ────────────────────────────
era5_files <- list.files("data/processed/era5",
                         pattern = "\\.parquet$",
                         full.names = TRUE)

era5_all <- map(era5_files, \(f) {
  read_parquet(f) |>
    mutate(zone = str_remove(basename(f), "\\.parquet$"))
}) |>
  list_rbind()

zones_available <- sort(unique(era5_all$zone))
cat("ERA5 zones loaded:", paste(zones_available, collapse = ", "), "\n")

ERA5 zones loaded: AT, BE, DE-LU, DK, FI, FR, NL, NO, SE

Show code

cat("Total ERA5 rows:", nrow(era5_all), "\n")

Total ERA5 rows: 473472

Show code

# ── Load DE-LU ERA5 and SMARD separately for modelling ──────────────
era5 <- era5_all |> filter(zone == zone_file) |> select(-zone)

smard_path <- list.files("data/processed/smard",
                         pattern = glue("smard_{zone_file}"),
                         full.names = TRUE)
smard <- read_parquet(smard_path)

cat("\nDE-LU ERA5 rows:", nrow(era5),
    " | range:", format(min(era5$datetime_utc)), "–",
    format(max(era5$datetime_utc)), "\n")


DE-LU ERA5 rows: 52608  | range: 2019-12-31 18:00:00 – 2025-12-31 17:00:00

Show code

cat("DE-LU SMARD rows:", nrow(smard),
    " | range:", format(min(smard$datetime_utc)), "–",
    format(max(smard$datetime_utc)), "\n")

DE-LU SMARD rows: 52584  | range: 2020-01-05 23:00:00 – 2026-01-04 22:00:00

2.1 Derived weather features: centred temperature

Temperature drives electricity demand through both heating (cold → high load) and cooling (heat → air conditioning). Rather than separate HDD and CDD variables — which have zero variance in opposite seasons (CDD is always zero in German winters, HDD near zero in summer) — we use a centred transform that has variance year-round:

temp_centered = temperature_2m - 18: negative in heating conditions, positive in cooling conditions.
temp_squared = temp_centered²: captures the U-shaped relationship where both extreme cold and extreme heat drive prices up.

This avoids the collinearity issues that HDD/CDD introduce in per-season decompositions while preserving the same demand-signal information.

Show code

era5_all <- era5_all |>
  mutate(
    temp_centered = temperature_2m - 18,
    temp_squared  = temp_centered^2
  )

# Update DE-LU subset
era5 <- era5_all |> filter(zone == zone_file) |> select(-zone)

# Summary check
era5 |>
  summarise(
    `temp_centered mean` = round(mean(temp_centered), 2),
    `temp_centered min`  = round(min(temp_centered), 1),
    `temp_centered max`  = round(max(temp_centered), 1),
    `temp_squared mean`  = round(mean(temp_squared), 1),
    `Hours T < 18°C (heating)` = sum(temperature_2m < 18),
    `Hours T > 18°C (cooling)` = sum(temperature_2m > 18)
  ) |>
  knitr::kable(caption = "Centred temperature summary (DE-LU, 18 °C base)")

Centred temperature summary (DE-LU, 18 °C base)
temp_centered mean	temp_centered min	temp_centered max	temp_squared mean	Hours T < 18°C (heating)	Hours T > 18°C (cooling)
-7.61	-30.4	16.5	112.4	43741	8867

Show code

era5 |>
  mutate(
    season = case_when(
      month(datetime_utc) %in% c(12, 1, 2) ~ "Winter",
      month(datetime_utc) %in% c(3, 4, 5)  ~ "Spring",
      month(datetime_utc) %in% c(6, 7, 8)  ~ "Summer",
      TRUE                                  ~ "Autumn"
    ) |> factor(levels = c("Winter", "Spring", "Summer", "Autumn"))
  ) |>
  pivot_longer(c(temp_centered, temp_squared), names_to = "variable", values_to = "value") |>
  mutate(variable = recode(variable,
    temp_centered = "temp_centered (T - 18°C)",
    temp_squared  = "temp_squared (T - 18°C)²")) |>
  ggplot(aes(x = value, fill = season)) +
  geom_histogram(bins = 50, alpha = 0.7, position = "identity") +
  facet_wrap(~ variable, scales = "free") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title   = "Centred Temperature Features by Season",
    subtitle = "Both features have variance in every season — no zero-variance issues",
    x = NULL, y = "Count", fill = NULL,
    caption = "Source: Copernicus ERA5 reanalysis"
  )

Figure 1: Seasonal distribution of centred temperature (DE-LU). The linear term is negative in heating conditions and positive in cooling conditions; the squared term captures demand extremes in both directions. Both features have non-zero variance in every season.

3.0 Cross-zone weather profiles

Before narrowing to DE-LU for capacity-factor modelling, we compare the weather resource across all zones for which ERA5 spatial averages are available. These profiles explain much of the cross-zone variation in renewable generation — and ultimately in price dynamics — explored in Stage 2.

3.1 Wind speed distributions

Hub-height (100 m) wind speed distributions differ markedly across bidding zones, reflecting geographic exposure to Atlantic weather systems, coastal effects, and orography. Zones with higher mean wind speeds and heavier right tails support higher average capacity factors but also contribute more generation variability.

Show code

library(plotly)

wind_density <- era5_all |>
  group_by(zone) |>
  group_map(\(df, grp) {
    d <- density(df$wind_speed_100m, from = 0, n = 256)
    tibble(zone = grp$zone, wind_speed = d$x, density = d$y)
  }) |>
  list_rbind()

plot_ly(wind_density, x = ~wind_speed, y = ~density,
        color = ~zone, type = "scatter", mode = "lines",
        fill = "tozeroy", opacity = 0.35,
        hovertemplate = paste0(
          "<b>%{meta}</b><br>",
          "Wind speed: %{x:.1f} m/s<br>",
          "Density: %{y:.4f}<extra></extra>"
        ),
        meta = ~zone) |>
  layout(
    title = list(
      text = "100m wind speed distributions by bidding zone",
      font = list(size = 14)
    ),
    xaxis = list(title = "Wind speed (m/s)"),
    yaxis = list(title = "Density"),
    legend = list(title = list(text = "Zone")),
    annotations = list(
      list(text = "Source: Copernicus ERA5 reanalysis",
           xref = "paper", yref = "paper", x = 1, y = -0.12,
           showarrow = FALSE, font = list(size = 8, color = "grey50"),
           xanchor = "right")
    )
  )

3.2 Seasonal wind profiles

Wind resource shows strong seasonality across all zones — winter months are consistently windier — but the magnitude of the seasonal swing varies geographically. Zones with a large winter-to-summer ratio face more pronounced seasonal generation risk.

Show code

era5_all |>
  mutate(
    season = case_when(
      month(datetime_utc) %in% c(12, 1, 2) ~ "Winter",
      month(datetime_utc) %in% c(3, 4, 5)  ~ "Spring",
      month(datetime_utc) %in% c(6, 7, 8)  ~ "Summer",
      TRUE                                  ~ "Autumn"
    ),
    season = factor(season, levels = c("Winter", "Spring", "Summer", "Autumn"))
  ) |>
  group_by(zone, season) |>
  summarise(
    mean_wind = mean(wind_speed_100m, na.rm = TRUE),
    .groups   = "drop"
  ) |>
  mutate(zone = fct_reorder(zone, mean_wind, .fun = mean, .desc = TRUE)) |>
  ggplot(aes(x = zone, y = mean_wind, fill = season)) +
  geom_col(position = "dodge", alpha = 0.85) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "Mean 100 m wind speed by zone and season",
    subtitle = "Winter dominance is universal but the seasonal swing varies by zone",
    x        = NULL,
    y        = "Mean wind speed (m/s)",
    fill     = NULL,
    caption  = "Source: Copernicus ERA5 reanalysis"
  )

3.3 Solar irradiance gradient

Surface solar irradiance falls with latitude — southern zones receive materially more annual radiation than Nordic zones. This gradient drives differences in solar PV economics and in the relative contribution of solar variability to price risk across the interconnected market.

Show code

library(plotly)

solar_diurnal <- era5_all |>
  mutate(
    hour_utc = hour(datetime_utc),
    season = case_when(
      month(datetime_utc) %in% c(12, 1, 2) ~ "Winter",
      month(datetime_utc) %in% c(3, 4, 5)  ~ "Spring",
      month(datetime_utc) %in% c(6, 7, 8)  ~ "Summer",
      TRUE                                  ~ "Autumn"
    ),
    season = factor(season, levels = c("Winter", "Spring", "Summer", "Autumn"))
  ) |>
  group_by(zone, season, hour_utc) |>
  summarise(mean_ssrd = round(mean(ssrd_wm2, na.rm = TRUE), 1),
            .groups = "drop")

make_panel <- function(df, season_label, show_legend = FALSE) {
  plot_ly(df |> filter(season == season_label),
          x = ~hour_utc, y = ~mean_ssrd, color = ~zone,
          type = "scatter", mode = "lines",
          showlegend = show_legend,
          hovertemplate = paste0(
            "<b>%{meta}</b><br>",
            "Hour: %{x} UTC<br>",
            "Irradiance: %{y:.1f} W/m²<extra></extra>"
          ),
          meta = ~zone) |>
    layout(
      xaxis = list(title = if (season_label %in% c("Summer", "Autumn"))
                             "Hour (UTC)" else ""),
      yaxis = list(title = if (season_label %in% c("Winter", "Summer"))
                             "W/m²" else ""),
      annotations = list(
        list(text = paste0("<b>", season_label, "</b>"),
             xref = "paper", yref = "paper",
             x = 0.5, y = 1.05, showarrow = FALSE,
             font = list(size = 12))
      )
    )
}

subplot(
  make_panel(solar_diurnal, "Winter"),
  make_panel(solar_diurnal, "Spring", show_legend = TRUE),
  make_panel(solar_diurnal, "Summer"),
  make_panel(solar_diurnal, "Autumn"),
  nrows = 2, shareX = TRUE, shareY = TRUE,
  titleX = TRUE, titleY = TRUE
) |>
  layout(
    title = list(
      text = "Mean diurnal solar irradiance by zone and season",
      font = list(size = 14)
    ),
    legend = list(title = list(text = "Zone")),
    annotations = list(
      list(text = "Source: Copernicus ERA5 reanalysis",
           xref = "paper", yref = "paper", x = 1, y = -0.08,
           showarrow = FALSE, font = list(size = 8, color = "grey50"),
           xanchor = "right")
    )
  )

3.4 Cross-zone summary statistics

Show code

era5_all |>
  group_by(zone) |>
  summarise(
    `Wind 100m mean (m/s)` = round(mean(wind_speed_100m, na.rm = TRUE), 2),
    `Wind 100m sd (m/s)`   = round(sd(wind_speed_100m, na.rm = TRUE), 2),
    `Solar mean (W/m²)`    = round(mean(ssrd_wm2, na.rm = TRUE), 1),
    `Solar daylight mean (W/m²)` = round(
      mean(ssrd_wm2[ssrd_wm2 > 0], na.rm = TRUE), 1
    ),
    `Temp mean (°C)`       = round(mean(temperature_2m, na.rm = TRUE), 1),
    `Hours`                = n(),
    .groups = "drop"
  ) |>
  knitr::kable(caption = "ERA5 weather summary statistics by bidding zone")

Table 1: ERA5 weather summary statistics by bidding zone

zone	Wind 100m mean (m/s)	Wind 100m sd (m/s)	Solar mean (W/m²)	Solar daylight mean (W/m²)	Temp mean (°C)	Hours
AT	3.35	1.31	84.4	149.4	8.0	52608
BE	6.54	2.93	76.3	138.5	11.2	52608
DE-LU	5.60	2.17	75.9	132.2	10.4	52608
DK	7.69	3.16	71.1	126.0	9.7	52608
FI	5.89	1.91	57.1	91.7	3.8	52608
FR	5.69	2.14	86.0	147.9	12.1	52608
NL	7.03	3.10	74.4	134.1	11.4	52608
NO	4.81	1.53	56.4	87.7	3.3	52608
SE	5.78	1.79	60.6	96.6	4.6	52608

The remainder of this page focuses on DE-LU, where SMARD provides hourly generation by technology (wind onshore, wind offshore, solar PV) and BNetzA publishes installed-capacity milestones — the data needed to compute capacity factors and fit weather → generation models.

4.0 DE-LU weather profiles

4.1 Distributions

Show code

p_wind100 <- ggplot(era5, aes(x = wind_speed_100m)) +
  geom_histogram(bins = 80, fill = pal["wind_onshore"], alpha = 1.7) +
  labs(title = "100 m wind speed", x = "m/s", y = "Count")

p_wind10 <- ggplot(era5, aes(x = wind_speed_10m)) +
  geom_histogram(bins = 80, fill = pal["wind_offshore"], alpha = 1.7) +
  labs(title = "10 m wind speed", x = "m/s", y = "Count")

p_temp <- ggplot(era5, aes(x = temperature_2m)) +
  geom_histogram(bins = 80, fill = "grey50", alpha = 1.7) +
  labs(title = "2 m temperature", x = "°C", y = "Count")

p_ssrd <- era5 |>
  filter(ssrd_wm2 > 0) |>
  ggplot(aes(x = ssrd_wm2)) +
  geom_histogram(bins = 80, fill = pal["solar"], alpha = 1.7) +
  labs(title = "Surface solar radiation (daylight only)",
       x = expression("W/m"^2), y = "Count")

(p_wind100 + p_wind10) / (p_temp + p_ssrd) +
  plot_annotation(
    title   = glue("ERA5 weather variable distributions — {zone_file}"),
    caption = "Source: Copernicus ERA5 reanalysis"
  )

4.2 Seasonal diurnal profiles

Show code

era5 |>
  mutate(
    hour_utc = hour(datetime_utc),
    season = case_when(
      month(datetime_utc) %in% c(12, 1, 2) ~ "Winter",
      month(datetime_utc) %in% c(3, 4, 5)  ~ "Spring",
      month(datetime_utc) %in% c(6, 7, 8)  ~ "Summer",
      TRUE                                  ~ "Autumn"
    ),
    season = factor(season, levels = c("Winter", "Spring", "Summer", "Autumn"))
  ) |>
  group_by(season, hour_utc) |>
  summarise(
    wind_100m = mean(wind_speed_100m, na.rm = TRUE),
    ssrd      = mean(ssrd_wm2, na.rm = TRUE),
    .groups   = "drop"
  ) |>
  pivot_longer(c(wind_100m, ssrd), names_to = "variable", values_to = "value") |>
  mutate(variable = recode(variable,
                           wind_100m = "Wind speed 100m (m/s)",
                           ssrd      = "Solar radiation (W/m²)")) |>
  ggplot(aes(x = hour_utc, y = value, colour = season)) +
  geom_line(linewidth = 0.9) +
  geom_point(size = 1.2) +
  facet_wrap(~ variable, scales = "free_y") +
  scale_colour_brewer(palette = "Set2") +
  labs(
    title   = glue("Mean diurnal profiles by season — {zone_file}"),
    x       = "Hour (UTC)",
    y       = NULL,
    colour  = NULL,
    caption = "Source: Copernicus ERA5 reanalysis"
  )

4.3 Timestamp alignment check

ERA5 ssrd is originally an accumulated field (J/m² per forecast step). Before modelling we verify that the converted instantaneous values align temporally with SMARD solar generation — a 1-hour offset would produce misleading scatter plots and biased models.

Show code

era5 |>
  inner_join(smard, by = "datetime_utc") |>
  filter(month(datetime_utc) == 6) |>
  mutate(hour_utc = hour(datetime_utc)) |>
  filter(hour_utc %in% 4:20) |>
  group_by(hour_utc) |>
  summarise(
    mean_ssrd  = mean(ssrd_wm2, na.rm = TRUE),
    mean_solar = mean(solar, na.rm = TRUE),
    .groups    = "drop"
  ) |>
  pivot_longer(-hour_utc, names_to = "source", values_to = "value") |>
  mutate(source = recode(source,
                         mean_ssrd  = "ERA5 solar radiation",
                         mean_solar = "SMARD solar generation")) |>
  ggplot(aes(x = hour_utc, y = value, colour = source)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  facet_wrap(~ source, scales = "free_y") +
  labs(
    title    = "Timestamp alignment: ERA5 vs SMARD (June)",
    subtitle = "Peaks at the same UTC hour confirm correct alignment",
    x        = "Hour (UTC)",
    y        = NULL,
    colour   = NULL,
    caption  = "Sources: Copernicus ERA5, SMARD (BNetzA)"
  ) +
  theme(legend.position = "none")

5.0 Join & capacity factors

Show code

model_raw <- era5 |>
  inner_join(smard, by = "datetime_utc") |>
  mutate(
    date  = as_date(datetime_utc),
    month = factor(month(datetime_utc, label = TRUE), ordered = FALSE),
    hour  = factor(hour(datetime_utc)),
    year  = year(datetime_utc),
    season = case_when(
      month(datetime_utc) %in% c(12, 1, 2) ~ "Winter",
      month(datetime_utc) %in% c(3, 4, 5)  ~ "Spring",
      month(datetime_utc) %in% c(6, 7, 8)  ~ "Summer",
      TRUE                                  ~ "Autumn"
    ),
    season = factor(season, levels = c("Winter", "Spring", "Summer", "Autumn"))
  )

cat("Joined rows:", nrow(model_raw), "\n")

Joined rows: 52489

Computing capacity factors requires dividing observed generation (MW) by installed capacity (MW). SMARD does not include an installed-capacity column, so we construct a lookup from BNetzA / Marktstammdatenregister year-end published totals and interpolate linearly to daily resolution.

Show code

# ── Year-end installed capacity (MW) — BNetzA monitoring reports ─────
# Sources: BNetzA Monitoringbericht, SMARD "Installed generation capacity"
# These are approximate — replace with exact values if available.

installed_cap <- tribble(
  ~date,            ~cap_wind_onshore, ~cap_wind_offshore, ~cap_solar,
  "2019-12-31",     53200,             7700,               49100,
  "2020-12-31",     54400,             7770,               53700,
  "2021-12-31",     55900,             7790,               58700,
  "2022-12-31",     58100,             8050,               66500,
  "2023-12-31",     61000,             8390,               74100,
  "2024-12-31",     63800,             8900,               81700,
  "2025-06-30",     65500,             9200,               86000   # provisional
) |>
  mutate(date = as_date(date))

# ── Linearly interpolate to daily resolution ─────────────────────────
date_seq <- tibble(date = seq(min(installed_cap$date),
                              max(installed_cap$date),
                              by = "day"))

cap_daily <- date_seq |>
  left_join(installed_cap, by = "date") |>
  mutate(across(starts_with("cap_"),
                ~ approx(x = date[!is.na(.x)],
                          y = .x[!is.na(.x)],
                          xout = date)$y))

Show code

model_data <- model_raw |>
  left_join(cap_daily, by = "date") |>
  mutate(
    cf_wind_onshore  = wind_onshore  / cap_wind_onshore,
    cf_wind_offshore = wind_offshore / cap_wind_offshore,
    cf_solar         = solar         / cap_solar
  ) |>
  # Solar CF is only meaningful during daylight; set nighttime to NA
  mutate(cf_solar = if_else(ssrd_wm2 > 0 | solar > 0,
                            cf_solar,
                            NA_real_))

cat("Observations with valid CF:",
    sum(!is.na(model_data$cf_wind_onshore)), "(wind on),",
    sum(!is.na(model_data$cf_wind_offshore)), "(wind off),",
    sum(!is.na(model_data$cf_solar)), "(solar)\n")

Observations with valid CF: 48073 (wind on), 48073 (wind off), 44167 (solar)

6.0 Correlation structure

Show code

cor_vars <- model_data |>
  select(
    `Wind 100m`   = wind_speed_100m,
    `Wind 10m`    = wind_speed_10m,
    `Temp 2m`     = temperature_2m,
    `Solar rad`   = ssrd_wm2,
    `CF onshore`  = cf_wind_onshore,
    `CF offshore` = cf_wind_offshore,
    `CF solar`    = cf_solar
  ) |>
  drop_na()

cor_mat <- cor(cor_vars)

cor_long <- cor_mat |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "r") |>
  mutate(
    var1 = factor(var1, levels = colnames(cor_mat)),
    var2 = factor(var2, levels = colnames(cor_mat))
  ) |>
  filter(as.integer(var1) >= as.integer(var2))

ggplot(cor_long, aes(x = var2, y = var1, fill = r)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = sprintf("%.2f", r)), size = 3.2) +
  scale_fill_gradient2(low = "#2166AC", mid = "white", high = "#B2182B",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(
    title    = "Weather–generation correlation matrix",
    subtitle = glue("{zone_file}, hourly observations"),
    fill     = "Pearson r",
    caption  = "Sources: Copernicus ERA5, SMARD (BNetzA)"
  ) +
  coord_equal() +
  theme(axis.title = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1))

7.0 Scatter plots: weather → capacity factor

7.1 Wind speed → wind onshore CF

Show code

model_data |>
  filter(!is.na(cf_wind_onshore)) |>
  slice_sample(n = 15000) |>
  ggplot(aes(x = wind_speed_100m, y = cf_wind_onshore)) +
  geom_point(alpha = 0.15, colour = pal["wind_onshore"], size = 0.5) +
  geom_smooth(method = "lm",
              formula = y ~ poly(x, 3),
              colour = "black", linewidth = 0.8, se = TRUE) +
  scale_y_continuous(labels = scales::percent_format(),
                     limits = c(0, 1)) +
  labs(
    title    = "Wind speed → onshore capacity factor",
    subtitle = glue("Degree-3 polynomial fit | {zone_file} hourly"),
    x        = "ERA5 100 m wind speed (m/s)",
    y        = "Capacity factor",
    caption  = "Sources: Copernicus ERA5, SMARD (BNetzA)"
  )

7.2 Solar irradiance → solar PV CF

The marginal relationship between irradiance and solar CF appears non-monotonic when plotted unconditionally. This is not a modelling problem — it reflects the mixing of seasonal regimes. At the same irradiance level (e.g. 100 W/m²), winter midday produces a high CF (that is as good as the sun gets in January) while summer cloud cover at the same irradiance implies suppressed output relative to clear-sky potential. The hour and month fixed effects in the model capture this conditioning; the faceted plots below confirm monotonic relationships within time-of-day strata.

Show code

solar_plot_df <- model_data |>
  filter(!is.na(cf_solar), ssrd_wm2 > 5)

p_marginal <- solar_plot_df |>
  slice_sample(n = 15000) |>
  ggplot(aes(x = ssrd_wm2, y = cf_solar)) +
  geom_point(alpha = 0.15, colour = pal["solar"], size = 0.5) +
  geom_smooth(method = "gam", formula = y ~ s(x),
              colour = "black", linewidth = 0.8, se = TRUE) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Marginal (unconditional)",
    x     = expression("W/m"^2),
    y     = "Capacity factor"
  )

p_faceted <- solar_plot_df |>
  mutate(hour_bin = cut(hour(datetime_utc),
                        breaks = seq(4, 22, by = 2),
                        labels = paste0(seq(4, 20, by = 2), "–",
                                        seq(6, 22, by = 2)),
                        include.lowest = TRUE)) |>
  filter(!is.na(hour_bin)) |>
  slice_sample(n = 20000) |>
  ggplot(aes(x = ssrd_wm2, y = cf_solar)) +
  geom_point(alpha = 0.16, colour = pal["solar"], size = 0.3) +
  geom_smooth(method = "gam", formula = y ~ s(x),
              colour = "black", linewidth = 0.7) +
  facet_wrap(~ hour_bin, ncol = 3) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Conditioned on time of day (UTC)",
    x     = expression("W/m"^2),
    y     = "Capacity factor"
  )

p_marginal | p_faceted +
  plot_annotation(
    title   = "Solar irradiance → PV capacity factor",
    caption = "Sources: Copernicus ERA5, SMARD (BNetzA)"
  )

7.3 Midday deep dive: season resolves the remaining non-monotonicity

Even within the 10–12 UTC bin, the relationship appears non-monotonic. Faceting by season confirms this is a solar geometry effect: the same irradiance value implies very different things in winter versus summer.

Show code

model_data |>
  filter(!is.na(cf_solar), ssrd_wm2 > 5,
         hour(datetime_utc) %in% 10:11) |>
  ggplot(aes(x = ssrd_wm2, y = cf_solar)) +
  geom_point(alpha = 0.18, colour = pal["solar"], size = 0.5) +
  geom_smooth(method = "gam", formula = y ~ s(x),
              colour = "black", linewidth = 0.8) +
  facet_wrap(~ season) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title    = "10–12 UTC: irradiance → CF by season",
    subtitle = "Within-season monotonicity confirms need for month × irradiance interactions",
    x        = expression("W/m"^2),
    y        = "Capacity factor",
    caption  = "Sources: Copernicus ERA5, SMARD (BNetzA)"
  )

8.0 Capacity factor models

We fit one tidymodels workflow per technology using a temporal 80/20 train/test split (no shuffling — critical for time-series validity). Wind models use polynomial wind-speed terms with month/hour fixed effects. The solar model includes month × irradiance interaction terms to capture the seasonal variation in the irradiance–CF slope identified in the EDA above.

8.1 Wind onshore

Show code

wind_on_df <- model_data |>
  filter(!is.na(cf_wind_onshore)) |>
  select(datetime_utc, cf_wind_onshore,
         wind_speed_100m, wind_speed_10m, temperature_2m,
         month, hour)

split_wind_on <- initial_time_split(wind_on_df, prop = 0.8)
cat("Train:", nrow(training(split_wind_on)),
    "| Test:", nrow(testing(split_wind_on)), "\n")

Train: 38458 | Test: 9615

Show code

rec_wind_on <- recipe(
  cf_wind_onshore ~ wind_speed_100m + wind_speed_10m +
                     temperature_2m + month + hour,
  data = training(split_wind_on)
) |>
  step_poly(wind_speed_100m, degree = 3) |>
  step_poly(wind_speed_10m,  degree = 2) |>
  step_dummy(all_nominal_predictors())

wf_wind_on <- workflow() |>
  add_recipe(rec_wind_on) |>
  add_model(linear_reg())

fit_wind_on <- fit(wf_wind_on, data = training(split_wind_on))

preds_wind_on <- fit_wind_on |>
  augment(testing(split_wind_on))

metrics_wind_on <- preds_wind_on |>
  metrics(truth = cf_wind_onshore, estimate = .pred)

metrics_wind_on |>
  knitr::kable(digits = 4, caption = "Wind onshore model — test set performance")

Wind onshore model — test set performance
.metric	.estimator	.estimate
rmse	standard	0.0416
rsq	standard	0.9167
mae	standard	0.0307

8.2 Wind offshore

Show code

wind_off_df <- model_data |>
  filter(!is.na(cf_wind_offshore)) |>
  select(datetime_utc, cf_wind_offshore,
         wind_speed_100m, wind_speed_10m, temperature_2m,
         month, hour)

split_wind_off <- initial_time_split(wind_off_df, prop = 0.8)

rec_wind_off <- recipe(
  cf_wind_offshore ~ wind_speed_100m + wind_speed_10m +
                      temperature_2m + month + hour,
  data = training(split_wind_off)
) |>
  step_poly(wind_speed_100m, degree = 3) |>
  step_poly(wind_speed_10m,  degree = 2) |>
  step_dummy(all_nominal_predictors())

wf_wind_off <- workflow() |>
  add_recipe(rec_wind_off) |>
  add_model(linear_reg())

fit_wind_off <- fit(wf_wind_off, data = training(split_wind_off))

preds_wind_off <- fit_wind_off |>
  augment(testing(split_wind_off))

metrics_wind_off <- preds_wind_off |>
  metrics(truth = cf_wind_offshore, estimate = .pred)

metrics_wind_off |>
  knitr::kable(digits = 4, caption = "Wind offshore model — test set performance")

Wind offshore model — test set performance
.metric	.estimator	.estimate
rmse	standard	0.1852
rsq	standard	0.2855
mae	standard	0.1507

Offshore wind model — known limitation

The offshore wind model achieves an R² of only ~0.29 on the test set — substantially weaker than onshore wind (R² ~0.92) or solar PV (R² ~0.84). This is a spatial resolution problem, not a specification issue: ERA5 wind speeds are averaged over the entire DE-LU zone using cosine-latitude weighting, which dilutes the coastal and maritime wind signals that drive offshore turbine output. Offshore farms in the German North Sea and Baltic experience wind regimes that differ meaningfully from the zone-average conditions captured by the ERA5 spatial mean.

Downstream implications. This model is used in Stage 3 to predict offshore capacity factors for the Monte Carlo revenue simulation. At R² = 0.29, the predictions are essentially noisy — they will understate the true range of offshore generation outcomes. We flag this as a known limitation of the current data pipeline. A targeted improvement would be to extract ERA5 grid cells overlapping German offshore wind farm clusters (e.g. the Helgoland/Borkum cluster in the North Sea) rather than using the full DE-LU spatial average. For this iteration, the Stage 3 portfolio analysis focuses on onshore wind where the CF model is reliable.

8.3 Solar PV

The solar recipe includes month × irradiance interactions so that the irradiance–CF slope can vary by month — physically motivated by the seasonal change in solar geometry (sun angle, day length) identified in Section 8.3. Temperature is included to capture the ~0.4%/°C efficiency loss above standard test conditions (25 °C).

Show code

solar_df <- model_data |>
  filter(!is.na(cf_solar), ssrd_wm2 > 0) |>
  select(datetime_utc, cf_solar,
         ssrd_wm2, temperature_2m,
         month, hour)

split_solar <- initial_time_split(solar_df, prop = 0.8)

rec_solar <- recipe(
  cf_solar ~ ssrd_wm2 + temperature_2m + month + hour,
  data = training(split_solar)
) |>
  step_dummy(all_nominal_predictors()) |>
  step_interact(terms = ~ ssrd_wm2:starts_with("month_"))

wf_solar <- workflow() |>
  add_recipe(rec_solar) |>
  add_model(linear_reg())

fit_solar <- fit(wf_solar, data = training(split_solar))

preds_solar <- fit_solar |>
  augment(testing(split_solar))

metrics_solar <- preds_solar |>
  metrics(truth = cf_solar, estimate = .pred)

metrics_solar |>
  knitr::kable(digits = 4, caption = "Solar PV model — test set performance")

Solar PV model — test set performance
.metric	.estimator	.estimate
rmse	standard	0.0663
rsq	standard	0.8407
mae	standard	0.0515

9.0 Model diagnostics

Two diagnostic panels per technology: predicted versus actual capacity factor (a well-calibrated model hugs the 1:1 line) and the residual distribution (centred near zero with no heavy skew indicates the model specification is adequate).

Show code

diag_panels <- function(preds, truth_col, label, colour) {
  n_plot <- min(8000, nrow(preds))

  p_scatter <- preds |>
    slice_sample(n = n_plot) |>
    ggplot(aes(x = .data[[truth_col]], y = .pred)) +
    geom_point(alpha = 0.15, colour = colour, size = 0.5) +
    geom_abline(slope = 1, intercept = 0,
                linetype = "dashed", colour = "grey30") +
    scale_x_continuous(labels = scales::percent_format()) +
    scale_y_continuous(labels = scales::percent_format()) +
    coord_equal(xlim = c(0, 1), ylim = c(0, 1)) +
    labs(title = label, x = "Actual CF", y = "Predicted CF")

  p_resid <- preds |>
    mutate(resid = .pred - .data[[truth_col]]) |>
    ggplot(aes(x = resid)) +
    geom_histogram(bins = 80, fill = colour, alpha = 0.7) +
    geom_vline(xintercept = 0, linetype = "dashed", colour = "grey30") +
    labs(title = paste(label, "— residuals"),
         x = "Residual (pred − actual)", y = "Count")

  p_scatter + p_resid
}

p1 <- diag_panels(preds_wind_on,  "cf_wind_onshore",  "Wind onshore",  pal["wind_onshore"])
p2 <- diag_panels(preds_wind_off, "cf_wind_offshore", "Wind offshore", pal["wind_offshore"])
p3 <- diag_panels(preds_solar,    "cf_solar",         "Solar PV",      pal["solar"])

(p1 / p2 / p3) +
  plot_annotation(
    title   = glue("Capacity factor model diagnostics — {zone_file} test set"),
    caption = "Sources: Copernicus ERA5, SMARD (BNetzA)"
  )

9.1 Residual autocorrelation

The scatter and histogram diagnostics above assess bias and distributional shape, but not temporal dependence. If model residuals are strongly autocorrelated, it indicates systematic patterns the specification is missing — for instance, multi-day weather persistence or seasonal ramp events that the polynomial + fixed effects structure cannot capture.

We plot the ACF and PACF for the wind onshore and solar PV test-set residuals (offshore is omitted given its low R²). Significant autocorrelation at short lags (1–6 hours) is expected — weather persistence is physical — but decay beyond ~24 hours would suggest the model is adequate for the hourly predictions used in Stage 3.

Show code

acf_plot <- function(preds, truth_col, label, colour, max_lag = 72) {
  resids <- preds |>
    arrange(datetime_utc) |>
    mutate(resid = .pred - .data[[truth_col]]) |>
    pull(resid)

  acf_vals  <- acf(resids, lag.max = max_lag, plot = FALSE)
  pacf_vals <- pacf(resids, lag.max = max_lag, plot = FALSE)

  ci <- qnorm(0.975) / sqrt(length(resids))

  p_acf <- tibble(lag = acf_vals$lag[-1], acf = acf_vals$acf[-1]) |>
    ggplot(aes(x = lag, y = acf)) +
    geom_hline(yintercept = c(-ci, ci), linetype = "dashed", colour = "grey50") +
    geom_hline(yintercept = 0, colour = "grey30") +
    geom_segment(aes(xend = lag, yend = 0), colour = colour, linewidth = 0.5) +
    labs(title = paste(label, "— ACF"), x = "Lag (hours)", y = "ACF")

  p_pacf <- tibble(lag = pacf_vals$lag, pacf = pacf_vals$acf) |>
    ggplot(aes(x = lag, y = pacf)) +
    geom_hline(yintercept = c(-ci, ci), linetype = "dashed", colour = "grey50") +
    geom_hline(yintercept = 0, colour = "grey30") +
    geom_segment(aes(xend = lag, yend = 0), colour = colour, linewidth = 0.5) +
    labs(title = paste(label, "— PACF"), x = "Lag (hours)", y = "PACF")

  p_acf + p_pacf
}

p_acf_wind <- acf_plot(preds_wind_on, "cf_wind_onshore", "Wind onshore", pal["wind_onshore"])
p_acf_solar <- acf_plot(preds_solar, "cf_solar", "Solar PV", pal["solar"])

(p_acf_wind / p_acf_solar) +
  plot_annotation(
    title   = glue("Residual autocorrelation — {zone_file} CF models (test set)"),
    subtitle = "Dashed lines = 95% confidence bounds under white noise null",
    caption = "Sources: Copernicus ERA5, SMARD (BNetzA)"
  )

Figure 11: Residual ACF and PACF for the wind onshore and solar PV capacity factor models (test set). Significant short-lag autocorrelation reflects physical weather persistence; rapid decay beyond 24–48 hours indicates the model captures the dominant temporal structure.

Interpreting the ACF/PACF

Significant autocorrelation at lags 1–6 is expected and not concerning for this application: weather systems persist for hours, so a model that conditions on instantaneous weather will inevitably leave correlated residuals. What matters for downstream use is whether the residuals exhibit systematic multi-day structure. If the ACF decays to near zero by lag 48, the model is capturing the dominant temporal features and the remaining autocorrelation reflects irreducible weather persistence. If the PACF shows spikes at 24-hour intervals, a diurnal component may be under-specified — but this is unlikely given the hour fixed effects already in the recipe.

10.0 Save fitted workflows

The fitted tidymodels workflow objects are serialised to models/ for reuse in Stage 2 (price decomposition) and Stage 3 (risk simulation). Each .rds file contains the full preprocessing recipe plus fitted coefficients — predict(readRDS(...), new_data) is all downstream stages need.

Show code

dir.create("models", showWarnings = FALSE)

saveRDS(fit_wind_on,  glue("models/cf_wind_onshore_{zone_file}.rds"))
saveRDS(fit_wind_off, glue("models/cf_wind_offshore_{zone_file}.rds"))
saveRDS(fit_solar,    glue("models/cf_solar_{zone_file}.rds"))

cat(glue("Saved 3 workflow objects to models/ ({zone_file})"), "\n")

Saved 3 workflow objects to models/ (DE-LU)

11.0 Key findings

Show code

bind_rows(
  metrics_wind_on  |> mutate(technology = "Wind onshore"),
  metrics_wind_off |> mutate(technology = "Wind offshore"),
  metrics_solar    |> mutate(technology = "Solar PV")
) |>
  select(technology, .metric, .estimate) |>
  pivot_wider(names_from = .metric, values_from = .estimate) |>
  knitr::kable(
    digits = 4,
    caption = glue("Summary of capacity factor model performance on the temporal test set ({zone_file})")
  )

Summary of capacity factor model performance on the temporal test set (DE-LU)
technology	rmse	rsq	mae
Wind onshore	0.0416	0.9167	0.0307
Wind offshore	0.1852	0.2855	0.1507
Solar PV	0.0663	0.8407	0.0515

Cross-zone context. The ERA5 weather profiles in Section 4 reveal substantial geographic variation in renewable resource quality. Wind speeds are highest in coastal and Nordic zones, while solar irradiance follows a clear latitude gradient. These physical differences drive the cross-zone variation in renewable generation patterns — and ultimately in the price dynamics — explored in Stage 2.

Derived weather features. Centred temperature features (Section 3) translate ERA5 temperature into demand-relevant signals for Stage 2. Using temp_centered (T - 18°C) and temp_squared rather than separate HDD/CDD ensures non-zero variance in every season, avoiding collinearity issues in per-season decompositions. Germany’s heating-dominated climate means the negative (heating) side of the distribution is substantially wider than the positive (cooling) side.

DE-LU capacity factor models. ERA5 100 m wind speed is the dominant predictor of wind generation capacity factors, with the cubic polynomial capturing the characteristic ramp-up and saturation of a turbine power curve. Solar irradiance maps onto PV capacity factors with a slope that varies by month — an interaction effect driven by solar geometry (sun angle and day length). The month × irradiance terms resolve the apparent non-monotonicity visible in the unconditional scatter plot (Section 8.2), confirming that the bimodal pattern reflects seasonal regime mixing rather than a data quality issue. Temperature adds a modest correction for PV cell-efficiency losses above standard test conditions (25 °C).

Offshore wind limitation. The offshore CF model (R² ~0.29) is substantially weaker than onshore or solar, due to the spatial averaging of ERA5 over the full DE-LU zone diluting coastal wind signals. This is a known limitation flagged in Section 9.2; the Stage 3 portfolio analysis focuses on onshore wind where the CF model is reliable.

Residual autocorrelation. The ACF/PACF analysis (Section 11) confirms that short-lag autocorrelation in CF model residuals reflects physical weather persistence rather than model misspecification. Rapid decay beyond 24–48 hours indicates the polynomial + fixed effects specification captures the dominant temporal structure.

These fitted workflows feed directly into Stage 2, where weather-driven generation predictions are used as features in the day-ahead price decomposition model.