Skip to content

danareu/ts_clustering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

41 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ts_clustering

Time-Series Clustering Framework for GENeSYS-MOD

Aggregates hourly temporal profiles (e.g., solar PV capacity factors, wind capacity factors, electricity demand) into a set of representative days. Importantly, the input data must follow the given format as outlined in GENeSYS-MOD.data**


Table of Contents


Background

Energy system models like GENeSYS-MOD require full-year hourly profiles (8,760 time steps) as inputs. Solving an optimisation at this resolution is computationally expensive. Time-series clustering reduces those 8,760 hours to a small set of k representative periods while preserving key statistical properties of the original profiles.


Features

Feature Options
Clustering algorithm Hierarchical – Ward's linkage, Complete linkage
Distance metric Euclidean, Dynamic Time Warping (DTW)
DTW implementation FastDTW (approximate), Dynamic Programming (exact)
Normalisation z-score (optional)
Visualisation Interactive plots via PlotlyJS
Input format XLSX
Configuration YAML

Repository Structure

ts_clustering/
├── TSClustering.jl          # Module entry point
├── Project.toml             # Dependency declarations
├── Manifest.toml            # Fully-resolved lockfile
├── src/
│   ├── clustering.jl        # Core clustering logic
│   └── plot_results.jl      # Visualisation (PlotlyJS)
├── utils/
│   ├── load_data.jl         # XLSX data ingestion
│   ├── datastructs.jl       # Custom types and data structures
│   └── post_processing.jl   # Representative profile extraction & weighting
└── data/                    # Input time-series files (XLSX)

Installation

Prerequisites: Julia 1.8+

# 1. Clone the repo
# git clone https://github.com/danareu/ts_clustering.git
# cd ts_clustering

# 2. Activate the environment and install dependencies
using Pkg
Pkg.activate(".")
Pkg.instantiate()

Configuration

Runs are controlled by a YAML configuration file:

Country_Data_Entries: # the attributes in the xlsx files for which representative periods should be found
  - TS_LOAD
  - TS_PV_AVG          
Clustering: # the attributes in the xlsx files that should be used for the clustering algorithm
  - TS_LOAD
  - TS_PV_AVG
countries: # the country profiles for each attribute that should be considered
  - DE
SCTOLERANCE: 10.0e-6 # the max. accuracy between the mean of the aggregated and the original time-series
Load: # the profiles that represent a load and are therefore normalized 
  - TS_LOAD

Usage

using Pkg; Pkg.activate(".")

include("TSClustering.jl")
using .TSClustering

Step-by-step pipeline

1. Load data

# Hourly profiles are read. A dictionary with the time-series attributes is generated using key_mappings which maps the sheet names to attributes 𝓣
hourly_data = XLSX.readxlsx(joinpath(inputdir, hourly_data_file * ".xlsx"))
CountryData = Dict(t => DataFrame(XLSX.gettable(hourly_data[keys_mapping[t]])) for t  𝓣)

2. Normalise & build the clustering matrix

data = TSClustering.normalize_data(config=config, CountryData=CountryData)

# With PCA (if pca_path is set):
pca_res = TSClustering.derive_principal_components(config=config, CountryData=CountryData, technology=pca_ts)
data_clustering = TSClustering.create_clustering_matrix(
    technology=["PCA"],
    CountryData=Dict("PCA" => DataFrame(Matrix(pca_res), :auto))
)

# Without PCA — seasonal profiles excluded from the clustering matrix:
data_clustering = TSClustering.create_clustering_matrix(technology=setdiff(𝓣, seasonal), CountryData=data)

3. Compute the distance matrix

D = TSClustering.define_distance(w=warping_window, data_clustering=data_clustering, fast_dtw=false)

4. Cluster

# Hierarchical Ward:
result = hclust(D, linkage=:ward)
cl = cutree(result, k=clusters)

# K-Means (when "Kmeans" ∈ resultdir):
Random.Seed(1)
R = kmeans(data_clustering, clusters; maxiter=200, display=:iter)
cl = assignments(R)

5. Compute representative profiles

Three methods are available, controlled by hoffmann and resultdir:

# Hoffmann — optimised hourly distribution per cluster:
sc1 = TSClustering.calculate_representative_value_distribution(
    data_org=filter(kv -> kv[1]  seasonal, CountryData), cl=cl, config=config, K=clusters
)

# Medoid — most central day per cluster (default):
cluster_dict_org = TSClustering.calculate_medoid(
    data_org=CountryData, cl=cl, config=config, K=clusters, technology=𝓣
)
sc = TSClustering.scaling(
    data_org=CountryData, scaled_clusters=cluster_dict_org,
    k=clusters, weights=weights, config=config, technology=𝓣
)

# Centroid — mean profile per cluster:
cluster_dict_org = TSClustering.calculate_centroid(
    data_org=CountryData, cl=cl, config=config, K=clusters, technology=𝓣
)

Seasonal profiles: Technologies listed in seasonal are always represented via medoid and are excluded from the DTW clustering step. They are handled separately via TSClustering.scaling after the main clustering is done.

Heat pumps: HLR_Heatpump_Aerial and HLR_Heatpump_Ground are written to TimeDepEfficiency rather than CapacityFactor, reflecting their time-dependent COP.

6. Write outputs

Set write_reduced_timeserie = 1 to write results to:

{inputdir}/input_reduced_timeseries_{k}_{resultdir_stem}.xlsx

Sheets written: SpecifiedDemandProfile, CapacityFactor, TimeDepEfficiency, YearSplit.


Outputs

The pipeline returns the following JuMP DenseAxisArray objects:

Return value Axes Description
SpecifiedDemandProfile [Region, Fuel, Timeslice, Year] Normalised hourly demand profiles; each cluster's 24 values sum to weights[k] / 365
CapacityFactor [Region, Technology, Timeslice, Year] Hourly capacity factors for renewable technologies
TimeDepEfficiency [Region, Technology, Timeslice, Year] Time-dependent COP for heat pumps
YearSplit [Timeslice, Year] Fraction of the year per timeslice (weights[k] / 8760, repeated 24× per cluster)
cl Vector{Int} Cluster assignment for each of the 365 input days
weights Dict{Int, Int} Number of days assigned to each cluster (sums to 365)

Methodology

Hierarchical clustering (default)

Agglomerative clustering with Ward's minimum-variance linkage via Clustering.jl. A full 365×365 DTW distance matrix is computed once and the resulting tree is cut at depth k.

Dynamic Time Warping

DTW allows non-linear temporal alignment, making it more robust than Euclidean distance for renewable profiles that are structurally similar but temporally shifted. warping_window constrains the maximum allowed shift, balancing accuracy against compute time. DTW is implemented via DynamicAxisWarping.jl.

Representative value methods

  • Medoid — selects the actual day closest to each cluster centre; always produces physically plausible profiles.
  • Centroid — averages all days in a cluster; smoother but can produce unrealistic values.
  • Hoffmann — selects the representative which finds the hourly distribution that best preserves the statistical properties of the original data. The method was published in https://www.sciencedirect.com/science/article/abs/pii/S0306261922004342

Demand profile normalisation

SpecifiedDemandProfile values within each cluster are normalised so the 24 hourly values sum to weights[k] / 365, preserving each cluster's correct share of annual energy. Any NaN values from zero-sum clusters are set to zero with a warning.


Dependencies

Package Role
Clustering.jl hclust, cutree, kmeans
Distances.jl Distance metric primitives
DynamicAxisWarping.jl DTW distance computation
DataFrames.jl Tabular data manipulation
XLSX.jl Input/output Excel files
YAML.jl Configuration parsing
MultivariateStats.jl PCA
PlotlyJS.jl Interactive result visualisation
Statistics.jl Mean, std, etc.
DelimitedFiles.jl Text file I/O
Dates.jl Runtime benchmarking

About

Time-Series Clustering Framework to aggregate the input data for the energy system model GENeSYS-MOD

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages