Companion repository for the dataset paper SURGE: an event-centric social media benchmark pairing sentiment time series with bin-aligned text and reply / repost interaction structure.
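The dataset covers two paper targets: Discussion Intensity (DI, stored as comment_count) and Sentiment Polarity (SP, stored as sentiment_polarity). Released artifacts are laid out as follows: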
data/events/<event_name>_<granularity>/
├── comment_count.csv # DI, raw (NaN for empty bins)
├── comment_count_normalized.csv # DI, per-event z-score (NaN preserved)
├── sentiment_polarity.csv # SP, raw
├── sentiment_polarity_normalized.csv # SP, per-event z-score
├── normalization.json # train-split-only statistics
└── text_view.jsonl # per-bin top-3 main + earliest-2 replies (post-IDs)
data/events/<event_name>/
├── edges.jsonl # reply / repost edges (post-IDs, ISO times)
└── post_id_lookup.jsonl # post_id -> (platform, url) for hydration
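The relationship between the raw files, the `*_normalized.csv` files, and `normalization.json` can be illustrated with a minimal sketch. It assumes the z-score statistics are computed on the observed (non-NaN) bins of the first 70% of each event and then applied to the whole series; the function name `train_split_zscore`, the population standard deviation, and the `mean` / `std` keys are hypothetical choices for illustration, not the repository's actual API or JSON schema (see schema/ for the real field definitions).

```python
import math


def train_split_zscore(values, train_pct=70):
    """Z-score a per-event series using statistics from the train segment only.

    Hypothetical helper: NaN bins are skipped when estimating the statistics
    and remain NaN in the output.
    """
    train_end = len(values) * train_pct // 100  # integer arithmetic, no float rounding
    observed = [v for v in values[:train_end] if not math.isnan(v)]
    if not observed:
        raise ValueError("train segment has no observed bins")

    mean = sum(observed) / len(observed)
    var = sum((v - mean) ** 2 for v in observed) / len(observed)  # population variance
    std = math.sqrt(var) or 1.0  # guard against a constant train segment

    # NaN propagates through the arithmetic, so empty bins stay NaN.
    normalized = [(v - mean) / std for v in values]
    return normalized, {"mean": mean, "std": std}


if __name__ == "__main__":
    nan = math.nan
    raw = [1.0, nan, 3.0, 5.0, nan, 7.0, 9.0, 2.0, 4.0, 6.0]
    z, stats = train_split_zscore(raw)
    print(stats)
    print(z)
```

Estimating the statistics on the train segment alone keeps validation and test values from influencing the scale applied to the model inputs.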
Bins with zero observed posts are encoded as NaN. The benchmark
loader performs split-internal forward fill followed by backward fill
within each chronological 70 / 10 / 20 segment, so no imputation
reference crosses a split boundary. Field-level documentation lives
in schema/.
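The split-internal imputation described above can be sketched as follows. This is an illustrative stand-alone version, not the code in benchmark/data_loader.py: the helper names `fill_segment` and `impute_within_splits` are hypothetical, and the boundary rule (integer floor of 70% and 10% of the series length) is an assumption about how the chronological split is cut.

```python
import math


def fill_segment(values):
    """Forward fill, then backward fill, using only values inside this segment."""
    out = list(values)

    last = math.nan
    for i, v in enumerate(out):  # forward fill: carry the last observed value
        if math.isnan(v):
            out[i] = last
        else:
            last = v

    nxt = math.nan
    for i in range(len(out) - 1, -1, -1):  # backward fill: only leading gaps remain
        if math.isnan(out[i]):
            out[i] = nxt
        else:
            nxt = out[i]
    return out


def impute_within_splits(values, pcts=(70, 10, 20)):
    """Impute each chronological segment independently (assumed 70 / 10 / 20 cut)."""
    n = len(values)
    train_end = n * pcts[0] // 100
    val_end = train_end + n * pcts[1] // 100
    bounds = [(0, train_end), (train_end, val_end), (val_end, n)]

    filled = []
    for start, stop in bounds:
        filled.extend(fill_segment(values[start:stop]))
    return filled


if __name__ == "__main__":
    nan = math.nan
    series = [nan, 1.0, nan, nan, 2.0, nan, 3.0, nan, nan, 5.0]
    print(impute_within_splits(series))
```

In this example the single validation bin has no observed value inside its own segment, so it stays NaN rather than borrowing the last train value or the first test value. How the real loader treats such all-empty segments is not specified here.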
The release covers 67 events at 6H, 64 events at 12H, and 55
events at 1D. The full event registry is
data/events/event_metadata.json, also exposed programmatically
through event_config.py.
pip install -r requirements.txt
# Inspect the registry
python -c "from event_config import get_real_events; print(len(get_real_events()))"
# 67
# Load a granularity into the standard pooled train / val / test loaders
python -c "
from benchmark.data_loader import create_dataloaders
train, val, test, meta = create_dataloaders(
data_dir='data/events', interval='1D',
variable='sentiment_polarity', seq_len=14, pred_len=7,
)
print('events:', len(meta['event_names']),
'train/val/test windows:', len(train.dataset), len(val.dataset), len(test.dataset))
"
.
├── README.md
├── LICENSE                      MIT for code; CC BY 4.0 for author-created data
├── DATASHEET.md                 Datasheet excerpt
├── CHANGELOG.md
├── requirements.txt
├── event_config.py              EventConfig dataclass + 67-event registry loader
├── common/                      Shared schemas, IO and time utilities
├── benchmark/
│   ├── data_loader.py           Per-event dataloader with split-internal imputation
│   ├── train.py                 Generic training loop
│   ├── evaluate.py              MAE / MSE / MAE_reply(k%) metrics
│   ├── mae_reply_utils.py       Reply-ratio computation from edges.jsonl
│   └── cma/                     CMA reference probe
│       ├── dataset.py
│       ├── blocks.py
│       └── run_cma.py
├── schema/                      Field-level documentation
├── references/
│   ├── README.md                Pointers to upstream baseline repositories
│   └── code/MM-TSFlib-main/     Vendored MM-TSFlib backbone (MIT)
└── data/
    ├── events/                  Per-event released artifacts
    └── synthetic_examples/      Two synthetic mini-events demonstrating the schema
See LICENSE. Code is MIT; author-created derivative data under
data/ is CC BY 4.0.
A BibTeX entry will be added at camera-ready time.