
MTG/dcase2026_task1_baseline


DCASE2026 Challenge: Task 1 - Baseline System

Contact: Panagiota Anastasopoulou (panagiota.anastasopoulou@upf.edu), Music Technology Group, Universitat Pompeu Fabra, Barcelona

Heterogeneous Audio Classification

This task focuses on heterogeneous audio classification using the Broad Sound Taxonomy (BST), which comprises 5 top-level and 23 second-level sound categories. The goal is to evaluate sound classification models on diverse, real-world audio that varies widely in nature, duration, and recording conditions. To that end, two complementary Freesound-based datasets are provided: a curated set, BSD10k-v1.2, and a larger, noisier, crowd-sourced collection, BSD35k-CS, which reflects real-world labeling variability. Participants are encouraged to explore audio-based and multimodal approaches to sound classification, as well as to leverage the hierarchical relationships between taxonomy categories.

For a detailed description of the challenge and this task visit the DCASE website.

Baseline System

This repository contains the code for the baseline systems of the DCASE 2026 Challenge Task 1. It provides a full pipeline for training and evaluating an audio classification model using precomputed audio and text embeddings.

As a baseline system, we use variations of the HATR model presented at the DCASE Workshop [1]. We include a multimodal and an audio-only version; both are non-hierarchical models trained on audio (and text) representation vectors extracted with the 630k-audioset-fusion-best.pt checkpoint of the pretrained LAION-CLAP model.

The model is characterized by:

  • Multimodality: Supports both audio and text input (embeddings) with separate encoders
  • Attention-based fusion: Learns to weight modalities dynamically
  • Residual-based classifier: Stacked residual blocks
  • Data augmentation: Gaussian noise and random masking
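As a rough illustration of the attention-based fusion step, the sketch below weights the two modality embeddings with softmax-normalized scores. The scoring vector `w` is a stand-in for the model's learned attention parameters and an assumption for illustration, not the baseline's actual architecture:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fusion(audio_emb, text_emb, w):
    """Fuse two modality embeddings with softmax attention weights.

    audio_emb, text_emb: (d,) embedding vectors (e.g. 512-dim CLAP output).
    w: (d,) scoring vector, a stand-in for the learned attention parameters.
    """
    scores = np.array([w @ audio_emb, w @ text_emb])
    alpha = softmax(scores)          # per-modality weights, sum to 1
    return alpha[0] * audio_emb + alpha[1] * text_emb
```

Because the weights are computed per example, the fusion can lean on the text embedding when it is informative and fall back to audio otherwise.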

For the evaluation phase, in addition to standard accuracy (micro and macro at both taxonomy levels), we compute hierarchical metrics (accuracy, precision, recall, F1) as part of the challenge's rules and ranking system.
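To make the set-based hierarchical metrics concrete, here is a minimal sketch under the assumption that each label is a (top-level, second-level) pair and that a label's ancestor set is the pair plus its top-level parent; the challenge's official scorer may differ in detail:

```python
def ancestors(label):
    """Ancestor set of a (top_level, second_level) label pair."""
    top, second = label
    return {top, (top, second)}

def hierarchical_prf(y_true, y_pred):
    """Set-based hierarchical precision, recall, and F1.

    A correct top-level prediction earns partial credit even when the
    second-level category is wrong.
    """
    inter = pred_tot = true_tot = 0
    for t, p in zip(y_true, y_pred):
        at, ap = ancestors(t), ancestors(p)
        inter += len(at & ap)      # shared ancestors
        pred_tot += len(ap)
        true_tot += len(at)
    hp = inter / pred_tot
    hr = inter / true_tot
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf
```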

Quick Start

  1. Clone this repository.

  2. Create and activate a conda environment:

conda create -n hac python=3.13
conda activate hac
  3. Install requirements:
pip3 install -r requirements.txt

You can edit the PyTorch version if necessary to suit your system.

  4. Download and extract the datasets: BSD10k-v1.2 and BSD35k-CS. Their file structure is described in their READMEs.

  5. Specify the input and output paths in config.yaml. Make sure all paths point to the correct directories or files before running the model. By default, all generated files for internal model use are stored in the data/ directory. We also assume that datasets are placed or symlinked into this directory.
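For illustration, a path section of config.yaml might look like the sketch below; the key names here are assumptions for illustration, so check the repository's shipped config for the real schema:

```yaml
# Hypothetical sketch -- the actual config.yaml defines its own keys.
paths:
  data_dir: data/                  # generated files for internal model use
  bsd10k_dir: data/BSD10k-v1.2     # extracted or symlinked dataset
  bsd35k_dir: data/BSD35k-CS
  output_dir: results/
```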

  6. Run the data preparation script:

python build_dataset.py
  7. Train the model:
python train_test.py

This script includes k-fold training and evaluation of the models on their respective test sets.
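The k-fold loop can be pictured with a generic index-splitting helper like the one below. This is a sketch only: the baseline's actual fold assignments come from the dataset metadata, not from random splits:

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Yield (train, test) index arrays for k-fold cross-validation.

    Each of the n samples appears in exactly one test fold; the remaining
    samples form the corresponding training set.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)             # shuffle once, then partition
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```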

To run evaluation alone, use evaluate.py.

  8. Summarize the results:
python summarize_results.py

This script summarizes results for each model across all 5 folds.

Note: You can skip steps 6-8 and simply run: python main.py

Baseline Results

Pending (to be added soon)

Citations

[1] Panagiota Anastasopoulou, Jessica Torrey, Xavier Serra, and Frederic Font. Heterogeneous sound classification with the Broad Sound Taxonomy and Dataset. In Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE). 2024.
