This repository contains the official implementation for the paper "[Nonparametric Data Attribution for Diffusion Models]".
[arXiv]
To run the experiments, you'll need to set up the appropriate environment.
-
Clone the Repository:
git clone https://github.com/sail-sg/NDA.git
-
Create a Conda Environment:
conda create -n nda python=3.9 -y conda activate nda
-
Install Dependencies:
pip install -r requirements.txt
We provide commands for running experiments on CIFAR-2. And it can be easily transfered to other datasets.
-
Navigate to the Dataset Directory:
cd CIFAR2 -
Prepare the Dataset: Run the
00_EDA.ipynbnotebook to generate the CIFAR2 dataset and prepare the necessary subsets for LDS. -
Train the Diffusion Model: Use the provided script to train a diffusion model.
bash scripts/run_train.sh 0 18888 5000-0.5
-
Generate Images: After training, generate images using:
bash scripts/run_gen.sh 0 0 5000-0.5
-
Compute Attribution Score: Use the following scripts to compute attribution scores on either the generated set (
gen) or the validation set (val).
Key Parameters
-
t_fixed: diffusion timestep, e.g.,100/200/300/400/500. -
patch_size: patch size for original scale. -
topk: keep only top-K patches for each image. -
Downscale / Two-scale only:
patch_size_downandalpha: the downscaled patch size and the ratio between single-scale weight and downscaled weight.
-
gen_source: choosegenorval. -
mask_value: penalty for invalid patch locations. -
kernel_batch_size: controls GPU memory footprint—larger values are faster but use more VRAM; smaller values reduce memory usage at the cost of speed. Tune up to the largest value that does not cause OOM on your GPU.bash scripts/run_score_orig.sh 0 5000-0.5 0 1000 100 7 100 gen 10 64 bash scripts/run_score_orig.sh 0 5000-0.5 0 1000 100 7 100 val 10 64 bash scripts/run_score_downscale.sh 0 5000-0.5 0 1000 100 14 100 gen 10 64 bash scripts/run_score_downscale.sh 0 5000-0.5 0 1000 100 14 100 val 10 64 bash scripts/run_score_twoscale.sh 0 5000-0.5 0 1000 100 14 100 10 0.5 gen 10 64 bash scripts/run_score_twoscale.sh 0 5000-0.5 0 1000 100 14 100 10 0.5 val 10 64
-
LDS Benchmark
Train 64 models corresponding to 64 subsets of the training set.
bash scripts/run_lds_val_sub.sh 0 18888 5000-0.5 0 63Evaluate the model outputs on the validation set
bash scripts/run_eval_lds_val_sub.sh 0 0 5000-0.5 idx_val.pkl 0 63 bash scripts/run_eval_lds_val_sub.sh 0 1 5000-0.5 idx_val.pkl 0 63 bash scripts/run_eval_lds_val_sub.sh 0 2 5000-0.5 idx_val.pkl 0 63Evaluate the model outputs on the generation set
bash scripts/run_eval_lds_val_sub.sh 0 0 5000-0.5 idx_gen.pkl 0 63 bash scripts/run_eval_lds_val_sub.sh 0 1 5000-0.5 idx_gen.pkl 0 63 bash scripts/run_eval_lds_val_sub.sh 0 2 5000-0.5 idx_gen.pkl 0 63 -
Evaluate the LDS scores Run notebooks in evaluation:
If you find this project useful in your research, please consider citing our paper:
@article{zhao2025nonparametricdataattributiondiffusion,
title={Nonparametric Data Attribution for Diffusion Models},
author={Yutian Zhao and Chao Du and Xiaosen Zheng and Tianyu Pang and Min Lin},
journal={arXiv preprint arXiv:2510.14269},
year={2025}
}This repository is partially adapted from the official implementation of Intriguing Properties of Data Attribution on Diffusion Models. We thank the authors for making their code and LDS dataset available (Code Link).
