🔮 Welcome to the official code repository of our TA-STVG. We're excited to share our work with you.
Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding
Xin Gu, Yaojie Shen, Chenxi Luo, Tiejian Luo, Yan Huang, Yuewei Lin, Heng Fan
International Conference on Learning Representations (ICLR), 2025. (Oral)
[arXiv]
💡 Motivation: In the decoding procedure, object queries are expected to learn target position information from the multimodal features. If the object queries know the target from the very beginning, i.e., they know what to learn, they can employ target-specific cues as a prior to guide their interaction with the multimodal features, which benefits learning more discriminative features for better localization.
Figure: Comparison between existing Transformer-based STVG methods, which apply zero-initialized queries for STVG in (a), and our proposed Target-Aware Transformer-based STVG, which generates queries with target-aware cues from video and text in (b).
Figure: Overview of TA-STVG, which exploits target-specific information from the video and text (i.e., features from multimodal encoder) for generating spatial and temporal object queries for STVG. More details can be seen in the paper.
The datasets are placed in the `data` folder with the following structure:
```
data
|_ vidstg
| |_ videos
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ annos
| | |_ train.json
| | |_ ...
| |_ sent_annos
| | |_ train_annotations.json
| | |_ ...
| |_ data_cache
| | |_ ...
|_ hc-stvg2
| |_ v2_video
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ annos
| | |_ train.json
| | |_ test.json
| |_ data_cache
| | |_ ...
|_ hc-stvg
| |_ v1_video
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ annos
| | |_ train.json
| | |_ test.json
| |_ data_cache
| | |_ ...
```
First, download the data file, which contains the annotations for all three datasets. Then, download the videos of each dataset and put them into the specified paths:

HC-STVG (v1_video), HC-STVG2 (v2_video), VidSTG (videos)
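To quickly confirm the downloads landed in the expected places, the layout above can be checked with a short script. This is a minimal sketch, not part of this repo; the `EXPECTED` map and `check_data_layout` helper are illustrative names:

```python
from pathlib import Path

# Expected subdirectories per dataset, taken from the layout above.
EXPECTED = {
    "vidstg": ["videos", "annos", "sent_annos"],
    "hc-stvg2": ["v2_video", "annos"],
    "hc-stvg": ["v1_video", "annos"],
}

def check_data_layout(root="data"):
    """Return the list of expected directories missing under the data root."""
    root = Path(root)
    missing = []
    for dataset, subdirs in EXPECTED.items():
        for sub in subdirs:
            path = root / dataset / sub
            if not path.is_dir():
                missing.append(str(path))
    return missing
```

An empty return value means the directory structure matches the tree shown above (the `data_cache` folders are created during training, so they are not checked here).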
The pretrained weights are placed in the `model_zoo` folder:

ResNet-101, VidSwin-T, roberta-base
The code has been tested with PyTorch 2.0.1 and CUDA 11.7, though other versions are likely compatible as well. To install the requirements, use the commands below:
```shell
pip3 install -r requirements.txt
apt install ffmpeg -y
```

To train the model, please use the scripts below:
```shell
# run for HC-STVG
python3 -m torch.distributed.launch \
    --nnodes $WORKER_NUM \
    --node_rank $NODE_ID \
    --nproc_per_node $WORKER_GPU \
    --master_addr $WORKER_0_HOST \
    --master_port $PORT \
    scripts/train_net.py \
    --config-file "experiments/hcstvg.yaml" \
    OUTPUT_DIR output/hcstvg \
    TENSORBOARD_DIR output/hcstvg

# run for HC-STVG2
python3 -m torch.distributed.launch \
    --nnodes $WORKER_NUM \
    --node_rank $NODE_ID \
    --nproc_per_node $WORKER_GPU \
    --master_addr $WORKER_0_HOST \
    --master_port $PORT \
    scripts/train_net.py \
    --config-file "experiments/hcstvg2.yaml" \
    OUTPUT_DIR output/hcstvg2 \
    TENSORBOARD_DIR output/hcstvg2

# run for VidSTG
python3 -m torch.distributed.launch \
    --nnodes $WORKER_NUM \
    --node_rank $NODE_ID \
    --nproc_per_node $WORKER_GPU \
    --master_addr $WORKER_0_HOST \
    --master_port $PORT \
    scripts/train_net.py \
    --config-file "experiments/vidstg.yaml" \
    OUTPUT_DIR output/vidstg \
    TENSORBOARD_DIR output/vidstg
```

For additional training options, such as different hyper-parameters, adjust the configurations in `experiments/hcstvg.yaml`, `experiments/hcstvg2.yaml`, and `experiments/vidstg.yaml` as needed. All training was completed on 32 A100 GPUs.
To evaluate the trained models, please use the scripts below:

```shell
# run for HC-STVG
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    scripts/test_net.py \
    --config-file "experiments/hcstvg.yaml" \
    MODEL.WEIGHT [Pretrained Model Weights] \
    OUTPUT_DIR output/hcstvg

# run for HC-STVG2
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    scripts/test_net.py \
    --config-file "experiments/hcstvg2.yaml" \
    MODEL.WEIGHT [Pretrained Model Weights] \
    OUTPUT_DIR output/hcstvg2

# run for VidSTG
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    scripts/test_net.py \
    --config-file "experiments/vidstg.yaml" \
    MODEL.WEIGHT [Pretrained Model Weights] \
    OUTPUT_DIR output/vidstg
```

We provide our trained checkpoints for reproducibility:
| Dataset | Resolution | Url | m_tIoU/m_vIoU/vIoU@0.3/vIoU@0.5 | Size |
|---|---|---|---|---|
| HC-STVG | 420 | Model | 53.0 / 39.1 / 63.1 / 36.8 | 1.9 GB |
| HC-STVG2 | 420 | Model | 60.4 / 40.2 / 65.8 / 36.7 | 1.9 GB |
| VidSTG | 420 | Model | 51.7 / 34.4 / 48.2 / 33.5 | 1.9 GB |
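For reference, the metrics in the table follow the standard STVG definitions: m_tIoU is the mean temporal IoU between predicted and ground-truth segments, m_vIoU averages the per-sample vIoU (spatial IoU summed over the temporally intersecting frames, normalized by the size of the temporal union), and vIoU@R is the fraction of samples whose vIoU exceeds R. A minimal sketch of these definitions follows; it is illustrative only, as the actual evaluation code used in this repo is borrowed from TubeDETR:

```python
def box_iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def video_metrics(pred_seg, gt_seg, pred_boxes, gt_boxes):
    """tIoU and vIoU for one sample.

    pred_seg / gt_seg: (start_frame, end_frame), inclusive.
    pred_boxes / gt_boxes: dicts mapping frame index -> box.
    """
    # Temporal IoU between predicted and ground-truth segments.
    inter = max(0, min(pred_seg[1], gt_seg[1]) - max(pred_seg[0], gt_seg[0]) + 1)
    union = (pred_seg[1] - pred_seg[0] + 1) + (gt_seg[1] - gt_seg[0] + 1) - inter
    t_iou = inter / union
    # vIoU: spatial IoU summed over frames in the temporal intersection,
    # normalized by the number of frames in the temporal union.
    s = sum(box_iou(pred_boxes[f], gt_boxes[f])
            for f in range(max(pred_seg[0], gt_seg[0]),
                           min(pred_seg[1], gt_seg[1]) + 1))
    return t_iou, s / union

def summarize(per_sample):
    """Aggregate a list of (t_iou, v_iou) pairs into the reported metrics."""
    n = len(per_sample)
    t = [x[0] for x in per_sample]
    v = [x[1] for x in per_sample]
    return {
        "m_tIoU": sum(t) / n,
        "m_vIoU": sum(v) / n,
        "vIoU@0.3": sum(x > 0.3 for x in v) / n,
        "vIoU@0.5": sum(x > 0.5 for x in v) / n,
    }
```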
This repo is partly based on the open-source release of STCAT, and the evaluation metric implementation is borrowed from TubeDETR for a fair comparison.
⭐ If you find this repository useful, please consider giving it a star and citing it:
@inproceedings{gu2025knowing,
title={Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding},
author={Gu, Xin and Shen, Yaojie and Luo, Chenxi and Luo, Tiejian and Huang, Yan and Lin, Yuewei and Fan, Heng and Zhang, Libo},
booktitle={International Conference on Learning Representations},
year={2025}
}