🔮 Welcome to the official code repository of our TA-STVG. We're excited to share our work with you.
Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding
Xin Gu, Yaojie Shen, Chenxi Luo, Tiejian Luo, Yan Huang, Yuewei Lin, Heng Fan
International Conference on Learning Representations (ICLR), 2025. (Oral)
[arXiv]
💡 Motivation: In the decoding procedure, object queries are expected to learn target position information from the multimodal features. If the object queries know the target from the very beginning, i.e., they know what to learn, they can employ target-specific cues as a prior to guide their interaction with the multimodal features, which benefits learning more discriminative features for better localization.
Figure: Comparison between existing Transformer-based STVG methods, which apply zero-initialized queries for STVG in (a), and our proposed Target-Aware Transformer-based STVG, which generates queries with target-aware cues from video and text in (b).
Figure: Overview of TA-STVG, which exploits target-specific information from the video and text (i.e., features from multimodal encoder) for generating spatial and temporal object queries for STVG. More details can be seen in the paper.
The datasets are placed in the `data` folder with the following structure:
```
data
|_ vidstg
| |_ videos
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ annos
| | |_ train.json
| | |_ ...
| |_ sent_annos
| | |_ train_annotations.json
| | |_ ...
| |_ data_cache
| | |_ ...
|_ hc-stvg2
| |_ v2_video
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ annos
| | |_ train.json
| | |_ test.json
| |_ data_cache
| | |_ ...
|_ hc-stvg
| |_ v1_video
| | |_ [video name 0].mp4
| | |_ [video name 1].mp4
| | |_ ...
| |_ annos
| | |_ train.json
| | |_ test.json
| |_ data_cache
| | |_ ...
```
First, download the data file, which contains the annotations for all three datasets. Then, download the videos of each dataset and put them into the specified paths:

HC-STVG (v1_video), HC-STVG2 (v2_video), VidSTG (videos)
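To quickly confirm the downloads landed in the expected places, the layout above can be checked with a short script. This is a minimal sketch, not part of this repo; the `EXPECTED` map and `check_data_layout` helper are illustrative names:

```python
from pathlib import Path

# Expected subdirectories per dataset, taken from the layout above.
EXPECTED = {
    "vidstg": ["videos", "annos", "sent_annos"],
    "hc-stvg2": ["v2_video", "annos"],
    "hc-stvg": ["v1_video", "annos"],
}

def check_data_layout(root="data"):
    """Return the list of expected directories missing under the data root."""
    root = Path(root)
    missing = []
    for dataset, subdirs in EXPECTED.items():
        for sub in subdirs:
            path = root / dataset / sub
            if not path.is_dir():
                missing.append(str(path))
    return missing
```

An empty return value means the directory structure matches the tree shown above (the `data_cache` folders are created during training, so they are not checked here).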
The pretrained weights are placed in the `model_zoo` folder:

ResNet-101, VidSwin-T, roberta-base
The code has been tested with PyTorch 2.0.1 and CUDA 11.7, though other versions are likely compatible as well. To install the requirements, use the commands below:
```shell
pip3 install -r requirements.txt
apt install ffmpeg -y
```

To train the model, please use the scripts below:
```shell
# run for HC-STVG
python3 -m torch.distributed.launch \
    --nnodes $WORKER_NUM \
    --node_rank $NODE_ID \
    --nproc_per_node $WORKER_GPU \
    --master_addr $WORKER_0_HOST \
    --master_port $PORT \
    scripts/train_net.py \
    --config-file "experiments/hcstvg.yaml" \
    OUTPUT_DIR output/hcstvg \
    TENSORBOARD_DIR output/hcstvg

# run for HC-STVG2
python3 -m torch.distributed.launch \
    --nnodes $WORKER_NUM \
    --node_rank $NODE_ID \
    --nproc_per_node $WORKER_GPU \
    --master_addr $WORKER_0_HOST \
    --master_port $PORT \
    scripts/train_net.py \
    --config-file "experiments/hcstvg2.yaml" \
    OUTPUT_DIR output/hcstvg2 \
    TENSORBOARD_DIR output/hcstvg2

# run for VidSTG
python3 -m torch.distributed.launch \
    --nnodes $WORKER_NUM \
    --node_rank $NODE_ID \
    --nproc_per_node $WORKER_GPU \
    --master_addr $WORKER_0_HOST \
    --master_port $PORT \
    scripts/train_net.py \
    --config-file "experiments/vidstg.yaml" \
    OUTPUT_DIR output/vidstg \
    TENSORBOARD_DIR output/vidstg
```

For additional training options, such as different hyper-parameters, adjust the configurations in `experiments/hcstvg.yaml`, `experiments/hcstvg2.yaml`, and `experiments/vidstg.yaml` as needed. All training was completed on 32 A100 GPUs.
To evaluate the trained models, please use the scripts below:

```shell
# run for HC-STVG
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    scripts/test_net.py \
    --config-file "experiments/hcstvg.yaml" \
    MODEL.WEIGHT [Pretrained Model Weights] \
    OUTPUT_DIR output/hcstvg

# run for HC-STVG2
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    scripts/test_net.py \
    --config-file "experiments/hcstvg2.yaml" \
    MODEL.WEIGHT [Pretrained Model Weights] \
    OUTPUT_DIR output/hcstvg2

# run for VidSTG
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    scripts/test_net.py \
    --config-file "experiments/vidstg.yaml" \
    MODEL.WEIGHT [Pretrained Model Weights] \
    OUTPUT_DIR output/vidstg
```

We provide our trained checkpoints for reproducibility:
| Dataset | Resolution | Url | m_tIoU/m_vIoU/vIoU@0.3/vIoU@0.5 | Size |
|---|---|---|---|---|
| HC-STVG | 420 | Model | 53.0 / 39.1 / 63.1 / 36.8 | 1.9 GB |
| HC-STVG2 | 420 | Model | 60.4 / 40.2 / 65.8 / 36.7 | 1.9 GB |
| VidSTG | 420 | Model | 51.7 / 34.4 / 48.2 / 33.5 | 1.9 GB |
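For reference, the metrics in the table follow the standard STVG definitions: m_tIoU is the mean temporal IoU between predicted and ground-truth segments, m_vIoU averages the per-sample vIoU (spatial IoU summed over the temporally intersecting frames, normalized by the size of the temporal union), and vIoU@R is the fraction of samples whose vIoU exceeds R. A minimal sketch of these definitions follows; it is illustrative only, as the actual evaluation code used in this repo is borrowed from TubeDETR:

```python
def box_iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def video_metrics(pred_seg, gt_seg, pred_boxes, gt_boxes):
    """tIoU and vIoU for one sample.

    pred_seg / gt_seg: (start_frame, end_frame), inclusive.
    pred_boxes / gt_boxes: dicts mapping frame index -> box.
    """
    # Temporal IoU between predicted and ground-truth segments.
    inter = max(0, min(pred_seg[1], gt_seg[1]) - max(pred_seg[0], gt_seg[0]) + 1)
    union = (pred_seg[1] - pred_seg[0] + 1) + (gt_seg[1] - gt_seg[0] + 1) - inter
    t_iou = inter / union
    # vIoU: spatial IoU summed over frames in the temporal intersection,
    # normalized by the number of frames in the temporal union.
    s = sum(box_iou(pred_boxes[f], gt_boxes[f])
            for f in range(max(pred_seg[0], gt_seg[0]),
                           min(pred_seg[1], gt_seg[1]) + 1))
    return t_iou, s / union

def summarize(per_sample):
    """Aggregate a list of (t_iou, v_iou) pairs into the reported metrics."""
    n = len(per_sample)
    t = [x[0] for x in per_sample]
    v = [x[1] for x in per_sample]
    return {
        "m_tIoU": sum(t) / n,
        "m_vIoU": sum(v) / n,
        "vIoU@0.3": sum(x > 0.3 for x in v) / n,
        "vIoU@0.5": sum(x > 0.5 for x in v) / n,
    }
```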
This repo is partly based on the open-source release of STCAT, and the evaluation metric implementation is borrowed from TubeDETR for a fair comparison.
⭐ If you find this repository useful, please consider giving it a star and citing it:
@inproceedings{gu2025knowing,
title={Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding},
author={Gu, Xin and Shen, Yaojie and Luo, Chenxi and Luo, Tiejian and Huang, Yan and Lin, Yuewei and Fan, Heng and Zhang, Libo},
booktitle={International Conference on Learning Representations},
year={2025}
}