TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning

Abstract

Vision-Language Models (VLMs) have shown remarkable potential for advancing autonomous driving by using multi-modal fusion to enhance scene perception, reasoning, and decision-making. However, existing models suffer from high computational overhead and inefficient integration of multi-view sensor data, which makes them impractical for real-time deployment in safety-critical autonomous driving applications. To address these shortcomings, this paper presents TS-VLM, a lightweight VLM that incorporates a novel Text-Guided SoftSort Pooling (TGSSP) module. Using the semantics of the input query, TGSSP ranks and fuses visual features from multiple views, enabling dynamic, query-aware multi-view aggregation without relying on costly attention mechanisms. This design prioritizes semantically relevant views for each query, which improves contextual accuracy in multi-view reasoning for autonomous driving. Extensive evaluations on the DriveLM benchmark show that TS-VLM outperforms state-of-the-art models, with a BLEU-4 score of 56.82, METEOR of 41.91, ROUGE-L of 74.64, and CIDEr of 3.39. At the same time, TS-VLM reduces computational cost by up to 90%; its smallest version contains only 20.1 million parameters, making it more practical for real-time deployment in autonomous vehicles.
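
To make the idea of text-guided ranking and fusion more concrete, here is a minimal pure-Python sketch. It is not the code in TSVLM.py. It assumes each view and the query are already encoded as plain feature vectors, scores views by a dot product with the query, and fuses the soft-sorted views with reciprocal-rank weights. The scoring function, the fusion weights, and the names `softsort` and `text_guided_pool` are all illustrative assumptions; only the SoftSort relaxation itself (a row-wise softmax over negative distances between sorted and unsorted scores) follows the published operator.

```python
import math


def softsort(scores, tau=1.0):
    """Return a row-stochastic matrix that softly sorts `scores` in descending order.

    Row i puts most of its mass on the element with the i-th largest score.
    Smaller `tau` makes the matrix closer to a hard permutation.
    """
    sorted_scores = sorted(scores, reverse=True)
    matrix = []
    for target in sorted_scores:
        logits = [-abs(target - s) / tau for s in scores]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix


def text_guided_pool(view_feats, text_feat, tau=1.0):
    """Fuse per-view features into one vector, ordered by relevance to the text.

    Assumptions (not taken from the repository): relevance is a dot product,
    and rank i receives a fusion weight proportional to 1 / (i + 1).
    """
    num_views, dim = len(view_feats), len(text_feat)
    # Relevance of each view to the query.
    scores = [sum(v * t for v, t in zip(view, text_feat)) for view in view_feats]
    perm = softsort(scores, tau)
    # Soft-sorted views: row i is a blend dominated by the i-th most relevant view.
    ranked = [
        [sum(perm[i][j] * view_feats[j][d] for j in range(num_views)) for d in range(dim)]
        for i in range(num_views)
    ]
    # Hypothetical fixed rank weights, normalized to sum to 1.
    raw = [1.0 / (i + 1) for i in range(num_views)]
    weights = [w / sum(raw) for w in raw]
    return [sum(weights[i] * ranked[i][d] for i in range(num_views)) for d in range(dim)]


if __name__ == "__main__":
    views = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three toy camera views
    query = [1.0, 0.0]                            # toy text embedding
    print(text_guided_pool(views, query, tau=0.1))
```

Because every step is a softmax or a weighted sum, the ranking stays differentiable, which is what lets a module of this kind be trained end to end without a pairwise attention map over all views.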

Usage

Installation

git clone https://github.com/AiX-Lab-UWO/TS-VLM.git
cd TS-VLM
conda env create -f environment.yaml
conda activate TSVLM

Data Preparation

Step 1: Download the dataset from the link below: data
Step 2: Organize the downloaded files in the following way.

├─ data
├─ train.py
├─ eval.py
├─ TSVLM_dataset.py
├─ TSVLM.py
...
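
Before training, it can help to confirm that the working directory matches the layout above. The following is a small optional sketch, not part of the repository; the helper name `missing_entries` is hypothetical, and it checks only the entries named in the tree, since the contents of `data` and the elided entries are not specified here.

```python
from pathlib import Path

# Entries named in the layout above; anything elided there is not checked.
EXPECTED = ["data", "train.py", "eval.py", "TSVLM_dataset.py", "TSVLM.py"]


def missing_entries(root=".", expected=EXPECTED):
    """Return the expected files or folders that are absent under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing from the project root:", ", ".join(missing))
    else:
        print("Project layout looks complete.")
```

Run it from the repository root after placing the downloaded `data` folder next to the scripts.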

Train

Training results are saved to ./results. For additional training hyperparameters, see the full argument list in train.py.

python train.py --lm T5-Tiny

Evaluation

To evaluate the trained model, use the following command. For additional evaluation options, see the full argument list in eval.py.

python eval.py --model-name your_modelname

License

This project is licensed under the MIT License. See the LICENSE file for more details.
