Official implementation for the IEEE TNNLS paper:
Multimodal Aspect-based Sentiment Analysis with Plugin-enhanced Large Language Models
Yuanhe Tian, Yan Song, and Yongdong Zhang
This repository provides a plugin-enhanced multimodal large language model for multimodal aspect-based sentiment analysis (MABSA). The model augments a LLaVA-style multimodal LLM with lightweight visual and textual plugins, and uses a memory-based hub to inject task-specific multimodal knowledge into the LLM.
MABSA aims to predict the sentiment polarity toward a given aspect term from paired text and image inputs. The proposed approach improves multimodal LLM adaptation for MABSA by:
- encoding salient visual object regions with a visual plugin;
- encoding word-level textual relations with a text plugin;
- modeling plugin knowledge with attentive graph convolutional networks;
- integrating visual and textual plugin outputs through a memory-based hub;
- jointly training the plugins, hub, and multimodal LLM for task-specific adaptation.
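To make the idea of attention-weighted graph convolution concrete, here is a minimal NumPy sketch of a single layer. It is an illustration under stated assumptions, not the code in plugins/: the function name, the dot-product scoring, and the ReLU activation are all choices made for this example, and the adjacency matrix is assumed to contain self-loops.

```python
import numpy as np


def attentive_gcn_layer(h, adj, weight, bias):
    """One attention-weighted graph convolution step (illustrative only).

    h:      (n, d) node representations
    adj:    (n, n) 0/1 adjacency matrix, assumed to include self-loops
    weight: (d, d_out) projection matrix
    bias:   (d_out,) bias vector
    """
    scores = h @ h.T                                      # pairwise dot-product scores
    scores = np.where(adj > 0, scores, -np.inf)           # keep only connected pairs
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)         # softmax over neighbors
    return np.maximum(attn @ (h @ weight + bias), 0.0)    # aggregate, then ReLU


# Toy chain graph with 4 nodes and 8-dimensional features.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
adj = np.eye(4) + np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
out = attentive_gcn_layer(h, adj, rng.normal(size=(8, 8)), np.zeros(8))
print(out.shape)  # (4, 8)
```

Unlike a plain GCN, which weights all neighbors equally, each neighbor here contributes in proportion to a learned or computed attention weight, so less relevant edges can be down-weighted.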
MABSA-LLMPlug/
|-- llava/ # LLaVA-based model, training, serving, and evaluation code
|-- plugins/ # Visual plugin, text plugin, GCN module, and hub
|-- scripts/ # DeepSpeed configuration files
|-- train_mem.py # Main training entry point
|-- test_plugin.py # Inference and evaluation entry point
|-- README.md
`-- LICENSE
This codebase follows the LLaVA training stack and additionally uses the plugin modules in plugins/.
Please prepare a LLaVA-compatible Python environment with PyTorch, Transformers, DeepSpeed, FlashAttention, OpenCV, scikit-learn, NLTK, Safetensors, and the other packages imported by the repository.
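As a quick sanity check before training, a short script along the following lines can report which of the listed packages are not importable. This is a sketch rather than part of the repository; the helper name missing_packages is hypothetical, and the list covers only the packages named above, not every import in the codebase.

```python
import importlib.util

# Import names for the packages listed above. The import name differs from
# the package name for FlashAttention, OpenCV, and scikit-learn.
REQUIRED = {
    "PyTorch": "torch",
    "Transformers": "transformers",
    "DeepSpeed": "deepspeed",
    "FlashAttention": "flash_attn",
    "OpenCV": "cv2",
    "scikit-learn": "sklearn",
    "NLTK": "nltk",
    "Safetensors": "safetensors",
}


def missing_packages(required=REQUIRED):
    """Return the display names of packages that cannot be located."""
    return [name for name, module in required.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing:", ", ".join(missing) if missing else "none")
```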
The main external checkpoints used by the training script are:
- a LLaVA-compatible multimodal LLM checkpoint, passed by --model_name_or_path;
- the CLIP vision tower used by LLaVA, passed by --vision_tower;
- a ViT checkpoint for the visual plugin, passed by --visual_plugin_model_path;
- a BERT checkpoint for the text plugin, passed by --text_plugin_model_path.
The training data should be a JSON file in the LLaVA conversation format, augmented with MABSA labels and plugin features. Each instance is expected to include image information, conversations, the gold sentiment label, text features, and visual features.
For evaluation with test_plugin.py, each instance should provide the fields used by the inference script, including image_path, sentence, aspect term, label, text_features, and visual_features.
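Before running evaluation, it can help to confirm that every instance carries the expected fields. The sketch below assumes the test file is a JSON list of dictionaries. The key "aspect" is a placeholder, because the exact key for the aspect term is not spelled out above; adjust it to match your data. The function names are hypothetical and not part of the repository.

```python
import json

# Field names taken from the description above. The key for the aspect term
# is not given explicitly there, so "aspect" is a placeholder assumption.
REQUIRED_FIELDS = ("image_path", "sentence", "aspect", "label",
                   "text_features", "visual_features")


def find_incomplete(instances, required=REQUIRED_FIELDS):
    """Return (index, missing_fields) for every instance lacking a field."""
    problems = []
    for i, inst in enumerate(instances):
        missing = [field for field in required if field not in inst]
        if missing:
            problems.append((i, missing))
    return problems


def check_file(path):
    """Load a test file (assumed to be a JSON list) and report gaps."""
    with open(path, encoding="utf-8") as fh:
        return find_incomplete(json.load(fh))
```

Calling check_file("path_to_test_file") returns an empty list when every instance is complete.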
Update the following paths before training:
- path_to_llava: LLaVA-compatible base checkpoint;
- path_to_train_file: training JSON file;
- path_to_image_folder: folder containing the image files;
- path_to_clip: CLIP vision tower checkpoint;
- path_to_output_dir: output directory for checkpoints;
- path_to_vit_base: ViT checkpoint for the visual plugin;
- path_to_bert: BERT checkpoint for the text plugin.
deepspeed --include localhost --master_port=12345 train_mem.py \
--deepspeed ./scripts/zero3.json \
--model_name_or_path path_to_llava \
--version v1 \
--data_path path_to_train_file \
--image_folder path_to_image_folder \
--vision_tower path_to_clip \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir path_to_output_dir \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 10 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 5 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--visual_plugin \
--visual_plugin_model_path path_to_vit_base \
--text_plugin \
--text_plugin_model_path path_to_bert \
--gcn_layer_num 3 \
--use_hub \
--hub_memory_size 20 \
--hub_output_size 8 \
--hub_hidden_size 768
Checkpoints and trainer states are saved to path_to_output_dir.
After training, evaluate a checkpoint with:
python test_plugin.py \
--model-path path_to_trained_checkpoint \
--data_home path_to_image_folder \
--test_data path_to_test_file \
--output_dir path_to_eval_output_dir \
--outfile results.json \
--visual_plugin_model_path path_to_vit_base \
--text_plugin_model_path path_to_bert
If the checkpoint requires a separate base model, also pass --model-base path_to_llava.
The script saves predictions and aggregate metrics, including accuracy and macro-F1, to the specified output JSON file.
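For reference, the two reported metrics can be recomputed from gold and predicted labels as sketched below. This is a dependency-free illustration, not the code in test_plugin.py; the function name and the label strings in the example are assumptions. Macro-F1 is taken here as the unweighted mean of per-class F1 over every label that appears in either list.

```python
def accuracy_and_macro_f1(gold, pred):
    """Compute accuracy and macro-averaged F1 for two equal-length label lists."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    pairs = list(zip(gold, pred))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    f1_scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Hypothetical labels for illustration only.
gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "neutral", "neutral", "positive"]
acc, macro_f1 = accuracy_and_macro_f1(gold, pred)
print(acc, round(macro_f1, 4))  # 0.75 0.5556
```

Macro-F1 weights every class equally, so it is more sensitive than accuracy to errors on rare sentiment classes.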
If you use this code or find our work helpful, please cite:
@article{tian2025multimodal,
title={Multimodal aspect-based sentiment analysis with plugin-enhanced large language models},
author={Tian, Yuanhe and Song, Yan and Zhang, Yongdong},
journal={IEEE Transactions on Neural Networks and Learning Systems},
year={2025},
publisher={IEEE}
}
This project is released under the MIT License. See LICENSE for details.