Skip to content

synlp/MABSA-LLMPlug

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MABSA-LLMPlug

Official implementation for the IEEE TNNLS paper:

Multimodal Aspect-based Sentiment Analysis with Plugin-enhanced Large Language Models

Yuanhe Tian, Yan Song, and Yongdong Zhang

This repository provides a plugin-enhanced multimodal large language model for multimodal aspect-based sentiment analysis (MABSA). The model augments a LLaVA-style multimodal LLM with lightweight visual and textual plugins, and uses a memory-based hub to inject task-specific multimodal knowledge into the LLM.

Overview

MABSA aims to predict the sentiment polarity toward a given aspect term from paired text and image inputs. The proposed approach improves multimodal LLM adaptation for MABSA by:

  • encoding salient visual object regions with a visual plugin;
  • encoding word-level textual relations with a text plugin;
  • modeling plugin knowledge with attentive graph convolutional networks;
  • integrating visual and textual plugin outputs through a memory-based hub;
  • jointly training the plugins, hub, and multimodal LLM for task-specific adaptation.

Repository Structure

MABSA-LLMPlug/
|-- llava/                  # LLaVA-based model, training, serving, and evaluation code
|-- plugins/                # Visual plugin, text plugin, GCN module, and hub
|-- scripts/                # DeepSpeed configuration files
|-- train_mem.py            # Main training entry point
|-- test_plugin.py          # Inference and evaluation entry point
|-- README.md
`-- LICENSE

Environment

This codebase follows the LLaVA training stack and additionally uses the plugin modules in plugins/. Please prepare a LLaVA-compatible Python environment with PyTorch, Transformers, DeepSpeed, FlashAttention, OpenCV, scikit-learn, NLTK, Safetensors, and the other packages imported by the repository.

The main external checkpoints used by the training script are:

  • a LLaVA-compatible multimodal LLM checkpoint, passed by --model_name_or_path;
  • the CLIP vision tower used by LLaVA, passed by --vision_tower;
  • a ViT checkpoint for the visual plugin, passed by --visual_plugin_model_path;
  • a BERT checkpoint for the text plugin, passed by --text_plugin_model_path.

Data Preparation

The training data should be a JSON file in the LLaVA conversation format, augmented with MABSA labels and plugin features. Each instance is expected to include image information, conversations, the gold sentiment label, text features, and visual features.

For evaluation with test_plugin.py, each instance should provide the fields used by the inference script, including image_path, sentence, aspect term, label, text_features, and visual_features.

Training

Update the following paths before training:

  • path_to_llava: LLaVA-compatible base checkpoint;
  • path_to_train_file: training JSON file;
  • path_to_image_folder: folder containing the image files;
  • path_to_clip: CLIP vision tower checkpoint;
  • path_to_output_dir: output directory for checkpoints;
  • path_to_vit_base: ViT checkpoint for the visual plugin;
  • path_to_bert: BERT checkpoint for the text plugin.
deepspeed --include localhost --master_port=12345 train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path path_to_llava \
    --version v1 \
    --data_path path_to_train_file \
    --image_folder path_to_image_folder \
    --vision_tower path_to_clip \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir path_to_output_dir \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 5 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --visual_plugin \
    --visual_plugin_model_path path_to_vit_base \
    --text_plugin \
    --text_plugin_model_path path_to_bert \
    --gcn_layer_num 3 \
    --use_hub \
    --hub_memory_size 20 \
    --hub_output_size 8 \
    --hub_hidden_size 768

Checkpoints and trainer states are saved to path_to_output_dir.

Evaluation

After training, evaluate a checkpoint with:

python test_plugin.py \
    --model-path path_to_trained_checkpoint \
    --data_home path_to_image_folder \
    --test_data path_to_test_file \
    --output_dir path_to_eval_output_dir \
    --outfile results.json \
    --visual_plugin_model_path path_to_vit_base \
    --text_plugin_model_path path_to_bert

If the checkpoint requires a separate base model, also pass --model-base path_to_llava.

The script saves predictions and aggregate metrics, including accuracy and macro-F1, to the specified output JSON file.

Citation

If you use this code or find our work helpful, please cite:

@article{tian2025multimodal,
  title={Multimodal aspect-based sentiment analysis with plugin-enhanced large language models},
  author={Tian, Yuanhe and Song, Yan and Zhang, Yongdong},
  journal={IEEE Transactions on Neural Networks and Learning Systems},
  year={2025},
  publisher={IEEE}
}

License

This project is released under the MIT License. See LICENSE for details.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors