Website: vectorinstitute.github.io/humanibench | Paper: arxiv.org/abs/2505.11454 | Dataset: Hugging Face
As multimodal generative AI systems become increasingly integrated into human-centered applications, evaluating their alignment with human values has become critical.
HumaniBench is the first comprehensive benchmark designed to evaluate Large Multimodal Models (LMMs) on seven Human-Centered AI (HCAI) principles:
- Fairness
- Ethics
- Understanding
- Reasoning
- Language Inclusivity
- Empathy
- Robustness
This repository provides code and scripts for evaluating LMMs across 7 human-aligned tasks.
- 32,000+ Real-World Image-Question Pairs
- Human-Verified Ground Truth Annotations
- Multilingual QA Support (10+ languages)
- Open and Closed-Ended VQA Formats
- Visual Robustness & Bias Stress Testing
- Chain-of-Thought Reasoning + Perceptual Grounding
| Task | Focus | Folder |
|---|---|---|
| Task 1: Scene Understanding | Visual reasoning + bias/toxicity analysis in social attributes (gender, age, occupation, etc.) | src/task1_scene_understanding |
| Task 2: Instance Identity | Visual reasoning in culturally rich, socially grounded settings | src/task2_instance_identity |
| Task 3: Multiple Choice QA | Structured attribute recognition via multi-choice questions | src/task3_multiplechoice_vqa |
| Task 4: Multilingual Visual QA | VQA across 10+ languages, including low-resource ones | src/task4_multilingual |
| Task 5: Visual Grounding | Bounding box localization of socially salient regions | src/task5_visual_grounding |
| Task 6: Empathetic Captioning | Human-style emotional captioning evaluation | src/task6_empathetic_captioning |
| Task 7: Image Resilience | Robustness testing via image perturbations | src/task7_image_resilience |
Each task folder includes a README with setup instructions, task structure, and metrics.
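Task 5 evaluates bounding box localization. A common way to score a predicted box against a ground-truth box is Intersection-over-Union (IoU). The sketch below is illustrative only: the `(x_min, y_min, x_max, y_max)` box format and the `iou` helper are assumptions, not the repository's actual interface, and the task's own README is the authority on the metric it reports.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(pred: Box, gold: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes (hypothetical helper)."""
    # Corners of the overlapping rectangle, if the boxes overlap at all.
    ix_min = max(pred[0], gold[0])
    iy_min = max(pred[1], gold[1])
    ix_max = min(pred[2], gold[2])
    iy_max = min(pred[3], gold[3])

    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gold = (gold[2] - gold[0]) * (gold[3] - gold[1])
    union = area_pred + area_gold - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A prediction is often counted as correct when its IoU exceeds a chosen threshold such as 0.5; that threshold is a convention, not something this README specifies.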
Three-stage process:
- Data Collection: curated from global news imagery, tagged by social attributes (age, gender, race, occupation, sport)
- Annotation: GPT-4o-assisted labeling + human expert verification
- Evaluation: comprehensive scoring across:
  - Accuracy
  - Fairness
  - Robustness
  - Empathy
  - Faithfulness
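The evaluation stage above scores models on accuracy and fairness, among other criteria. One simple way to look at both together is to compute accuracy separately for each social-attribute group and report the spread between the best and worst group. This is a minimal sketch under stated assumptions: the `(group, correct)` record format and both function names are hypothetical, and the benchmark's own fairness metric may be defined differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy per group from (group_label, is_correct) pairs (assumed format)."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    return {group: hits[group] / totals[group] for group in totals}


def accuracy_gap(per_group: Dict[str, float]) -> float:
    """Difference between the best and worst group accuracy (0.0 if empty)."""
    if not per_group:
        return 0.0
    return max(per_group.values()) - min(per_group.values())


if __name__ == "__main__":
    # Toy records, invented purely for illustration.
    toy = [("group_a", True), ("group_a", False),
           ("group_b", True), ("group_b", True)]
    scores = accuracy_by_group(toy)
    print(scores, accuracy_gap(scores))
```

A gap near zero means the model performs similarly across groups on this measure; it says nothing about whether that shared accuracy is high or low, so it is best read alongside overall accuracy.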
- Bias persists, especially across gender and race
- Multilingual gaps affect low-resource language performance
- Empathy and ethics vary significantly by model family
- Chain-of-Thought reasoning improves performance but doesn't fully mitigate bias
- Robustness tests reveal fragility to noise, occlusion, and blur
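The robustness finding above refers to three kinds of image perturbation: noise, occlusion, and blur. To make those concrete, here is a dependency-free sketch that applies each one to a small grayscale image stored as nested lists. The function names, parameters, and image representation are assumptions chosen for illustration; the actual perturbation pipeline lives in src/task7_image_resilience and may work quite differently.

```python
import random
from typing import List

# Assumed representation: grayscale pixels in [0, 255], row-major nested lists.
Image = List[List[float]]


def add_gaussian_noise(img: Image, sigma: float, seed: int = 0) -> Image:
    """Add zero-mean Gaussian noise to every pixel, clipped to [0, 255]."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in img]


def occlude(img: Image, top: int, left: int, height: int, width: int) -> Image:
    """Black out a rectangle; the input image is left unmodified."""
    out = [row[:] for row in img]
    for r in range(max(0, top), min(len(out), top + height)):
        for c in range(max(0, left), min(len(out[r]), left + width)):
            out[r][c] = 0.0
    return out


def box_blur(img: Image) -> Image:
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    h = len(img)
    w = len(img[0]) if h else 0
    out: Image = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [img[rr][cc]
                      for rr in range(max(0, r - 1), min(h, r + 2))
                      for cc in range(max(0, c - 1), min(w, c + 2))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


if __name__ == "__main__":
    image = [[100.0] * 4 for _ in range(4)]
    print(occlude(image, 1, 1, 2, 2))
    print(box_blur(occlude(image, 1, 1, 2, 2)))
    print(add_gaussian_noise(image, sigma=10.0))
```

A robustness check would then compare a model's answers on the clean image with its answers on each perturbed copy; how that comparison is scored is left to the task's own README.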
Set up the development environment with uv. Make sure uv is installed, then run:

    uv sync
    source .venv/bin/activate

To also install the dependencies for testing (code style, unit tests, integration tests), run:

    uv sync --dev
    source .venv/bin/activate

To exclude the packages from a specific group (e.g. docs), run:

    uv sync --no-group docs

If you're coming from poetry, note that the virtual environment is stored in the project root folder and is named .venv by default. Also, while poetry uses a "flat" project layout, uv opts for the "src" layout.
If you use HumaniBench or this evaluation suite in your work, please cite:
@article{raza2025humanibench,
title={Humanibench: A human-centric framework for large multimodal models evaluation},
author={Raza, Shaina and Narayanan, Aravind and Khazaie, Vahid Reza and Vayani, Ashmal and Radwan, Ahmed Y and Chettiar, Mukund S and Singh, Amandeep and Shah, Mubarak and Pandya, Deval},
journal={arXiv preprint arXiv:2505.11454},
year={2025}
}
Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (vectorinstitute.ai/#partners).
This research was funded by the European Union's Horizon Europe research and innovation programme under the AIXPERT project (Grant Agreement No. 101214389), which aims to develop an agentic, multi-layered, GenAI-powered framework for creating explainable, accountable, and transparent AI systems.
For questions, collaborations, or dataset access requests, please open an issue in this repository or contact the corresponding author at shaina.raza@vectorinstitute.ai, as listed in the paper.
We invite researchers, developers, and policymakers to explore, evaluate, and extend HumaniBench.
