HumaniBench: A Human-Centric Benchmark for Large Multimodal Models Evaluation

🌐 Website: vectorinstitute.github.io/humanibench | 📄 Paper: arxiv.org/abs/2505.11454 | 📊 Dataset: Hugging Face

🧠 Overview

As multimodal generative AI systems become increasingly integrated into human-centered applications, evaluating their alignment with human values has become critical.

HumaniBench is the first comprehensive benchmark designed to evaluate Large Multimodal Models (LMMs) on seven Human-Centered AI (HCAI) principles:

Fairness
Ethics
Understanding
Reasoning
Language Inclusivity
Empathy
Robustness

This repository provides code and scripts for evaluating LMMs across 7 human-aligned tasks.

📦 Features

📷 32,000+ Real-World Image–Question Pairs
✅ Human-Verified Ground Truth Annotations
🌐 Multilingual QA Support (10+ languages)
🧠 Open and Closed-Ended VQA Formats
🧪 Visual Robustness & Bias Stress Testing
📑 Chain-of-Thought Reasoning + Perceptual Grounding

📂 Evaluation Tasks Overview

| Task | Focus | Folder |
| --- | --- | --- |
| Task 1: Scene Understanding | Visual reasoning + bias/toxicity analysis in social attributes (gender, age, occupation, etc.) | `src/task1_scene_understanding` |
| Task 2: Instance Identity | Visual reasoning in culturally rich, socially grounded settings | `src/task2_instance_identity` |
| Task 3: Multiple Choice QA | Structured attribute recognition via multi-choice questions | `src/task3_multiplechoice_vqa` |
| Task 4: Multilingual Visual QA | VQA across 10+ languages, including low-resource ones | `src/task4_multilingual` |
| Task 5: Visual Grounding | Bounding box localization of socially salient regions | `src/task5_visual_grounding` |
| Task 6: Empathetic Captioning | Human-style emotional captioning evaluation | `src/task6_empathetic_captioning` |
| Task 7: Image Resilience | Robustness testing via image perturbations | `src/task7_image_resilience` |

🔍 Each task folder includes a README with setup instructions, task structure, and metrics.
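
For readers new to bounding box localization (Task 5), the sketch below shows intersection-over-union (IoU), a common way to score a predicted box against a ground-truth box. It is an illustration only: the `(x_min, y_min, x_max, y_max)` box format and the `iou` function name are assumptions, and the metrics actually used for each task are documented in the task folder READMEs.

```python
from typing import Tuple

# Assumed box format (hypothetical): (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(pred: Box, gold: Box) -> float:
    """Return the intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min, iy_min = max(pred[0], gold[0]), max(pred[1], gold[1])
    ix_max, iy_max = min(pred[2], gold[2]), min(pred[3], gold[3])

    # Clamp to zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gold = (gold[2] - gold[0]) * (gold[3] - gold[1])
    union = area_pred + area_gold - inter

    # Degenerate (zero-area) inputs score 0 rather than dividing by zero.
    return inter / union if union > 0 else 0.0
```

A prediction is often counted as correct when its IoU exceeds a chosen threshold such as 0.5; the threshold is a design choice, not something this README specifies.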

🧬 Pipeline

Three-stage process:

Data Collection: Curated from global news imagery, tagged by social attributes (age, gender, race, occupation, sport)
Annotation: GPT-4o–assisted labeling + human expert verification
Evaluation: Comprehensive scoring across:
- Accuracy
- Fairness
- Robustness
- Empathy
- Faithfulness
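
To make the accuracy and fairness dimensions above more concrete, here is a minimal sketch that computes accuracy per social-attribute group and the gap between the best and worst group. It is not the repository's scoring code: the record format `(group, correct)` and the function names `accuracy_by_group` and `accuracy_gap` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy separately for each group label.

    Each record is an assumed (group, correct) pair, e.g. ("age:senior", True).
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    return {group: hits[group] / totals[group] for group in totals}


def accuracy_gap(per_group: Dict[str, float]) -> float:
    """Return the spread between the best and worst group accuracy."""
    if not per_group:
        return 0.0
    return max(per_group.values()) - min(per_group.values())


if __name__ == "__main__":
    # Toy data, not benchmark results.
    toy = [("a", True), ("a", False), ("b", True), ("b", True)]
    scores = accuracy_by_group(toy)
    print(scores, accuracy_gap(scores))
```

A gap of zero means every group is answered equally well under this simple measure; larger gaps point to groups that deserve a closer look.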

🔑 Key Insights

🔍 Bias persists, especially across gender and race
🌐 Multilingual gaps affect low-resource language performance
❤️ Empathy and ethics vary significantly by model family
🧠 Chain-of-Thought reasoning improves performance but doesn’t fully mitigate bias
🧪 Robustness tests reveal fragility to noise, occlusion, and blur
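
The three perturbation types named in the last point (noise, occlusion, blur) can be sketched with NumPy as follows. This is an illustrative sketch, not the perturbation code used in `src/task7_image_resilience`: the function names, default parameters, and the assumption that images are `uint8` arrays of shape height × width × channels are all hypothetical.

```python
import numpy as np


def add_gaussian_noise(img: np.ndarray, sigma: float = 25.0, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise and clip back to the valid 0..255 range."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def occlude(img: np.ndarray, top: int, left: int, height: int, width: int) -> np.ndarray:
    """Black out a rectangular patch; the input array is left untouched."""
    out = img.copy()
    out[top:top + height, left:left + width] = 0
    return out


def box_blur(img: np.ndarray, k: int = 3) -> np.ndarray:
    """Apply a k x k mean filter (k odd), replicating edge pixels as padding."""
    pad = k // 2
    padded = np.pad(img.astype(np.float32), ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(img.shape, dtype=np.float32)
    # Sum the k*k shifted copies of the image, then divide to get the mean.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return (out / (k * k)).round().astype(np.uint8)
```

Comparing a model's answers on the original image and on each perturbed copy gives a simple view of how much each distortion degrades its output.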

🧑🏿‍💻 Developing

Installing dependencies

The development environment is set up with uv. Make sure uv is installed, then run:

uv sync
source .venv/bin/activate

To install dependencies for testing (code style, unit tests, integration tests), run:

uv sync --dev
source .venv/bin/activate

To exclude packages from a specific group (e.g. docs), run:

uv sync --no-group docs

If you're coming from poetry, note that the virtual environment is stored in the project root folder and is named .venv by default. Also note that while poetry uses a "flat" project layout, uv opts for the "src" layout. (For more info, see here)

📚 Citation

If you use HumaniBench or this evaluation suite in your work, please cite:

        @article{raza2025humanibench,
            title={Humanibench: A human-centric framework for large multimodal models evaluation},
            author={Raza, Shaina and Narayanan, Aravind and Khazaie, Vahid Reza and Vayani, Ashmal and Radwan, Ahmed Y and Chettiar, Mukund S and Singh, Amandeep and Shah, Mubarak and Pandya, Deval},
            journal={arXiv preprint arXiv:2505.11454},
            year={2025}
        }

🙏 Acknowledgments

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (vectorinstitute.ai/#partners).

This research was funded by the European Union's Horizon Europe research and innovation programme under the AIXPERT project (Grant Agreement No. 101214389), which aims to develop an agentic, multi-layered, GenAI-powered framework for creating explainable, accountable, and transparent AI systems.

📬 Contact

For questions, collaborations, or dataset access requests, please open an issue in this repository or contact the corresponding author at shaina.raza@vectorinstitute.ai, as listed in the paper.

⚡ HumaniBench promotes trustworthy, fair, and human-centered multimodal AI.

We invite researchers, developers, and policymakers to explore, evaluate, and extend HumaniBench. 🚀
