MDI-Benchmark

Benchmarking Real-World Personalization in
Large Multimodal Models

¹Beijing University of Posts and Telecommunications, ²Huazhong University of Science and Technology

*Equal contribution
†Corresponding author

Introduction

Current benchmarks for large multimodal models (LMMs) focus primarily on technical metrics for specific tasks, neglecting two critical research questions:

Q1: Can these LMMs truly align with the actual needs of humans in real-world scenarios?

Q2: Can these LMMs subsequently address the diverse needs of distinct groups?

To tackle these challenges, we introduce the MDI-Benchmark, which spans a range of real-world scenarios, problem complexities, and age groups. In detail, the MDI-Benchmark consists of more than 500 real-world images and 1.2k human-posed questions. It covers 6 major scenarios of human life: Architecture, Education, Housework, Social Services, Sport, and Transport.

We hope our research will advance the application of large multimodal models in real-world scenarios and pave the way for the development of multi-dimensional personalization.

The MDI-Benchmark comprises six real-world multimodal scenarios, each with three sub-domains.

Overview

The MDI-Benchmark sample design highlights the real-world complexity of information, scene variability, and age differences.

The information people care about often varies by scenario. As shown in the figure, a family buying a new house may focus on practical matters that affect them directly, such as kitchen type, garage capacity, and bedroom amenities, whereas spectators at a sports event may care about game details, player achievements, and game progress.

To address these differences, we design a unique three-dimensional hierarchical structure for the MDI-Benchmark, incorporating scenarios, age, and problem complexity. This structure provides a comprehensive evaluation framework that not only tests the model's basic problem-solving abilities but also examines its adaptability to specific human contexts and scenarios. This multi-level, multi-angle testing method aids in developing more intelligent AI systems aligned with user needs.
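The three-dimensional structure can be pictured as tagging every sample along three axes. The following is a minimal sketch of such a record, not the benchmark's actual data format: the `Sample` class, its field names, the example question, and the age-group labels are all hypothetical, while the six scenarios and the two levels are taken from this page.

```python
from collections import Counter
from dataclasses import dataclass

# The six scenarios and the two difficulty levels appear on this page.
SCENARIOS = (
    "Architecture", "Education", "Housework",
    "Social Services", "Sport", "Transport",
)
LEVELS = (1, 2)
# Assumption: illustrative age-group labels, not taken from this page.
AGE_GROUPS = ("young", "middle-aged", "old")


@dataclass(frozen=True)
class Sample:
    """One benchmark item tagged along the three dimensions (hypothetical schema)."""
    image_path: str
    question: str
    answer: str
    scenario: str
    level: int
    age_group: str

    def __post_init__(self):
        # Reject tags that fall outside the three declared dimensions.
        if self.scenario not in SCENARIOS:
            raise ValueError(f"unknown scenario: {self.scenario!r}")
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level!r}")
        if self.age_group not in AGE_GROUPS:
            raise ValueError(f"unknown age group: {self.age_group!r}")


def count_by_cell(samples):
    """Count samples in each (scenario, level, age_group) cell."""
    return Counter((s.scenario, s.level, s.age_group) for s in samples)


# Made-up items, for illustration only.
samples = [
    Sample("img/001.jpg", "How many cars fit in the garage?", "B",
           "Architecture", 1, "middle-aged"),
    Sample("img/002.jpg", "Which platform does the train leave from?", "A",
           "Transport", 2, "young"),
]
cells = count_by_cell(samples)
```

Keeping the three tags on every record makes it straightforward to slice results by any single dimension or by a combination of them.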

Architecture

It covers the scenarios that people may face when carrying out construction work in real life, in three categories: work scenarios, tool use, and drawing understanding.

Education

It covers the scenarios that people may face when receiving education in real life, in three categories: campus environment, classroom learning, and curriculum teaching.

Housework

It covers the scenarios that people may face when doing housework in real life, in three categories: household arrangements, household activities, and household appliances.

Sport

It covers the physical activities that people encounter in real life, in three categories: powerlifting, race, and ball.

Transport

It covers the travel situations that people encounter in real life, in three categories: signpost, rail transit, and airport.

Leaderboard on MDI-Benchmark (testmini)

Accuracy scores on the testmini subset (612 examples) of MDI-Benchmark.
Closed-source models are marked in red, and open-source models are marked in green.

#	Model	Source	Date	Avg(Level 1)	Arc(Level 1)	Edu(Level 1)	Hou(Level 1)	Soc(Level 1)	Spo(Level 1)	Tra(Level 1)
1	GPT-4o 🥇	Link	2024-05	87.46%	76.47%	94.12%	92.16%	90.20%	86.27%	94.12%
2	GPT-4V 🥈	Link	2024-04	87.46%	86.27%	92.16%	86.27%	90.20%	88.24%	90.20%
3	Gemini 1.5 Pro 🥉	Link	2024-05	82.32%	68.63%	92.16%	76.47%	88.24%	86.27%	90.20%
4	LLaVA-NeXT-110B	Link	2024-05	79.10%	60.78%	92.16%	78.43%	84.31%	78.43%	88.24%
5	LLaVA-NeXT-72B	Link	2024-04	76.21%	68.63%	88.24%	80.39%	82.35%	70.59%	74.51%
6	MiniCPM-LLaMA3-V 2.5	Link	2024-05	72.67%	52.94%	86.27%	70.59%	82.35%	70.59%	80.39%
7	DeepSeek-VL-7B	Link	2024-03	68.49%	49.02%	70.59%	74.51%	80.39%	62.75%	80.39%
8	Phi3-Vision-4.2B	Link	2024-05	67.20%	50.98%	76.47%	60.78%	80.39%	62.75%	78.43%
9	mPLUG-Owl2-7B	Link	2023-10	64.63%	49.02%	70.59%	74.51%	70.59%	58.82%	70.59%
10	CogVLM-chat	Link	2023-11	60.77%	49.02%	72.55%	62.75%	56.86%	68.63%	60.78%
11	DeepSeek-VL-1.3B	Link	2024-03	58.20%	45.10%	56.86%	66.67%	56.86%	66.67%	62.75%
12	Qwen-VL-Plus	Link	2024-01	56.59%	43.14%	64.71%	62.75%	78.43%	50.98%	45.10%
13	CogAgent-vqa	Link	2023-12	49.52%	35.29%	45.10%	66.67%	54.90%	56.86%	43.14%
14	LLaVA-NeXT-7B	Link	2024-03	43.09%	31.37%	52.94%	43.14%	49.02%	39.22%	47.06%

#	Model	Source	Date	Avg(Level 2)	Arc(Level 2)	Edu(Level 2)	Hou(Level 2)	Soc(Level 2)	Spo(Level 2)	Tra(Level 2)
1	GPT-4o 🥇	Link	2024-05	69.45%	70.59%	70.59%	78.43%	82.35%	54.90%	66.67%
2	GPT-4V 🥈	Link	2024-04	62.38%	72.55%	70.59%	74.51%	60.78%	45.10%	56.86%
3	Gemini 1.5 Pro 🥉	Link	2024-05	55.95%	52.94%	56.86%	54.90%	74.51%	43.14%	58.82%
4	LLaVA-NeXT-110B	Link	2024-05	52.09%	66.67%	56.86%	54.90%	64.71%	31.37%	43.14%
5	LLaVA-NeXT-72B	Link	2024-04	51.13%	66.67%	54.90%	52.94%	60.78%	33.33%	43.14%
6	mPLUG-Owl2-7B	Link	2023-10	40.51%	41.18%	41.18%	47.06%	39.22%	29.41%	49.02%
7	MiniCPM-LLaMA3-V 2.5	Link	2024-05	39.23%	45.10%	49.02%	49.02%	31.37%	27.45%	37.25%
8	CogVLM-chat	Link	2023-11	38.91%	49.02%	33.33%	43.14%	41.18%	27.45%	43.14%
9	DeepSeek-VL-7B	Link	2024-03	35.69%	41.18%	33.33%	39.22%	41.18%	21.57%	41.18%
10	Phi3-Vision-4.2B	Link	2024-05	34.41%	37.25%	33.33%	41.18%	43.14%	21.57%	33.33%
11	DeepSeek-VL-1.3B	Link	2024-03	34.41%	35.29%	29.41%	29.41%	39.22%	27.45%	49.02%
12	CogAgent-vqa	Link	2023-12	32.80%	31.37%	35.29%	35.29%	37.25%	25.49%	35.29%
13	Qwen-VL-Plus	Link	2024-01	30.55%	35.29%	41.18%	37.25%	25.49%	23.53%	23.53%
14	LLaVA-NeXT-7B	Link	2024-03	24.12%	35.29%	13.73%	37.25%	23.53%	9.80%	27.45%

🚨 To submit your results to the leaderboard, please send your result JSON files to this email address.
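The leaderboard reports accuracy per scenario at each level. Below is a minimal sketch of how such scores might be computed from a list of prediction records. The record schema (`scenario`, `level`, `prediction`, `answer`) and the function name `accuracy_by_scenario` are assumptions made for illustration and are not the official submission format.

```python
import json
from collections import defaultdict


def accuracy_by_scenario(records, level):
    """Return {scenario: accuracy in percent} for one difficulty level."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        if r["level"] != level:
            continue
        total[r["scenario"]] += 1
        # Compare answers case-insensitively, ignoring surrounding whitespace.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["scenario"]] += 1
    return {s: 100.0 * correct[s] / total[s] for s in total}


# Made-up records in the hypothetical schema.
records = json.loads("""[
  {"scenario": "Sport", "level": 1, "prediction": "A", "answer": "A"},
  {"scenario": "Sport", "level": 1, "prediction": "b", "answer": "B"},
  {"scenario": "Sport", "level": 1, "prediction": "C", "answer": "D"},
  {"scenario": "Transport", "level": 1, "prediction": "A", "answer": "A"},
  {"scenario": "Transport", "level": 2, "prediction": "A", "answer": "C"}
]""")

scores = accuracy_by_scenario(records, level=1)
for scenario, acc in sorted(scores.items()):
    print(f"{scenario}: {acc:.2f}%")
```

Scenarios with no records at the requested level are simply absent from the result, which avoids a division by zero.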

Results on Existing Foundation Models

The average performance of different LMMs on different difficulty levels of the MDI-Benchmark.

The average accuracy and variance of LMMs across six domains at Level 1.

The average accuracy and variance of LMMs across six domains at Level 2.
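As a rough illustration of these two statistics, the sketch below computes the unweighted mean and the variance of the six Level 1 domain scores for two models from the leaderboard. It assumes the variance is taken across domains for each model and uses the population variance. Both are assumptions, and this unweighted mean is not necessarily the same quantity as the Avg column.

```python
from statistics import mean, pvariance

# Level 1 per-domain accuracies (%), in the order Arc, Edu, Hou, Soc, Spo, Tra,
# copied from the leaderboard on this page.
level1 = {
    "GPT-4o": [76.47, 94.12, 92.16, 90.20, 86.27, 94.12],
    "LLaVA-NeXT-7B": [31.37, 52.94, 43.14, 49.02, 39.22, 47.06],
}

# Assumption: population variance across the six domains of each model.
summary = {model: (mean(v), pvariance(v)) for model, v in level1.items()}

for model, (avg, var) in summary.items():
    print(f"{model}: mean={avg:.2f}, variance={var:.2f}")
```

A low variance indicates that a model performs evenly across domains, while a high variance points to domain-specific weaknesses.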

Performance of different LMMs across the age dimension.

Question Example

Example of a correct GPT-4o answer in the Architecture scenario.

Example of a correct GPT-4o answer in the Education scenario.

Example of a correct GPT-4o answer in the Housework scenario.

Example of a correct GPT-4o answer in the Social Services scenario.

Example of a correct GPT-4o answer in the Sport scenario.

Example of a correct GPT-4o answer in the Transport scenario.

BibTeX

@article{zhang2024multi,
  title={Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models},
  author={Zhang, YiFan and Lei, Shanglin and Qiao, Runqi and GongQue, Zhuoma and Song, Xiaoshuai and Dong, Guanting and Tan, Qiuna and Wei, Zhe and Yang, Peiqing and Tian, Ye and others},
  journal={arXiv preprint arXiv:2412.12606},
  year={2024}
}
