Multi-Dimensional Insights

Benchmarking Real-World Personalization in Large Multimodal Models

1Beijing University of Posts and Telecommunications, 2Huazhong University of Science and Technology

*Equal contribution
†Corresponding author

An overview of the six scenarios of MDI-Benchmark in human life, each of which contains the needs of different age groups.

MDI-Benchmark


Introduction

Current benchmarks focus primarily on technical metrics for specific tasks, neglecting two critical research questions:

Q1: Can these LMMs truly align with the actual needs of humans in real-world scenarios?

Q2: Can these LMMs subsequently address the diverse needs of distinct groups?

To tackle these challenges, we introduce the novel MDI-Benchmark, which encompasses various real-world scenarios, different problem complexities, and diverse age groups. In detail, the MDI-Benchmark consists of more than 500 real-world images and 1.2k human-posed questions. It covers 6 major scenarios of human life: Architecture, Education, Housework, Social Service, Sport, and Transport.
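To make this composition concrete, the minimal Python sketch below shows one way a single benchmark entry could be represented. The field names (image_path, question, answer, scenario, sub_domain, age_group, level) and the example values are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass

# Hypothetical sketch of one MDI-Benchmark entry; the real field names and
# file layout may differ from the released data.
@dataclass
class MDISample:
    image_path: str   # path to one of the 500+ real-world images
    question: str     # one of the 1.2k human-posed questions
    answer: str       # ground-truth answer used for accuracy scoring
    scenario: str     # one of: Architecture, Education, Housework,
                      # Social Service, Sport, Transport
    sub_domain: str   # one of the three sub-domains of the scenario
    age_group: str    # age group the question is associated with
    level: int        # problem-complexity level (e.g., 1 or 2)

# Example entry (values are made up for illustration).
sample = MDISample(
    image_path="images/sport/0001.jpg",
    question="Which team is currently leading the match?",
    answer="The home team",
    scenario="Sport",
    sub_domain="ball",
    age_group="young",
    level=1,
)
```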

We hope our research will advance the application of multimodal large models in real-world scenarios and pave the way for the development of multi-dimensional customization.

The MDI-Benchmark comprises six real-world multimodal scenarios, each containing three sub-domains.


Real-world Scenarios


Overview

The MDI-Benchmark's sample design highlights the complexity of real-world information, scene variability, and age differences.

People's information concerns often vary by scenario. As shown in the figure, a family buying a new house may focus on practical issues that closely affect them, such as kitchen type, garage capacity, and bedroom amenities, while spectators at sports events may concern themselves with game details, player achievements, and game progress.

To address these differences, we design a unique three-dimensional hierarchical structure for the MDI-Benchmark, incorporating scenarios, age, and problem complexity. This structure provides a comprehensive evaluation framework that not only tests the model's basic problem-solving abilities but also examines its adaptability to specific human contexts and scenarios. This multi-level, multi-angle testing method aids in developing more intelligent AI systems aligned with user needs.
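As a minimal sketch of how this three-dimensional structure could be exercised at evaluation time, the snippet below buckets samples by scenario, age group, and complexity level. It reuses the hypothetical MDISample record from the sketch in the Introduction; the grouping itself is an illustration, not the authors' evaluation code.

```python
from collections import defaultdict
from typing import Iterable

def group_by_dimensions(samples: Iterable[MDISample]) -> dict:
    """Bucket samples along the three MDI-Benchmark axes:
    (scenario, age_group, level). Purely illustrative."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[(s.scenario, s.age_group, s.level)].append(s)
    return buckets

# A model can then be scored per bucket, exposing not only overall accuracy
# but also how well it adapts to each scenario, age group, and difficulty.
buckets = group_by_dimensions([sample])
for key, items in buckets.items():
    print(key, len(items))
```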


Architecture

It covers the different scenarios that humans may face when completing construction activities in real life, including the following three categories: work scenarios, tool use, and drawing understanding.

Education

It covers the different scenarios that humans may face when engaging in educational activities in real life, including the following three categories: campus environment, classroom learning, and curriculum teaching.

Housework

It covers the different scenarios that humans may face when completing housework in real life, including the following three categories: household arrangements, household activities, and household appliances.

Social Service

It covers the social service scenarios that humans may encounter in real life, including the following three categories: travel, shopping, and communal facilities.

Sport

It covers the physical activities that humans encounter in real life, including the following three categories: powerlifting, race, and ball.

Transport

It covers the travel situations that humans face in real life, including the following three categories: signpost, rail transit, and airport.


Experiment Results

Leaderboard on MDI-Benchmark (testmini)

Accuracy scores on the testmini subset (612 examples) of the MDI-Benchmark. All columns report Level 1 accuracy.
Models marked in red are closed-source; models marked in green are open-source.

| # | Model | Source | Date | Avg | Arc | Edu | Hou | Soc | Spo | Tra |
|---|-------|--------|------|-----|-----|-----|-----|-----|-----|-----|
| 1 | GPT-4o 🥇 | Link | 2024-05 | 87.46% | 76.47% | 94.12% | 92.16% | 90.20% | 86.27% | 94.12% |
| 2 | GPT-4V 🥈 | Link | 2024-04 | 87.46% | 86.27% | 92.16% | 86.27% | 90.20% | 88.24% | 90.20% |
| 3 | Gemini 1.5 Pro 🥉 | Link | 2024-05 | 82.32% | 68.63% | 92.16% | 76.47% | 88.24% | 86.27% | 90.20% |
| 4 | LLaVA-NeXT-110B | Link | 2024-05 | 79.10% | 60.78% | 92.16% | 78.43% | 84.31% | 78.43% | 88.24% |
| 5 | LLaVA-NeXT-72B | Link | 2024-04 | 76.21% | 68.63% | 88.24% | 80.39% | 82.35% | 70.59% | 74.51% |
| 6 | MiniCPM-LLaMA3-V 2.5 | Link | 2024-05 | 72.67% | 52.94% | 86.27% | 70.59% | 82.35% | 70.59% | 80.39% |
| 7 | DeepSeek-VL-7B | Link | 2024-03 | 68.49% | 49.02% | 70.59% | 74.51% | 80.39% | 62.75% | 80.39% |
| 8 | Phi3-Vision-4.2B | Link | 2024-05 | 67.20% | 50.98% | 76.47% | 60.78% | 80.39% | 62.75% | 78.43% |
| 9 | mPLUG-Owl2-7B | Link | 2023-10 | 64.63% | 49.02% | 70.59% | 74.51% | 70.59% | 58.82% | 70.59% |
| 10 | CogVLM-chat | Link | 2023-11 | 60.77% | 49.02% | 72.55% | 62.75% | 56.86% | 68.63% | 60.78% |
| 11 | DeepSeek-VL-1.3B | Link | 2024-03 | 58.20% | 45.10% | 56.86% | 66.67% | 56.86% | 66.67% | 62.75% |
| 12 | Qwen-VL-Plus | Link | 2024-01 | 56.59% | 43.14% | 64.71% | 62.75% | 78.43% | 50.98% | 45.10% |
| 13 | CogAgent-vqa | Link | 2023-12 | 49.52% | 35.29% | 45.10% | 66.67% | 54.90% | 56.86% | 43.14% |
| 14 | LLaVA-NeXT-7B | Link | 2024-03 | 43.09% | 31.37% | 52.94% | 43.14% | 49.02% | 39.22% | 47.06% |
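For illustration, the sketch below shows one way the per-scenario percentages above could be reproduced from model outputs, again reusing the hypothetical MDISample record from the Introduction. Keying predictions by question text and exact-match answer scoring are assumptions; the official evaluation may use a different matching rule.

```python
from collections import defaultdict

def per_scenario_accuracy(samples, predictions):
    """Compute Level 1 accuracy per scenario and overall.

    samples: list of MDISample (Level 1 only; caller filters by level)
    predictions: dict mapping a sample's question text to the model's answer
                 (keying by question text is an assumption for this sketch).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        pred = predictions.get(s.question, "")
        total[s.scenario] += 1
        # Simple exact-match scoring; the official evaluation may differ.
        if pred.strip().lower() == s.answer.strip().lower():
            correct[s.scenario] += 1
    scores = {sc: correct[sc] / total[sc] for sc in total}
    scores["Avg"] = sum(correct.values()) / max(sum(total.values()), 1)
    return scores
```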

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

Results on Existing Foundation Models

Example

Question Example

BibTeX

@misc{zhang2024multidimensionalinsightsbenchmarkingrealworld,
  title={Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models},
  author={YiFan Zhang and Shanglin Lei and Runqi Qiao and Zhuoma GongQue and Xiaoshuai Song and Guanting Dong and Qiuna Tan and Zhe Wei and Peiqing Yang and Ye Tian and Yadong Xue and Xiaofei Wang and Honggang Zhang},
  year={2024},
  eprint={2412.12606},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2412.12606}
}