Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs

By: Angela van Sprang, Laurens Samson, Ana Lucic, Erman Acar, Sennay Ghebreab, Yuki M. Asano

Published: 2025-12-10


Abstract

This paper addresses a critical issue: Multimodal Large Language Models (MLLMs) often produce inconsistent answers when the same information is presented through different input modalities, a key challenge for reliable real-world applications.

Impact: practical
Comments: 5
Topics: 7

💡 Simple Explanation

Imagine you show an AI a photo of a rainy street and ask, 'Is it safe to drive?' It says 'No'. Then, you write a text message to the same AI describing the photo perfectly: 'A street with heavy rain and puddles.' The AI reads it and says 'Yes, proceed with caution.' This paper studies this confusing behavior. It finds that advanced AI models often give different answers depending on whether they are looking at a picture or reading a description of that picture. This 'inconsistency' is a problem for building trustworthy AI systems.

🎯 Problem Statement

Multimodal Large Language Models (MLLMs) often lack semantic alignment between their visual and textual processing pathways: even when a model understands a concept in text, it may fail to apply the same understanding when perceiving that concept visually, leading to unpredictable and untrustworthy behavior in real-world applications.

🔬 Methodology

The authors created a benchmark dataset of image-question pairs. They then used an 'Oracle' method (high-quality captioning/OCR) to convert each image into detailed text. State-of-the-art MLLMs were queried separately with image + question and with oracle text + question, and the two outputs were compared using a Cross-Modal Consistency (CMC) metric, with an LLM judging whether the answers are semantically equivalent.
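
A minimal sketch of this evaluation loop is shown below. This is not the authors' released code: the callables, the sample-dictionary keys, and the judge interface are all assumptions made for illustration.

```python
from typing import Callable, Iterable

def cross_modal_consistency(
    samples: Iterable[dict],
    answer_from_image: Callable[[bytes, str], str],
    answer_from_text: Callable[[str, str], str],
    judge_equivalent: Callable[[str, str, str], bool],
) -> float:
    """Fraction of samples whose image-conditioned and text-conditioned
    answers an LLM judge deems semantically equivalent
    (1.0 = always consistent, 0.0 = never consistent)."""
    agree, total = 0, 0
    for s in samples:
        ans_img = answer_from_image(s["image"], s["question"])   # image + question
        ans_txt = answer_from_text(s["caption"], s["question"])  # oracle caption/OCR + question
        agree += judge_equivalent(s["question"], ans_img, ans_txt)
        total += 1
    return agree / total if total else 0.0
```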

📊 Results

The study demonstrates that even top-tier models such as GPT-4V and Gemini exhibit notable cross-modal inconsistency. In many cases, the text pathway (fed with oracle descriptions) outperforms the direct visual pathway on reasoning tasks, and CMC scores drop significantly as task complexity increases, showing that the visual encoder and the language backbone are not perfectly synchronized in their latent reasoning space.
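
To see how consistency varies with task difficulty, one can bucket per-sample judge agreements by a complexity label before averaging. The aggregation below is a hypothetical sketch; the `complexity` field and `agrees` flags are assumed inputs, not artifacts from the paper.

```python
from collections import defaultdict

def cmc_by_complexity(records: list[dict]) -> dict[str, float]:
    """Average per-sample agreement (0/1) within each complexity bucket,
    returning a mapping such as {'easy': ..., 'medium': ..., 'hard': ...}."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for r in records:
        buckets[r["complexity"]].append(int(r["agrees"]))
    return {level: sum(flags) / len(flags) for level, flags in buckets.items()}
```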

Key Takeaways

1. Do not assume MLLMs are consistent across modalities; valid text reasoning does not imply valid visual reasoning.
2. Text augmentation (supplying generated captions alongside the image) can improve reliability; see the sketch after this list.
3. New training paradigms are needed that penalize cross-modal divergence in order to build truly robust multimodal agents.
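
As an illustration of takeaway 2, a caption-augmented query might look like the following sketch. The `caption_model` and `mllm` callables are placeholders, not an API described in the paper.

```python
from typing import Callable

def caption_augmented_answer(
    image: bytes,
    question: str,
    caption_model: Callable[[bytes], str],
    mllm: Callable[[bytes, str], str],
) -> str:
    """Answer a visual question with both the image and an automatically
    generated caption in the prompt, so the language backbone can lean on
    its (often stronger) text reasoning."""
    caption = caption_model(image)  # e.g. a dense captioning or OCR pass
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Answer using both the image and the description."
    )
    return mllm(image, prompt)
```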

🔍 Critical Analysis

This paper provides a necessary reality check for the MLLM field. While recent progress has focused on capability (can it see?), this paper rightly shifts focus to reliability (can we trust what it sees?). The methodology is sound, though relying on text descriptions as a proxy for 'correct' image content is a slight limitation, as descriptions are lossy. However, the revelation that models are internally inconsistent is a strong indicator that current 'visual instruction tuning' is insufficient for true multimodal grounding.

💰 Practical Applications

  • B2B API wrapper that guarantees multimodal consistency for enterprise use.
  • Auditing tools for regulatory compliance of AI models.
  • Dataset marketplace for 'hard consistency negatives' to improve model training.

🏷️ Tags

#MLLM · #Computer Vision · #Consistency · #Hallucination · #AI Safety · #Benchmark · #Multimodal Alignment

🏢 Relevant Industries

Artificial Intelligence · Computer Vision · Automated Testing · Healthcare Technology · Autonomous Systems

📈 Engagement

Comments: 5
AI Discussions: 3