A Multi-Modal Evaluation Framework for Image Captioning via Distributional, Perceptual, and Linguistic Signal Triangulation
Evaluating image captions requires a cohesive assessment of both visual semantics and language pragmatics, which is often not fully captured by existing metrics. We introduce the Redemption Score (RS), a novel hybrid framework that ranks image captions by triangulating three complementary signals: Mutual Information Divergence (MID) for distributional alignment, DINO-based perceptual similarity for visual grounding, and LLM Text Embeddings for contextual similarity. By fusing these signals, RS offers a more holistic and nuanced evaluation of caption quality, demonstrating a superior correlation with human judgments without needing task-specific training.
The Redemption Score is a training-free framework that integrates three distinct metrics to overcome their individual biases and create a more robust evaluation system. The final score is a calibrated fusion of these signals.
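To make the idea of fusing signals concrete, here is a minimal sketch of one possible fusion: a weighted average of the three per-caption signals. The weights, the signal names (`mid`, `dino`, `text`), and the assumption that each signal has already been rescaled to [0, 1] are illustrative choices, not the calibration used in the paper, which is not described on this page.

```python
from typing import Dict, List

# Hypothetical weights for illustration only; the page does not state how
# the fusion is calibrated.
ASSUMED_WEIGHTS: Dict[str, float] = {"mid": 0.4, "dino": 0.3, "text": 0.3}


def fuse_signals(
    signals: Dict[str, float],
    weights: Dict[str, float] = ASSUMED_WEIGHTS,
) -> float:
    """Weighted average of per-signal scores assumed to lie in [0, 1]."""
    missing = set(weights) - set(signals)
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * signals[name] for name in weights) / total


def rank_captions(scored: Dict[str, Dict[str, float]]) -> List[str]:
    """Return caption ids ordered from highest to lowest fused score."""
    return sorted(scored, key=lambda cid: fuse_signals(scored[cid]), reverse=True)
```

A weighted average keeps any single signal from dominating the ranking, which matches the stated goal of offsetting each metric's individual bias. Other fusion rules would fit the same interface.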
Mutual Information Divergence (MID) captures the global alignment between image and text distributions. This helps detect statistical outliers and checks that the caption fits expected patterns.
DINO-based perceptual similarity measures visual consistency by regenerating an image from the caption and comparing it to the original using the DINO vision transformer. This flags visual inaccuracies.
LLM text embeddings assess semantic similarity between the candidate caption and human references, catching linguistic mismatches that other metrics might miss.
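Both the perceptual and the linguistic signal come down to comparing embedding vectors: image embeddings for the original and regenerated images, and text embeddings for the candidate and reference captions. The sketch below shows that comparison with cosine similarity. The choice of cosine similarity and of taking the maximum over references are assumptions made for illustration, and producing the embeddings themselves (from DINO or a text embedding model) is outside the sketch.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for a zero vector, which has no direction
    return dot / (norm_a * norm_b)


def best_reference_similarity(
    candidate: Sequence[float],
    references: Sequence[Sequence[float]],
) -> float:
    """Highest similarity between a candidate embedding and any reference.

    Taking the maximum is an assumed aggregation rule; a mean over
    references would be an equally plausible choice.
    """
    if not references:
        raise ValueError("at least one reference embedding is required")
    return max(cosine_similarity(candidate, ref) for ref in references)
```

For the perceptual signal, `cosine_similarity` would be called once on the embeddings of the original and regenerated images. For the linguistic signal, `best_reference_similarity` handles the case of several human references per image.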
On the Flickr8k benchmark, the Redemption Score correlates more strongly with human judgments than previous methods.
This result highlights the effectiveness of our multi-modal approach in capturing the nuances of caption quality. The framework also shows strong generalization across other datasets like MS-COCO and Conceptual Captions.
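Agreement between a metric and human judgments is usually summarized with a rank correlation. This page does not name the statistic used, so the sketch below assumes Kendall's tau-b, a common choice for caption-metric evaluation, and implements it directly in a few lines. The function name and any scores passed to it are hypothetical.

```python
import math
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two paired score lists (handles ties)."""
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: contributes to neither count
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1  # the pair is ordered the same way in both
            else:
                discordant += 1  # the pair is ordered in opposite ways
    pairs = concordant + discordant
    denom = math.sqrt((pairs + tied_x_only) * (pairs + tied_y_only))
    if denom == 0:
        return 0.0  # undefined when one list is constant; report 0 by convention
    return (concordant - discordant) / denom
```

Passing a list of metric scores and the matching list of human ratings gives a value in [-1, 1], where 1 means the metric orders captions exactly as the human raters do. The pairwise loop is quadratic in the number of captions, which is fine for a sketch but slow for large benchmarks.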
If you find this work useful in your research, please consider citing the paper.
Anonymous. (2025). Redemption Score: A Multi-Modal Evaluation Framework for Image Captioning via Distributional, Perceptual, and Linguistic Signal Triangulation. *WACV 2026 Submission*.