Redemption Score: A Multi-Modal Evaluation Framework for Image Captioning

Abstract

Evaluating image captions requires a cohesive assessment of both visual semantics and language pragmatics, which is often not fully captured by existing metrics. We introduce the Redemption Score (RS), a novel hybrid framework that ranks image captions by triangulating three complementary signals: Mutual Information Divergence (MID) for distributional alignment, DINO-based perceptual similarity for visual grounding, and LLM Text Embeddings for contextual similarity. By fusing these signals, RS offers a more holistic and nuanced evaluation of caption quality, demonstrating a superior correlation with human judgments without needing task-specific training.

Diagram of the Redemption Score framework

An overview of the Redemption Score framework, showing the triangulation of MID, DINO, and BERTScore signals.

Methodology ⚙️

The Redemption Score is a training-free framework that integrates three distinct metrics so that each offsets the others' individual biases, yielding a more robust evaluation system. The final score is a calibrated fusion of these signals.
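
To make the idea of fusing three signals concrete, here is a minimal sketch. This page does not spell out the calibration rule, so the sketch assumes something simple: each signal is min-max normalized across the candidate captions and then combined as a weighted average. The function names `min_max` and `fuse_signals` and the equal default weights are hypothetical, not the paper's actual procedure.

```python
def min_max(values):
    """Rescale a list of raw scores to [0, 1] (assumed calibration step)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All captions scored the same on this signal: use a neutral value.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse_signals(mid, dino, text, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of three normalized signals, one score per caption.

    The weights are illustrative placeholders, not values from the paper.
    """
    if not (len(mid) == len(dino) == len(text)):
        raise ValueError("each signal needs one score per caption")
    signals = [min_max(mid), min_max(dino), min_max(text)]
    total = sum(weights)
    return [
        sum(w * s[i] for w, s in zip(weights, signals)) / total
        for i in range(len(mid))
    ]


# Made-up raw scores for three candidate captions of one image.
scores = fuse_signals(
    mid=[10.0, 20.0, 30.0],
    dino=[0.2, 0.5, 0.8],
    text=[0.9, 0.7, 0.8],
)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Normalizing before averaging keeps a signal with a large numeric range (such as an unbounded divergence) from dominating signals that live in [0, 1].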

1. MID (Distributional)

Captures the global alignment between image and text distributions using Mutual Information Divergence. This helps detect statistical outliers and ensures the caption fits expected patterns.
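
As a rough illustration of the kind of quantity involved, the sketch below computes mutual information between two scalar features under a Gaussian assumption, where the closed form is I(X;Y) = -0.5 * ln(1 - rho^2). This is a deliberately simplified stand-in: MID as published operates on multivariate image and text embeddings, and nothing here reproduces that pipeline. The function names and toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def gaussian_mutual_information(xs, ys):
    """Mutual information (in nats) of a bivariate Gaussian fit to the data."""
    rho = pearson(xs, ys)
    if abs(rho) >= 1.0:
        return math.inf  # perfectly dependent features
    return -0.5 * math.log(1.0 - rho * rho)


# Toy scalar "image" and "text" features for four image-caption pairs.
image_feature = [1.0, 2.0, 3.0, 4.0]
aligned_text = [1.0, 2.0, 3.0, 5.0]      # tracks the image feature closely
unrelated_text = [1.0, -1.0, -1.0, 1.0]  # uncorrelated with it

print(gaussian_mutual_information(image_feature, aligned_text))
print(gaussian_mutual_information(image_feature, unrelated_text))
```

Higher values indicate that the two features carry more shared information, which is the intuition behind using a distributional signal to flag captions that do not fit the image.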

2. DINO (Perceptual)

Measures visual consistency by regenerating an image from the caption and comparing it to the original using the DINO vision transformer. This flags visual inaccuracies.
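
The comparison step can be sketched as follows, assuming the original and regenerated images have already been embedded by a DINO model and that the comparison is a cosine similarity between the two embeddings (the page does not name the similarity function, so that is an assumption). The short vectors below are made-up placeholders for real embeddings; the text-to-image regeneration and the embedding extraction are not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate embedding: treat as no similarity
    return dot / (norm_a * norm_b)


# Placeholder embeddings standing in for DINO features.
original_image = [0.20, 0.90, 0.10]
regen_from_accurate_caption = [0.25, 0.85, 0.05]
regen_from_inaccurate_caption = [0.90, 0.10, 0.30]

print(cosine_similarity(original_image, regen_from_accurate_caption))
print(cosine_similarity(original_image, regen_from_inaccurate_caption))
```

A caption that describes the image faithfully should regenerate something visually close to the original, so its embedding similarity should be the higher of the two.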

3. BERTScore (Linguistic)

Utilizes contextual text embeddings to assess semantic similarity between the candidate caption and human references, catching linguistic mismatches that other metrics might miss.
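
The matching logic behind a BERTScore-style comparison can be sketched as below: each candidate token is greedily matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and the two are combined into an F1. The two-dimensional token vectors are toy placeholders for contextual embeddings, and the function names are hypothetical; optional refinements such as IDF weighting and baseline rescaling are left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero token vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def greedy_match_f1(candidate_tokens, reference_tokens):
    """Precision, recall, and F1 from greedy token-embedding matching."""
    precision = sum(
        max(cosine(c, r) for r in reference_tokens) for c in candidate_tokens
    ) / len(candidate_tokens)
    recall = sum(
        max(cosine(r, c) for c in candidate_tokens) for r in reference_tokens
    ) / len(reference_tokens)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy token embeddings: the reference has two distinct "meanings".
reference = [[1.0, 0.0], [0.0, 1.0]]
full_caption = [[1.0, 0.0], [0.0, 1.0]]     # covers both reference tokens
partial_caption = [[1.0, 0.0], [1.0, 0.0]]  # repeats one, misses the other

print(greedy_match_f1(full_caption, reference))
print(greedy_match_f1(partial_caption, reference))
```

The partial caption keeps perfect precision but loses recall, which is how this kind of signal penalizes captions that omit content present in the human references.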

Key Results 🏆

On the Flickr8k benchmark, the Redemption Score demonstrates a superior correlation with human judgments compared to previous methods.

58.42%

Kendall-τ Correlation with Human Judgments

This result highlights the effectiveness of our multi-modal approach in capturing the nuances of caption quality. The framework also shows strong generalization across other datasets like MS-COCO and Conceptual Captions.
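
For readers unfamiliar with the reported statistic, the sketch below shows how a Kendall-τ correlation between metric scores and human ratings can be computed. It implements the τ-b variant, which corrects for ties; which variant the paper reports is not stated on this page, so that choice is an assumption, and the scores below are invented for illustration rather than taken from any benchmark.

```python
import math


def kendall_tau_b(xs, ys):
    """Kendall tau-b rank correlation between two equal-length sequences."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: excluded from both denominators
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    if denom == 0:
        return 0.0  # one sequence is constant: correlation undefined
    return (concordant - discordant) / denom


# Invented example: metric scores vs. human ratings for four captions.
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_ratings = [1, 2, 3, 4]
print(kendall_tau_b(metric_scores, human_ratings))
```

The statistic counts how often the metric and the human raters agree on which caption in a pair is better, so a value near 1 means the metric orders captions almost exactly as people do.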

Cite This Work

If you find this work useful in your research, please consider citing the paper.

Anonymous. (2025). Redemption Score: A Multi-Modal Evaluation Framework for Image Captioning via Distributional, Perceptual, and Linguistic Signal Triangulation. *WACV 2026 Submission*.