SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation

arXiv · September 4, 2025 · Significant research

Summary

Researchers from MBZUAI have introduced SPECS, a reference-free evaluation metric for long image captions that modifies CLIP to emphasize specificity. SPECS aims to improve correlation with human judgment while remaining more computationally efficient than LLM-based metrics. It is intended for iterative use during the development of image captioning models, offering a practical alternative to existing methods.
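For context, the sketch below shows the kind of reference-free score that a CLIP-based metric starts from: the widely used CLIPScore formulation, which rescales the clipped cosine similarity between an image embedding and a caption embedding. It is not the SPECS method itself, whose specificity-oriented changes are not detailed in this summary. The embeddings are hand-written toy vectors standing in for real CLIP encoder outputs, and the function names are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def clip_score(image_embedding, caption_embedding, weight=2.5):
    """Baseline reference-free CLIPScore: weight * max(cos(image, caption), 0).

    No reference captions are needed; only the image and the candidate
    caption are compared. weight=2.5 is the rescaling constant used by
    the original CLIPScore metric.
    """
    similarity = cosine_similarity(image_embedding, caption_embedding)
    return weight * max(similarity, 0.0)


# Toy embeddings (assumptions for illustration, not real CLIP outputs).
image = [0.6, 0.8, 0.0]
matching_caption = [0.6, 0.8, 0.1]     # points almost the same way as the image
opposite_caption = [-0.6, -0.8, 0.0]   # points the opposite way

matching_score = clip_score(image, matching_caption)
opposite_score = clip_score(image, opposite_caption)
print(matching_score > opposite_score)
```

Because the score is a single similarity computation per image-caption pair, it is cheap enough to run repeatedly during model development, which is the efficiency argument the summary makes against LLM-based metrics.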

Keywords

image captioning · evaluation metric · CLIP · specificity · SPECS

Related

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv · Dec 18

LongShOTBench is a new benchmark for evaluating multimodal reasoning and tool use in long videos, built around open-ended questions and diagnostic rubrics. It addresses the limitations of existing datasets by combining temporal length with multimodal richness in human-validated samples. The authors also present LongShOTAgent, an agentic system for analyzing long videos; together, the benchmark and the agent highlight the challenges faced by state-of-the-art MLLMs.
