Middle East AI

Evaluation

2 articles

How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

arXiv · CV, LLM

Researchers from MBZUAI have introduced the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) for assessing Video-LMMs. The benchmark evaluates models across 11 real-world video dimensions, revealing weaknesses in robustness and reasoning, particularly in open-source models. The authors also propose a training-free Dual-Step Contextual Prompting (DSCP) technique to improve Video-LMM performance, and have made the dataset and code publicly available.
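To make the two-step idea concrete, here is a minimal sketch of a dual-step contextual prompting flow in the spirit of DSCP. It assumes a hypothetical `query_video_lmm` inference call, and the prompt wording is illustrative rather than the authors' exact prompts:

```python
# Hypothetical two-step contextual prompting sketch (not the paper's exact
# prompts). `query_video_lmm` is a placeholder for whatever inference call
# your Video-LMM exposes (HTTP endpoint, local pipeline, etc.).

def query_video_lmm(video_path: str, prompt: str) -> str:
    """Placeholder: call your Video-LMM here and return its text response."""
    raise NotImplementedError

def dual_step_answer(video_path: str, question: str) -> str:
    # Step 1: elicit grounded context about the video itself.
    context = query_video_lmm(
        video_path,
        "Describe the key events, objects, and actions in this video, "
        "noting anything unusual or ambiguous.",
    )
    # Step 2: answer the user question conditioned on the elicited context.
    return query_video_lmm(
        video_path,
        f"Video context: {context}\n\n"
        f"Using only the context above and the video, answer: {question}\n"
        "If the question contains a false premise, point it out.",
    )
```

The point of the first step is to force the model to commit to what it actually saw before it reasons about the question, which is one plausible way a training-free prompting scheme can improve robustness to misleading or confusing questions.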

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

arXiv · LLM, Research

Researchers from the National Center for AI in Saudi Arabia investigated how sensitive Large Language Model (LLM) leaderboards are to minor benchmark perturbations. They found that small changes, such as reordering answer choices, can shift model rankings by up to eight positions. The study recommends hybrid scoring methods, warns against over-reliance on simple benchmark evaluations, and provides code for further research.
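As a rough illustration of the kind of perturbation the study describes, the sketch below shuffles the answer options of a multiple-choice item and remaps the gold label so the same model can be re-scored under each ordering. The `MCQItem` structure and `shuffled` helper are assumptions for illustration, not the paper's code:

```python
# Illustrative choice-order perturbation for multiple-choice benchmarks.
# Re-scoring a model on several shuffled variants of each item, then
# comparing leaderboard ranks, exposes ordering sensitivity.

import random
from dataclasses import dataclass

@dataclass
class MCQItem:
    question: str
    choices: list[str]   # e.g., ["Paris", "Rome", "Berlin", "Madrid"]
    answer_idx: int      # index of the correct choice

def shuffled(item: MCQItem, seed: int) -> MCQItem:
    """Return a copy of the item with choices permuted and the gold label remapped."""
    rng = random.Random(seed)
    order = list(range(len(item.choices)))
    rng.shuffle(order)
    return MCQItem(
        question=item.question,
        choices=[item.choices[i] for i in order],
        answer_idx=order.index(item.answer_idx),  # remap gold label
    )
```

Evaluating each model on the original items and on a few seeded shuffles, then comparing the resulting rankings, is one simple way to reproduce the ranking-instability effect the paper reports.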