GCC AI Research

Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation

arXiv · Significant research

Summary

Researchers have developed OmniScore, a family of deterministic learned metrics for evaluating generative text, offered as an alternative to using large language models (LLMs) as judges. OmniScore relies on small models (under 1B parameters), trained on roughly 564,000 synthetic instances spanning 107 languages and evaluated against 8,617 manually annotated instances. It approximates LLM-judge behavior while offering low latency and consistent scores across evaluation settings, including reference-based and source-grounded assessment, for tasks such as question answering, translation, and summarization.

Why it matters: OmniScore provides a practical, scalable, and reproducible method for multilingual generative text evaluation, addressing key limitations of LLM-as-a-judge approaches, with significant benefits for AI development in linguistically diverse regions.


Related

GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human

arXiv

The GenAI Content Detection Task 1 is a shared task on detecting machine-generated text, featuring monolingual (English) and multilingual subtasks. The task, part of the GenAI workshop at COLING 2025, attracted 36 teams for the English subtask and 26 for the multilingual one. The organizers provide a detailed overview of the data, results, system rankings, and analysis of the submitted systems.

M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection

arXiv

MBZUAI researchers introduce M4, a multi-generator, multi-domain, and multi-lingual benchmark dataset for detecting machine-generated text. The study reveals challenges in generalizing detection across unseen domains or LLMs, with detectors often misclassifying machine-generated text as human-written. The dataset aims to foster research into more robust detection methods and is available on GitHub.

M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection

arXiv

MBZUAI researchers introduce M4GT-Bench, a new benchmark for evaluating machine-generated text (MGT) detection across multiple languages and domains. The benchmark includes tasks for binary MGT detection, identifying the specific model that generated the text, and detecting mixed human-machine text. Experiments with baseline models and human evaluation show that MGT detection performance is highly dependent on access to training data from the same domain and generators.