Middle East AI

Benchmark

6 articles

SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models

arXiv · LLM Research

MBZUAI researchers introduce SocialMaze, a new benchmark for evaluating social reasoning capabilities in large language models (LLMs). SocialMaze comprises six diverse tasks spanning social reasoning games, daily-life interactions, and digital community platforms, emphasizing deep reasoning, dynamic interaction, and information uncertainty. Experiments show that LLMs handle dynamic interaction with varying success, degrade under uncertainty, and improve when fine-tuned on curated reasoning examples.

Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs

arXiv · NLP LLM

MBZUAI researchers release 'Fann or Flop', a new benchmark for evaluating Arabic poetry understanding in LLMs. The benchmark covers 12 historical eras and 14 poetic genres, assessing semantic understanding, metaphor interpretation, and cultural context. Evaluation of state-of-the-art LLMs reveals challenges in poetic understanding despite strong performance on standard Arabic benchmarks.

LLM-BabyBench: Understanding and Evaluating Grounded Planning and Reasoning in LLMs

arXiv · LLM Research

MBZUAI researchers introduce LLM-BabyBench, a benchmark suite for evaluating grounded planning and reasoning in LLMs. The suite, built on a textual adaptation of the BabyAI grid world, assesses LLMs on predicting action consequences, generating action sequences, and decomposing instructions. Datasets, evaluation harness, and metrics are publicly available to facilitate reproducible assessment.
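The "predict the action consequence" task can be pictured with a toy grid world. All names and formats below are hypothetical illustrations, not the released LLM-BabyBench harness:

```python
# Toy sketch of an action-consequence task in a BabyAI-style grid world.
# A model is given a state and an action and must predict the successor
# state; this function computes the ground truth for scoring.

DIRS = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}
TURN_LEFT = {"north": "west", "west": "south", "south": "east", "east": "north"}

def step(state, action, walls):
    """Return the successor state ((x, y), facing) for one agent action."""
    (x, y), facing = state
    if action == "left":
        return (x, y), TURN_LEFT[facing]
    if action == "right":
        # turning right equals turning left three times
        f = facing
        for _ in range(3):
            f = TURN_LEFT[f]
        return (x, y), f
    if action == "forward":
        dx, dy = DIRS[facing]
        target = (x + dx, y + dy)
        # a blocked move leaves the agent in place
        return (target if target not in walls else (x, y)), facing
    raise ValueError(f"unknown action: {action}")
```

For example, `step(((1, 1), "east"), "forward", walls={(2, 1)})` leaves the agent at `(1, 1)` because the target cell is a wall; an LLM's predicted successor state can be compared against this ground truth for exact-match scoring.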

M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection

arXiv · NLP LLM

MBZUAI researchers introduce M4GT-Bench, a new benchmark for evaluating machine-generated text (MGT) detection across multiple languages and domains. The benchmark includes tasks for binary MGT detection, identifying the specific model that generated the text, and detecting mixed human-machine text. Experiments with baseline models and human evaluation show that MGT detection performance is highly dependent on access to training data from the same domain and generators.
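The binary MGT detection task boils down to labeling each text as human- or machine-written and scoring the predictions. The toy detector and data below are hypothetical, sketched only to show the evaluation shape; M4GT-Bench supplies real multilingual, multi-domain corpora and much stronger baselines:

```python
# Minimal sketch of scoring a binary machine-generated-text detector.

def toy_detector(text: str) -> str:
    """Flag text as 'machine' if its vocabulary is unusually repetitive,
    measured by the type-token ratio (unique words / total words)."""
    words = text.lower().split()
    ttr = len(set(words)) / max(len(words), 1)
    return "machine" if ttr < 0.5 else "human"

def accuracy(examples):
    """examples: list of (text, gold_label) pairs."""
    hits = sum(toy_detector(text) == gold for text, gold in examples)
    return hits / len(examples)

samples = [
    ("the model said the model said the model said", "machine"),
    ("a quick brown fox jumps over the lazy dog", "human"),
]
```

A surface heuristic like this illustrates why the benchmark's finding matters: detectors keyed to one domain's or generator's quirks tend to break when the test domain or generating model changes.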

Universal Adversarial Examples in Remote Sensing: Methodology and Benchmark

arXiv · CV Research

This paper introduces a novel black-box adversarial attack method, Mixup-Attack, to generate universal adversarial examples for remote sensing data. The method identifies common vulnerabilities in neural networks by attacking features in the shallow layer of a surrogate model. The authors also present UAE-RS, the first dataset of black-box adversarial samples in remote sensing, to benchmark the robustness of deep learning models against adversarial attacks.
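The core idea of a universal perturbation can be sketched with a linear stand-in for the surrogate's shallow layer. This is a loose, hypothetical analogy, not the Mixup-Attack algorithm: for a linear layer, the gradient of the response with respect to the input equals the weight vector for every input, so a single sign-step perturbation shifts the feature response of the entire batch at once:

```python
# Toy sketch of a universal (input-agnostic) perturbation against a
# linear "shallow layer". Real attacks like Mixup-Attack operate on
# deep CNN features of remote sensing images; the weights, inputs,
# and budget here are hypothetical.

def universal_perturbation(weights, eps):
    """FGSM-style sign step under an L-infinity budget eps. For a linear
    layer the gradient is the weight vector regardless of the input,
    so one perturbation transfers across the whole batch."""
    return [eps if w >= 0 else -eps for w in weights]

def response(weights, x):
    """Shallow-layer activation of the toy surrogate: sum(w_i * x_i)."""
    return sum(w * xi for w, xi in zip(weights, x))
```

Adding the same perturbation to any input shifts its response by `eps * sum(|w_i|)`, which mirrors why a single crafted perturbation can fool many samples, and why black-box transfer from a surrogate model is possible when victims share similar shallow features.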