GCC AI Research

ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models

arXiv · Significant research

Summary

Researchers introduce ArabLegalEval, a multitask benchmark dataset for assessing Arabic legal knowledge in LLMs. The dataset contains tasks sourced from Saudi legal documents and synthesized questions, drawing inspiration from MMLU and LegalBench. Experiments benchmarked models including GPT-4 and Jais, exploring in-context learning and various evaluation methods. Why it matters: This resource should help accelerate AI research and evaluation in the Arabic legal domain, where datasets are lacking.

Keywords

Arabic · legal · LLM · benchmark · dataset


Related

ALARB: An Arabic Legal Argument Reasoning Benchmark

arXiv

Researchers introduce ALARB, a new benchmark for evaluating reasoning in Arabic LLMs, built from 13K Saudi commercial court cases. The benchmark includes tasks such as verdict prediction, reasoning-chain completion, and identification of relevant regulations. Instruction-tuning a 12B-parameter model on ALARB yields verdict-prediction and generation performance comparable to GPT-4o.