This paper introduces a predictive analysis of Arabic court decisions, utilizing 10,813 real commercial court cases. The study evaluates LLaMA-7b, JAIS-13b, and GPT3.5-turbo models under zero-shot, one-shot, and fine-tuned training paradigms, also experimenting with summarization and translation. GPT-3.5 models significantly outperformed others, exceeding JAIS model performance by 50%, while also demonstrating the unreliability of most automated metrics. Why it matters: This research bridges computational linguistics and Arabic legal analytics, offering insights for enhancing judicial processes and legal strategies in the Arabic-speaking world.
The paper introduces a web-based expert system called RCSES for civil service regulations in Saudi Arabia. The system covers 17 regulations and utilizes XML for knowledge representation and ASP.net for rule-based inference. RCSES was validated by domain experts and technical users, and compared favorably to other web-based expert systems.
Researchers introduce ALARB, a new benchmark for evaluating reasoning in Arabic LLMs using 13K Saudi commercial court cases. The benchmark includes tasks like verdict prediction, reasoning chain completion, and identification of relevant regulations. Instruction-tuning a 12B parameter model on ALARB achieves performance comparable to GPT-4o in verdict prediction and generation.
The paper introduces a benchmark of 1,000 multiple-choice questions to evaluate LLMs on Islamic inheritance law ('ilm al-mawarith). Seven LLMs were tested, with o3 and Gemini 2.5 achieving over 90% accuracy, while ALLaM, Fanar, LLaMA, and Mistral scored below 50%. Error analysis revealed limitations in handling structured legal reasoning. Why it matters: This research highlights the challenges and opportunities for adapting LLMs to complex, culturally-specific legal domains like Islamic jurisprudence.
Researchers introduce ArabLegalEval, a multitask benchmark dataset for assessing Arabic legal knowledge in LLMs. The dataset contains tasks sourced from Saudi legal documents and synthesized questions, drawing inspiration from MMLU and LegalBench. Experiments benchmarked models including GPT-4 and Jais, exploring in-context learning and various evaluation methods. Why it matters: This resource should help accelerate AI research and evaluation in the Arabic legal domain, where datasets are lacking.