GCC AI Research

Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith

arXiv · Significant research

Summary

Researchers developed a retrieval-augmented generation (RAG) framework that helps Arabic large language models (LLMs) interpret historically and religiously complex texts such as the Quran and Hadith. The framework grounds the models in the Doha Historical Dictionary of Arabic (DHDA) using hybrid retrieval and intent-based routing. It raised the accuracy of Arabic-native LLMs such as Fanar and ALLaM to over 85%, closing the performance gap with proprietary models like Gemini. Why it matters: The work shows the value of integrating diachronic lexicographic resources into RAG systems, offering a way to improve Arabic NLP on historically nuanced texts.
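
The summary does not say how the lexical and dense results are combined in the hybrid retrieval step. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), one common way to merge two ranked lists. The function name, the entry IDs, and the constant `k=60` are assumptions for this example, not details from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of entry IDs into one ranking.

    Each entry receives 1 / (k + rank) from every list it appears in;
    k=60 is a conventional default, assumed here for illustration.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, entry_id in enumerate(ranking, start=1):
            scores[entry_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda e: (-scores[e], e))


# Hypothetical dictionary entry IDs returned by two separate retrievers.
lexical_hits = ["entry_12", "entry_07", "entry_31"]  # e.g. keyword match
dense_hits = ["entry_07", "entry_44", "entry_12"]    # e.g. embedding match

fused = reciprocal_rank_fusion([lexical_hits, dense_hits])
print(fused)
```

Entries that both retrievers rank highly rise to the top, which is the usual motivation for a hybrid setup: lexical matching catches exact word forms, while dense matching catches related senses.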

Related

QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning

arXiv

The QU-NLP team presented their approach to the QIAS 2025 shared task on Islamic Inheritance Reasoning: they fine-tuned the Fanar-1-9B model with LoRA and integrated it into a RAG pipeline. The system achieved an accuracy of 0.858 on the final test, outperforming models like GPT 4.5, LLaMA, and Mistral in zero-shot settings. It was strongest on advanced reasoning, where it reached 97.6% accuracy. Why it matters: This demonstrates the effectiveness of domain-specific fine-tuning and retrieval augmentation for Arabic LLMs in complex reasoning tasks, even surpassing frontier models.

Quranic Conversations: Developing a Semantic Search tool for the Quran using Arabic NLP Techniques

arXiv

Researchers developed a semantic search tool for the Quran using Arabic NLP techniques. The tool was trained on a dataset of over 30 tafsirs (interpretations) of the Quran. Using the SNxLM model and cosine similarity, the tool identifies Quranic verses most relevant to a user's query, achieving a similarity score of up to 0.97. Why it matters: This tool could significantly improve access to the Quran's teachings for Arabic speakers and researchers, providing a valuable resource for religious study and understanding.
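
The ranking step described above can be sketched as follows: embed the query and each verse as vectors, then order verses by cosine similarity to the query. This is a minimal illustration, not the tool's implementation. The toy three-dimensional vectors, the verse IDs, and the helper names are all hypothetical; a real system would take its embeddings from a language model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_verses(query_vec, verse_vecs, top_k=3):
    """Return the top_k (verse_id, score) pairs, most similar first."""
    scored = [
        (verse_id, cosine_similarity(query_vec, vec))
        for verse_id, vec in verse_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings standing in for model output.
verse_vecs = {
    "verse_a": [1.0, 0.0, 0.0],
    "verse_b": [0.6, 0.8, 0.0],
    "verse_c": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.0, 0.0]

top = rank_verses(query_vec, verse_vecs, top_k=2)
print(top)
```

A score near 1 means the query and verse vectors point in almost the same direction, so a reported maximum of 0.97 indicates a very close match between a query and its best verse.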