GCC AI Research

Results for "AraHealthQA"

Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks

arXiv

This paper benchmarks large language models (LLMs) on Arabic medical natural language processing tasks using the AraHealthQA dataset. The study evaluated LLMs on three task types: multiple-choice question answering (MCQ), fill-in-the-blank, and open-ended question answering. A majority-voting ensemble of Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3 achieved 77% accuracy on MCQs, while other LLMs achieved a BERTScore of 86.44% on open-ended questions. Why it matters: The research highlights both the potential and the limitations of current LLMs in Arabic clinical contexts and provides a baseline for future improvements in Arabic medical AI.
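
The majority-voting idea mentioned in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the model names, the answers, the gold labels, and the tie-breaking rule (fall back to the first model's answer) are all hypothetical and chosen only to show how per-question votes from three models combine into one prediction.

```python
from collections import Counter


def majority_vote(answers, fallback_index=0):
    """Return the most common answer among the models' answers.

    Assumption (not from the paper): when the top two answers are tied,
    fall back to the answer of the model at `fallback_index`.
    """
    ranked = Counter(answers).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return answers[fallback_index]
    return ranked[0][0]


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical MCQ answers from three placeholder models, one entry per question.
model_answers = {
    "model_a": ["A", "B", "C", "D"],
    "model_b": ["A", "C", "C", "B"],
    "model_c": ["B", "C", "A", "A"],
}
gold = ["A", "C", "C", "B"]  # hypothetical gold labels

# Regroup answers per question, then vote on each question.
per_question = list(zip(*model_answers.values()))
ensemble = [majority_vote(list(votes)) for votes in per_question]

if __name__ == "__main__":
    print(ensemble)
    print(accuracy(ensemble, gold))
```

On the last question all three placeholder models disagree, so the sketch falls back to the first model's answer; a real system might instead pick the model with the best validation accuracy.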

AraGPT2: Pre-Trained Transformer for Arabic Language Generation

arXiv

The paper introduces AraGPT2, a suite of pre-trained transformer models for Arabic language generation, with the largest model (AraGPT2-mega) containing 1.46 billion parameters. Trained on a large Arabic corpus of internet text and news, AraGPT2-mega demonstrates strong performance in synthetic news generation and zero-shot question answering. To address the risk of misuse, the authors also released a discriminator model with 98% accuracy in detecting AI-generated text. Why it matters: This release of both the model and discriminator fills a critical gap in Arabic NLP and encourages further research and applications in the field.

ArabicaQA: A Comprehensive Dataset for Arabic Question Answering

arXiv

Researchers introduce ArabicaQA, a large-scale dataset for Arabic question answering, comprising 89,095 answerable and 3,701 unanswerable questions. They also present AraDPR, a dense passage retrieval model trained on the Arabic Wikipedia. The paper includes benchmarking of large language models (LLMs) for Arabic question answering. Why it matters: This work addresses a significant gap in Arabic NLP resources and provides valuable tools and benchmarks for advancing research in the field.

AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic

arXiv

The paper introduces AraTrust, a new benchmark for evaluating the trustworthiness of LLMs when prompted in Arabic. The benchmark contains 522 multiple-choice questions covering dimensions like truthfulness, ethics, safety, and fairness. Experiments using AraTrust showed that GPT-4 performed the best, while open-source models like AceGPT 7B and Jais 13B had lower scores. Why it matters: This benchmark addresses a critical gap in evaluating LLMs for Arabic, which is essential for ensuring the safe and ethical deployment of AI in the Arab world.

AraNet: A Deep Learning Toolkit for Arabic Social Media

arXiv

Researchers introduce AraNet, a deep learning toolkit for Arabic social media processing. The toolkit uses BERT models trained on social media datasets to predict age, dialect, gender, emotion, irony, and sentiment. AraNet achieves state-of-the-art or competitive performance on these tasks without feature engineering. Why it matters: The public release of AraNet accelerates Arabic NLP research by providing a comprehensive, deep learning-based tool for various social media analysis tasks.