This study explores fine-tuning large language models (LLMs) for Arabic medical text generation to improve hospital management systems. A unique dataset was collected from social media, capturing medical conversations between patients and doctors, and used to fine-tune models like Mistral-7B, LLaMA-2-7B, and GPT-2. The fine-tuned Mistral-7B model outperformed the others with a BERTScore F1 of 68.5%. Why it matters: The research demonstrates the potential of generative AI to provide scalable and culturally relevant solutions for healthcare challenges in Arabic-speaking regions.
This paper benchmarks the performance of large language models (LLMs) on Arabic medical natural language processing tasks using the AraHealthQA dataset. The study evaluated LLMs in multiple-choice question answering, fill-in-the-blank, and open-ended question answering scenarios. The results showed that a majority-voting ensemble of Gemini 2.5 Flash, Gemini 2.5 Pro, and OpenAI o3 achieved 77% accuracy on MCQs, while other LLMs achieved a BERTScore of 86.44% on open-ended questions. Why it matters: The research highlights both the potential and limitations of current LLMs in Arabic clinical contexts, providing a baseline for future improvements in Arabic medical AI.
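The majority-voting setup in the summary above can be illustrated with a minimal sketch. The summary does not say how the study broke ties or normalized answers, so the tie-break rule here (fall back to one designated model) and the function names `majority_vote` and `accuracy` are assumptions for illustration only, not the authors' implementation.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str], fallback_index: int = 0) -> str:
    """Return the option chosen by most models for one MCQ.

    `answers` holds one option label per model (e.g. "A", "B").
    Assumption: on a tie, the answer of the model at `fallback_index` wins.
    """
    if not answers:
        raise ValueError("need at least one model answer")
    normalized = [a.strip().upper() for a in answers]
    counts = Counter(normalized).most_common()
    best_count = counts[0][1]
    tied = [label for label, count in counts if count == best_count]
    if len(tied) > 1:
        # No strict winner: defer to the designated fallback model.
        return normalized[fallback_index]
    return tied[0]


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions where the voted answer matches the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical answers from three models on three questions.
per_question = [["A", "B", "A"], ["C", "C", "D"], ["B", "A", "D"]]
voted = [majority_vote(q) for q in per_question]
```

With three voters, a tie only occurs when all three models disagree, which is why a single fallback model is enough to resolve it in this sketch.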
Researchers address the challenge of limited Arabic medical dialogue data by generating 80,000 synthetic question-answer pairs using ChatGPT-4o and Gemini 2.5 Pro, expanding an initial dataset of 20,000 records. They fine-tuned five LLMs, including Mistral-7B and AraGPT2, and evaluated performance using BERTScore and expert review. Results showed that training with ChatGPT-4o-generated data led to higher F1-scores and fewer hallucinations across models. Why it matters: This demonstrates the potential of synthetic data augmentation to improve domain-specific Arabic language models, particularly for low-resource medical NLP applications.
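For readers unfamiliar with the BERTScore metric cited in these summaries, the sketch below shows how its F1 is derived from token embeddings via greedy cosine matching. It uses toy vectors in place of real contextual embeddings and omits the optional IDF weighting and baseline rescaling. The helper names `cosine` and `bertscore_f1` are hypothetical, and none of this reflects the studies' actual evaluation code.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_f1(candidate: Sequence[Vector], reference: Sequence[Vector]) -> float:
    """Greedy-matching F1 over token embeddings (assumes non-empty, non-zero inputs).

    Precision: each candidate token is matched to its most similar reference token.
    Recall: each reference token is matched to its most similar candidate token.
    """
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy 2-dimensional "embeddings": the candidate has one extra, unrelated token.
candidate_tokens = [[1.0, 0.0], [0.0, 1.0]]
reference_tokens = [[1.0, 0.0]]
score = bertscore_f1(candidate_tokens, reference_tokens)
```

In this toy case the extra candidate token lowers precision to 0.5 while recall stays at 1.0, which shows how the metric penalizes generated text that goes beyond the reference.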
Researchers from Georgia Tech explored Arabic medical text classification using 82 categories from the AbjadMed dataset. They compared fine-tuned AraBERTv2 encoders with hybrid pooling against multilingual encoders and large causal decoders like Llama 3.3 70B and Qwen 3B. The study found that bidirectional encoders outperformed causal decoders in capturing semantic boundaries for fine-grained medical text classification. Why it matters: The research provides insights into optimal model selection for specialized Arabic NLP tasks, specifically highlighting the effectiveness of fine-tuned encoders for medical text categorization.
A new study introduces Sporo AraSum, a language model designed for Arabic clinical documentation, and compares it to JAIS using synthetic datasets and modified PDQI-9 metrics. Sporo AraSum significantly outperformed JAIS in quantitative AI metrics and qualitative attributes related to accuracy, utility, and cultural competence. The model addresses the nuances of Arabic while reducing AI hallucinations, making it suitable for Arabic-speaking healthcare. Why it matters: The model offers a more culturally and linguistically sensitive solution for Arabic clinical documentation, potentially improving healthcare workflows and patient outcomes in the region.