Building Arabic NLP from the Ground Up: Twenty Years of Lessons, Failures, and Open Problems

arXiv · May 20, 2026 · Significant research

NLP Arabic AI Research Infrastructure Policy

Summary

This paper reflects on two decades of building NLP resources and research infrastructure for Arabic, a historically underserved language. The first decade focused on foundational linguistic infrastructure, while the second shifted towards computational social science and socially oriented applications. The authors highlight three lessons: dataset building is a social process, communities often matter more than shared tasks, and computational social science exposes challenges beyond traditional NLP training.

Why it matters: The paper argues that the most difficult problems in developing NLP for underserved communities are social, institutional, and epistemic rather than purely technical, offering critical insights for future research directions in Arabic AI.

Keywords

Arabic NLP · Language Resources · Underserved Languages · Computational Social Science · Research Infrastructure

