101 Billion Arabic Words Dataset

arXiv · April 29, 2024 · Significant research

Summary

Researchers compiled the 101 Billion Arabic Words Dataset by mining text from Common Crawl WET files, then rigorously cleaning and deduplicating the extracted content. The dataset aims to address the scarcity of original, high-quality Arabic text, a gap that often leads to bias in Arabic LLMs trained on translated English data. The authors describe it as the largest Arabic dataset available to date. Why it matters: The dataset could significantly aid the development of authentic Arabic LLMs that are more linguistically and culturally accurate.

Keywords

Arabic LLM · dataset · Common Crawl · data mining · bias

Related

ArabJobs: A Multinational Corpus of Arabic Job Ads

arXiv · Sep 26

The ArabJobs dataset is a new corpus of over 8,500 Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the UAE. The dataset contains over 550,000 words and captures linguistic, regional, and socio-economic variation in the Arab labor market. It is available on GitHub and can be used for fairness-aware Arabic NLP and labor market research.

QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus

arXiv · Jun 24

The Qatar Computing Research Institute (QCRI) has released QASR, a 2,000-hour transcribed Arabic speech corpus collected from Aljazeera news broadcasts. The dataset features multi-dialect speech sampled at 16 kHz, aligned with lightly supervised transcriptions and linguistically motivated segmentation. QCRI also released a 130M-word text dataset to improve language model training. Why it matters: QASR enables new research in Arabic speech recognition, dialect identification, punctuation restoration, and other NLP tasks for spoken data.
