GCC AI Research

Masader Plus: A New Interface for Exploring +500 Arabic NLP Datasets

arXiv · Notable

Summary

Researchers have developed Masader Plus, a web interface for browsing the Masader catalog of Arabic NLP datasets. The interface supports exploring and filtering the catalog, and provides API access for examining datasets. User interactions with the website are also intended to feed back into improving the catalog itself. Why it matters: The interface lowers the barrier to entry for researchers seeking Arabic NLP datasets, which should facilitate more research in the field.

Keywords

Arabic NLP · dataset · Masader · interface · catalog

Related

Masader: Metadata Sourcing for Arabic Text and Speech Data Resources

arXiv

Researchers created Masader, the largest public catalog of Arabic NLP datasets, containing 200 datasets annotated with 25 attributes. They also developed a metadata annotation strategy that can be applied to other languages. The paper highlights issues in current Arabic NLP datasets and offers recommendations. Why it matters: This curated dataset directory helps lower the barrier to entry for Arabic NLP research and development.

ArabJobs: A Multinational Corpus of Arabic Job Ads

arXiv

The ArabJobs dataset is a new corpus of over 8,500 Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the UAE. The dataset contains over 550,000 words and captures linguistic, regional, and socio-economic variation in the Arab labor market. It is available on GitHub and can be used for fairness-aware Arabic NLP and labor market research.