Masader: Metadata Sourcing for Arabic Text and Speech Data Resources

arXiv · October 13, 2021 · Notable

Summary

Researchers created Masader, the largest public catalog of Arabic NLP datasets, containing 200 datasets annotated with 25 attributes. They also developed a metadata annotation strategy that could be extended to other languages. The paper highlights issues with existing Arabic NLP datasets and offers recommendations for addressing them. Why it matters: This curated dataset directory lowers the barrier to entry for Arabic NLP research and development.

Keywords

Arabic NLP · datasets · metadata · low-resource languages · Masader

Masader Plus: A New Interface for Exploring +500 Arabic NLP Datasets

arXiv · Aug 1

Researchers have developed Masader Plus, a web interface for browsing the Masader catalog of Arabic NLP datasets. The interface supports data exploration, filtering, and API access for examining datasets. User interactions with the website are intended to feed back into improving the catalog itself. Why it matters: This interface lowers the barrier to entry for researchers seeking Arabic NLP datasets, facilitating more research in the field.

MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs

arXiv · May 26

KAUST researchers introduced MOLE, a framework that uses LLMs to automatically extract metadata from scientific papers. The system processes documents in multiple formats and validates its outputs, and it covers datasets in languages other than Arabic. A new benchmark dataset has been released to evaluate progress in metadata extraction.

QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus

arXiv · Jun 24

The Qatar Computing Research Institute (QCRI) has released QASR, a 2,000-hour transcribed Arabic speech corpus collected from Aljazeera news broadcasts. The dataset features multi-dialect speech sampled at 16 kHz, aligned with lightly supervised transcriptions and linguistically motivated segmentation. QCRI also released a 130M-word dataset to improve language model training. Why it matters: QASR enables new research in Arabic speech recognition, dialect identification, punctuation restoration, and other NLP tasks for spoken data.