Middle East AI

This Week arXiv

ASAD: A Twitter-based Benchmark Arabic Sentiment Analysis Dataset

arXiv · · Notable

Summary

Researchers introduce ASAD, a new large-scale, high-quality Arabic Sentiment Analysis Dataset based on 95K tweets with positive, negative, and neutral labels. The dataset is launched with a competition sponsored by KAUST offering a total of 17000 USD in prizes. Baseline models are implemented and results reported to provide a reference for competition participants.

Keywords

Arabic Sentiment Analysis · Dataset · ASAD · Twitter · KAUST

Get the weekly digest

Top AI stories from the GCC region, every week.

Related

Overview of the Arabic Sentiment Analysis 2021 Competition at KAUST

arXiv ·

KAUST organized an Arabic Sentiment Analysis Challenge where participants developed ML models to classify tweets as positive, negative, or neutral. The competition used the ASAD dataset with 55K tweets for training, 20K for validation, and 20K for final evaluation. The full dataset of 100K labeled tweets has been released for public use.

ArabJobs: A Multinational Corpus of Arabic Job Ads

arXiv ·

The ArabJobs dataset is a new corpus of over 8,500 Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the UAE. The dataset contains over 550,000 words and captures linguistic, regional, and socio-economic variation in the Arab labor market. It is available on GitHub and can be used for fairness-aware Arabic NLP and labor market research.

The Saudi Privacy Policy Dataset

arXiv ·

A new dataset called the Saudi Privacy Policy Dataset is introduced, which contains Arabic privacy policies from various sectors in Saudi Arabia. The dataset is annotated based on the 10 principles of the Personal Data Protection Law (PDPL) and includes 1,000 websites, 4,638 lines of text, and 775,370 tokens. The dataset aims to facilitate research and development in privacy policy analysis, NLP, and machine learning applications related to data protection.

Detecting Propaganda Techniques in Code-Switched Social Media Text

arXiv ·

This paper introduces a new task: detecting propaganda techniques in code-switched text. The authors created and released a corpus of 1,030 English-Roman Urdu code-switched texts annotated with 20 propaganda techniques. Experiments show the importance of directly modeling multilinguality and using the right fine-tuning strategy for this task.