Researchers introduce ASAD, a new large-scale, high-quality Arabic Sentiment Analysis Dataset based on 95K tweets with positive, negative, and neutral labels. The dataset is launched alongside a competition sponsored by KAUST that offers a total of 17,000 USD in prizes. Baseline models are implemented and their results reported as a reference for competition participants.
KAUST organized an Arabic Sentiment Analysis Challenge in which participants developed machine learning models to classify tweets as positive, negative, or neutral. The competition used the ASAD dataset, with 55K tweets for training, 20K for validation, and 20K for final evaluation. The full dataset of 100K labeled tweets has been released for public use.
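As a rough illustration of the 55K/20K/20K partition described above, the following minimal sketch splits labeled tweets into training, validation, and evaluation parts while keeping the label proportions roughly equal across parts. The function name `three_way_split`, the per-label stratification, and the toy data are assumptions made for illustration; the summary does not say how the organizers actually produced their split.

```python
import random
from collections import defaultdict


def three_way_split(examples, ratios=(55, 20, 20), seed=0):
    """Split (text, label) pairs into train/val/test parts, label by label.

    Hypothetical helper: `ratios` mirrors the 55K/20K/20K sizes quoted
    in the summary, and stratifying by label is an assumption.
    """
    rng = random.Random(seed)

    # Group examples by sentiment label so each part keeps similar label ratios.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    total = sum(ratios)
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = len(items) * ratios[0] // total
        n_val = len(items) * ratios[1] // total
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # any rounding remainder lands here
    return train, val, test


# Toy stand-in data: 285 placeholder tweets, 95 per label (not real ASAD content).
labels = ["positive", "negative", "neutral"]
toy = [(f"tweet {i}", labels[i % 3]) for i in range(285)]

train, val, test = three_way_split(toy)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible, which matters when a held-out evaluation part is meant to stay unseen during model development.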
The ArabJobs dataset is a new corpus of over 8,500 Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the UAE. The dataset contains over 550,000 words and captures linguistic, regional, and socio-economic variation in the Arab labor market. It is available on GitHub and can be used for fairness-aware Arabic NLP and labor market research.
The Saudi Privacy Policy Dataset is a new collection of Arabic privacy policies from various sectors in Saudi Arabia. The dataset is annotated according to the 10 principles of the Personal Data Protection Law (PDPL) and covers 1,000 websites, with 4,638 lines of text and 775,370 tokens. It aims to facilitate research and development in privacy policy analysis, NLP, and machine learning applications related to data protection.
This paper introduces a new task: detecting propaganda techniques in code-switched text. The authors created and released a corpus of 1,030 English-Roman Urdu code-switched texts annotated with 20 propaganda techniques. Experiments show the importance of directly modeling multilinguality and of choosing an appropriate fine-tuning strategy for this task.