Middle East AI

This Week arXiv

Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021

arXiv · · Notable

Summary

This paper introduces two shared tasks for abusive and threatening language detection in Urdu, a low-resource language with over 170 million speakers. The tasks involve binary classification of Urdu tweets into Abusive/Non-Abusive and Threatening/Non-Threatening categories, respectively. Datasets of 2400/6000 training tweets and 1100/3950 testing tweets were created and manually annotated, along with logistic regression and BERT-based baselines. 21 teams participated and the best systems achieved F1-scores of 0.880 and 0.545 on the abusive and threatening language tasks, respectively, with m-BERT showing the best performance.

Keywords

Urdu · abusive language detection · threatening language detection · FIRE 2021 · m-BERT

Get the weekly digest

Top AI stories from the GCC region, every week.

Related

Overview of the Shared Task on Fake News Detection in Urdu at FIRE 2021

arXiv ·

This paper provides an overview of the UrduFake@FIRE2021 shared task, which focused on fake news detection in the Urdu language. The task involved binary classification of news articles into real or fake categories using a dataset of 1300 training and 300 testing articles across five domains. 34 teams registered, with 18 submitting results and 11 providing technical reports detailing various approaches from BoW to Transformer models, with the best system achieving an F1-macro score of 0.679.

UrduFake@FIRE2021: Shared Track on Fake News Identification in Urdu

arXiv ·

The UrduFake@FIRE2021 shared task focused on fake news detection in the Urdu language, framed as a binary classification problem. 34 teams registered, with 18 submitting results and 11 providing technical reports, showcasing diverse approaches. The top-performing system utilized the stochastic gradient descent (SGD) algorithm, achieving an F-score of 0.679.

Detecting Propaganda Techniques in Code-Switched Social Media Text

arXiv ·

This paper introduces a new task: detecting propaganda techniques in code-switched text. The authors created and released a corpus of 1,030 English-Roman Urdu code-switched texts annotated with 20 propaganda techniques. Experiments show the importance of directly modeling multilinguality and using the right fine-tuning strategy for this task.

UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking

arXiv ·

Researchers from MBZUAI have introduced UrduFactCheck, a new framework for fact-checking in Urdu, along with two datasets: UrduFactBench and UrduFactQA. The framework uses monolingual and translation-based evidence retrieval to address the lack of Urdu resources. Evaluations using twelve LLMs showed that translation-augmented methods improve performance, highlighting challenges for open-source LLMs in Urdu.