Skip to content
GCC AI Research

Search

Results for "multilingual corpus"

A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

arXiv ·

A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark includes 8k manually verified samples across 15 categories and varying video durations. A multilingual video LLM, ViMUL, is also presented, along with a training set of 1.2 million samples, with both to be publicly released.

Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks

arXiv ·

This paper benchmarks multilingual and monolingual LLM performance across Arabic, English, and Indic languages, examining model compression effects like pruning and quantization. Multilingual models outperform language-specific counterparts, demonstrating cross-lingual transfer. Quantization maintains accuracy while promoting efficiency, but aggressive pruning compromises performance, particularly in larger models. Why it matters: The findings highlight strategies for scalable and fair multilingual NLP, addressing hallucination and generalization errors in low-resource languages.

ParlaMint 4.0: Parliamentary Debates going Comparable

MBZUAI ·

ParlaMint is a CLARIN ERIC flagship project focused on harmonizing multilingual corpora of parliamentary sessions. The newest version, published in October 2023, covers 26 European parliaments with linguistic annotations and machine translations to English. Maciej Ogrodniczuk, Head of Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences, presented the project. Why it matters: While focused on European parliaments, the ParlaMint project provides a valuable model and infrastructure for creating comparable Arabic parliamentary corpora, which could enhance Arabic NLP research and political analysis in the Middle East.

Predicting and Explaining Cross-lingual Zero-shot and Few-shot Transfer in LLMs

MBZUAI ·

Project LITMUS explores predicting cross-lingual transfer accuracy in multilingual language models, even without test data in target languages. The goal is to estimate model performance in low-resource languages and optimize training data for desired cross-lingual performance. This research aims to identify factors influencing cross-lingual transfer, contributing to linguistically fair MMLMs. Why it matters: Improving cross-lingual transfer is vital for creating more equitable and effective multilingual AI systems, especially for languages with limited resources.

Performance Prediction via Bayesian Matrix Factorisation for Multilingual Natural Language Processing Tasks

MBZUAI ·

A new Bayesian matrix factorization approach is explored for performance prediction in multilingual NLP, aiming to reduce the experimental burden of evaluating various language combinations. The approach outperforms state-of-the-art methods in NLP benchmarks like machine translation and cross-lingual entity linking. It also avoids hyperparameter tuning and provides uncertainty estimates over predictions. Why it matters: Accurate performance prediction methods accelerate multilingual NLP research by reducing computational costs and improving experimental efficiency, especially valuable for Arabic NLP tasks.

Detecting Propaganda Techniques in Code-Switched Social Media Text

arXiv ·

This paper introduces a new task: detecting propaganda techniques in code-switched text. The authors created and released a corpus of 1,030 English-Roman Urdu code-switched texts annotated with 20 propaganda techniques. Experiments show the importance of directly modeling multilinguality and using the right fine-tuning strategy for this task.

A Cross-cultural Corpus of Annotated Verbal and Nonverbal Behaviors in Receptionist Encounters

arXiv ·

Researchers created a cross-cultural corpus of annotated verbal and nonverbal behaviors in receptionist interactions. The corpus includes native speakers of American English and Arabic role-playing scenarios at university reception desks in Doha, Qatar, and Pittsburgh, USA. The manually annotated nonverbal behaviors include gaze direction, hand gestures, torso positions, and facial expressions. Why it matters: This resource can be valuable for the human-robot interaction community, especially for building culturally aware AI systems.