ParlaMint is a CLARIN ERIC flagship project focused on harmonizing multilingual corpora of parliamentary sessions. The newest version, published in October 2023, covers 26 European parliaments with linguistic annotations and machine translations to English. Maciej Ogrodniczuk, Head of Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences, presented the project. Why it matters: While focused on European parliaments, the ParlaMint project provides a valuable model and infrastructure for creating comparable Arabic parliamentary corpora, which could enhance Arabic NLP research and political analysis in the Middle East.
This paper introduces a new task: detecting propaganda techniques in code-switched text. The authors created and released a corpus of 1,030 English-Roman Urdu code-switched texts annotated with 20 propaganda techniques. Experiments show the importance of directly modeling multilinguality and using the right fine-tuning strategy for this task.
This paper benchmarks multilingual and monolingual LLM performance across Arabic, English, and Indic languages, examining model compression effects like pruning and quantization. Multilingual models outperform language-specific counterparts, demonstrating cross-lingual transfer. Quantization maintains accuracy while promoting efficiency, but aggressive pruning compromises performance, particularly in larger models. Why it matters: The findings highlight strategies for scalable and fair multilingual NLP, addressing hallucination and generalization errors in low-resource languages.
Researchers compiled a 101 Billion Arabic Words Dataset by mining text from Common Crawl WET files and rigorously cleaning and deduplicating the extracted content. The dataset aims to address the scarcity of original, high-quality Arabic linguistic data, which often leads to bias in Arabic LLMs that rely on translated English data. This is the largest Arabic dataset available to date. Why it matters: The new dataset can significantly contribute to the development of authentic Arabic LLMs that are more linguistically and culturally accurate.
A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark includes 8k manually verified samples across 15 categories and varying video durations. A multilingual video LLM, ViMUL, is also presented, along with a training set of 1.2 million samples, with both to be publicly released.
MBZUAI releases Bactrian-X, a multilingual parallel dataset of 3.4 million instruction-response pairs across 52 languages. They trained low-rank adaptation (LoRA) adapters using this dataset, creating lightweight, replaceable components for large language models. Experiments show the LoRA-based models outperform vanilla and existing instruction-tuned models in multilingual settings.
Researchers introduce a benchmark to evaluate the factual recall and knowledge transferability of multilingual language models across 13 languages. The study reveals that language models often fail to transfer knowledge between languages, even when they possess the correct information in one language. The benchmark and evaluation framework are released to drive future research in multilingual knowledge transfer.