MBZUAI's Hanan Al Darmaki is working to improve automated speech recognition (ASR) for low-resource languages, where labeled data is scarce. She notes that Arabic presents unique challenges due to dialectal variations and a lack of written resources corresponding to spoken dialects. Al Darmaki's research focuses on unsupervised speech recognition to address this gap. Why it matters: Overcoming these challenges can improve virtual assistant effectiveness across diverse languages and enable more inclusive AI applications in the Arabic-speaking world.
The paper introduces AraELECTRA, a new Arabic language representation model. AraELECTRA is pre-trained using the replaced token detection objective on large Arabic text corpora. The model is evaluated on multiple Arabic NLP tasks, including reading comprehension, sentiment analysis, and named-entity recognition. Why it matters: AraELECTRA outperforms current state-of-the-art Arabic language representation models, given the same pretraining data and even with a smaller model size, advancing Arabic NLP.
A CMU professor and MBZUAI affiliated faculty presented research on how LLMs store and use knowledge learned during pre-training. The study used a synthetic biography dataset to show that LLMs may not effectively use memorized knowledge at inference time, even with zero training loss. Data augmentation during pre-training can force the model to store knowledge in specific token embeddings. Why it matters: The research highlights limitations in LLM knowledge manipulation and extraction, with implications for improving model architectures and training strategies for more effective knowledge utilization in Arabic LLMs.
This survey paper reviews the landscape of Natural Language Processing (NLP) research and applications in the Arab world. It discusses the unique challenges posed by the Arabic language, such as its morphological complexity and dialectal diversity. The paper also presents a historical overview of Arabic NLP and surveys various research areas, including machine translation, sentiment analysis, and speech recognition. Why it matters: The survey provides a comprehensive resource for researchers and practitioners interested in the current state and future directions of Arabic NLP, a field critical for enabling AI technologies to serve Arabic-speaking communities.
MBZUAI's Dr. Fajri Koto presented research on overcoming challenges in NLP for underrepresented languages. His work includes creating multilingual datasets for Indonesian languages by engaging native speakers and finding that direct composition yields better results than translation. He also discussed vocabulary adaptation and zero-shot learning to address computational resource limitations, and emphasized the importance of datasets with local context for evaluating LLMs. Why it matters: This research addresses critical gaps in NLP for low-resource languages, providing insights and techniques to improve performance and cultural relevance in multilingual AI models within the region and globally.
This article previews a talk by Gül Varol from Ecole des Ponts ParisTech on bridging natural language and 3D human motions. The talk will cover text-to-motion synthesis using generative models and text-to-motion retrieval models based on the ACTOR, TEMOS, TMR, TEACH, and SINC papers. Varol's research interests include video representation learning, human motion synthesis, and sign languages. Why it matters: Research in this area could enable more intuitive human-computer interaction and new applications in areas like virtual reality and robotics.
Researchers introduce a benchmark to evaluate the factual recall and knowledge transferability of multilingual language models across 13 languages. The study reveals that language models often fail to transfer knowledge between languages, even when they possess the correct information in one language. The benchmark and evaluation framework are released to drive future research in multilingual knowledge transfer.
Liangming Pan from UCSB presented research on building reliable generative AI agents by integrating symbolic representations with LLMs. The neuro-symbolic strategy combines the flexibility of language models with precise knowledge representation and verifiable reasoning. The work covers Logic-LM, ProgramFC, and learning from automated feedback, aiming to address LLM limitations in complex reasoning tasks. Why it matters: Improving the reliability of LLMs is crucial for high-stakes applications in finance, medicine, and law within the region and globally.