The paper introduces a novel actor-critic framework called Distillation Policy Optimization that combines on-policy and off-policy data for reinforcement learning. It incorporates variance reduction mechanisms like a unified advantage estimator (UAE) and a residual baseline. The empirical results demonstrate improved sample efficiency for on-policy algorithms, bridging the gap with off-policy methods.
A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. The benchmark addresses the limitations of existing datasets by combining temporal length and multimodal richness, using human-validated samples. LongShOTAgent, an agentic system, is also presented for analyzing long videos, with both the benchmark and agent demonstrating the challenges faced by state-of-the-art MLLMs.
This paper presents team SPPU-AASM's hybrid model for Arabic sarcasm and sentiment detection in the WANLP ArSarcasm shared task 2021. The model combines sentence representations from AraBERT with static word vectors trained on Arabic social media corpora. Results show the system achieves an F1-sarcastic score of 0.62 and a F-PN score of 0.715, outperforming existing approaches. Why it matters: The research demonstrates that combining context-free and contextualized representations improves performance in nuanced Arabic NLP tasks like sarcasm and sentiment analysis.
This paper introduces a new task: detecting propaganda techniques in code-switched text. The authors created and released a corpus of 1,030 English-Roman Urdu code-switched texts annotated with 20 propaganda techniques. Experiments show the importance of directly modeling multilinguality and using the right fine-tuning strategy for this task.
This survey paper analyzes over 40 benchmarks used to evaluate Arabic large language models, categorizing them into Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. It identifies progress in benchmark diversity but also highlights gaps like limited temporal evaluation and cultural misalignment. The paper also examines methods for creating benchmarks, including native collection, translation, and synthetic generation. Why it matters: The survey provides a comprehensive reference for Arabic NLP research and offers recommendations for future benchmark development to better align with cultural contexts.