GCC AI Research

Results for "RedPajama"

SlimPajama-DC: Understanding Data Combinations for LLM Training

arXiv

Researchers at MBZUAI have released SlimPajama-DC, an empirical analysis of data combinations for pretraining LLMs on the SlimPajama dataset. The study examines the impact of global versus local deduplication and of the proportions in which highly deduplicated multi-source datasets are combined. Results show that greater data diversity after global deduplication is crucial, with the best configuration outperforming models trained on RedPajama.

CamelEval: Advancing Culturally Aligned Arabic Language Models and Benchmarks

arXiv

The paper introduces Juhaina, a 9.24B-parameter Arabic-English bilingual LLM trained with an 8,192-token context window. It identifies limitations in the Open Arabic LLM Leaderboard (OALL) and proposes a new benchmark, CamelEval, for more comprehensive evaluation. Juhaina outperforms models like Llama and Gemma in generating helpful Arabic responses and understanding cultural nuances. Why it matters: This culturally aligned LLM and its associated benchmark could significantly advance Arabic NLP and democratize AI access for Arabic speakers.

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv

LongShOTBench is a new benchmark for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. It addresses the limitations of existing datasets by combining temporal length with multimodal richness and by using human-validated samples. The paper also presents LongShOTAgent, an agentic system for analyzing long videos. Results for both the benchmark and the agent highlight the difficulties that state-of-the-art multimodal LLMs (MLLMs) still face on these tasks.

VideoMolmo: Spatio-Temporal Grounding Meets Pointing

arXiv

Researchers from MBZUAI have introduced VideoMolmo, a large multimodal model for spatio-temporal pointing conditioned on textual descriptions. The model incorporates a temporal module with an attention mechanism and a temporal mask fusion pipeline that uses SAM2 to improve coherence across video sequences. The authors also curated a dataset of 72k video-caption pairs and introduced VPoS-Bench, a benchmark for evaluating generalization across real-world scenarios. Code and models are publicly available.

AraGPT2: Pre-Trained Transformer for Arabic Language Generation

arXiv

The paper introduces AraGPT2, a suite of pre-trained transformer models for Arabic language generation, with the largest model (AraGPT2-mega) containing 1.46 billion parameters. Trained on a large Arabic corpus of internet text and news, AraGPT2-mega demonstrates strong performance in synthetic news generation and zero-shot question answering. To address the risk of misuse, the authors also released a discriminator model with 98% accuracy in detecting AI-generated text. Why it matters: This release of both the model and discriminator fills a critical gap in Arabic NLP and encourages further research and applications in the field.