Skip to content
GCC AI Research

Search

Results for "Qwen2"

Resource-Aware Arabic LLM Creation: Model Adaptation, Integration, and Multi-Domain Testing

arXiv ·

Researchers fine-tuned the Qwen2-1.5B model for Arabic using QLoRA on a 4GB VRAM system, using datasets like Bactrian and Arabic Wikipedia. They addressed challenges in Arabic NLP including morphology and dialectal variations. After 10,000 training steps, the final loss converged to 0.1083 with improved handling of Arabic-specific linguistic phenomena. Why it matters: This demonstrates a resource-efficient approach for creating specialized Arabic language models, democratizing access to advanced NLP technologies.

K2: An open source model that delivers frontier capabilities

MBZUAI ·

MBZUAI's Institute of Foundation Models has released K2, a 70-billion-parameter, reasoning-centric foundation model. K2 is designed to be fully inspectable, with open weights, training code, data composition, mid-training checkpoints, and evaluation harnesses. K2 outperforms Qwen2.5-72B and approaches the performance of Qwen3-235B. Why it matters: This release promotes transparency and reproducibility in AI development, providing researchers with the resources needed to study, adapt, and build upon a strong foundation model.

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

arXiv ·

The paper introduces AraToken, an Arabic-optimized tokenizer based on the SentencePiece Unigram algorithm that incorporates a normalization pipeline to handle Arabic-specific orthographic variations. Experiments show that AraToken achieves 18% lower fertility compared to unnormalized baselines. The Language Extension Pipeline (LEP) is introduced to integrate AraToken into Qwen3-0.6B, reducing evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. Why it matters: This research provides an efficient tokenizer tailored for Arabic, improving performance of LLMs on Arabic text and benefiting Arabic NLP research by providing released resources.

K2 Think V2: a fully sovereign reasoning model

MBZUAI ·

MBZUAI's Institute of Foundation Models (IFM) has released K2 Think V2, a 70 billion parameter open-source general reasoning model built on K2 V2 Instruct. The model excels in complex reasoning benchmarks like AIME2025 and GPQA-Diamond, and features a low hallucination rate with long context reasoning capabilities. K2 Think V2 is fully sovereign and open, from pre-training through post-training, using IFM-curated data and a Guru dataset. Why it matters: This release contributes to closing the gap between community-owned reproducible AI and proprietary models, particularly in reasoning and long-context understanding for Arabic NLP tasks.

K2-V2: Full Openness Finally Meets Real Performance

MBZUAI ·

IFM has released K2-V2, a 70B-class LLM that takes a "360-open" approach by making its weights, data, training details, checkpoints, and fine-tuning recipes publicly available. K2-V2 matches leading open-weight model performance while offering full transparency, contrasting with proprietary and semi-open Chinese models. Independent evaluations show K2 as a high-performance, fully open-source alternative in the AI landscape. Why it matters: K2-V2 provides developers with a transparent and reproducible foundation model, fostering trust and enabling customization without sacrificing performance, which is crucial for sensitive applications in the region.

NatiQ: An End-to-end Text-to-Speech System for Arabic

arXiv ·

Qatar Computing Research Institute (QCRI) has developed NatiQ, an end-to-end text-to-speech (TTS) system for Arabic utilizing encoder-decoder architectures. The system employs Tacotron-based models and Transformer models to generate mel-spectrograms, which are then synthesized into waveforms using vocoders like WaveRNN, WaveGlow, and Parallel WaveGAN. Trained on in-house speech data featuring a neutral male voice (Hamza) and an expressive female voice (Amina), NatiQ achieves a Mean Opinion Score (MOS) of 4.21 and 4.40, respectively. Why it matters: This research advances Arabic language technology, providing high-quality TTS synthesis that can enhance accessibility and usability of digital content for Arabic speakers.