Researchers from MBZUAI have released MobiLlama, a fully transparent open-source 0.5 billion parameter Small Language Model (SLM). MobiLlama is designed for resource-constrained devices, emphasizing enhanced performance with reduced resource demands. The full training data pipeline, code, model weights, and checkpoints are available on Github.
Researchers introduce PALO, a polyglot large multimodal model with visual reasoning capabilities in 10 major languages including Arabic. A semi-automated translation approach was used to adapt the multimodal instruction dataset from English to the target languages. The models are trained across three scales (1.7B, 7B and 13B parameters) and a multilingual multimodal benchmark is proposed for evaluation.
MBZUAI researchers introduce BiMediX, a bilingual (English and Arabic) mixture of experts LLM for medical applications. The model is trained on BiMed1.3M, a new 1.3 million bilingual instruction dataset and outperforms existing models like Med42 and Jais-30B on medical benchmarks. Code and models are available on Github.
This paper investigates the intrinsic self-correction capabilities of LLMs, identifying model confidence as a key latent factor. Researchers developed an "If-or-Else" (IoE) prompting framework to guide LLMs in assessing their own confidence and improving self-correction accuracy. Experiments demonstrate that the IoE-based prompt enhances the accuracy of self-corrected responses, with code available on GitHub.
MBZUAI researchers introduce M4GT-Bench, a new benchmark for evaluating machine-generated text (MGT) detection across multiple languages and domains. The benchmark includes tasks for binary MGT detection, identifying the specific model that generated the text, and detecting mixed human-machine text. Experiments with baseline models and human evaluation show that MGT detection performance is highly dependent on access to training data from the same domain and generators.
Researchers from the National Center for AI in Saudi Arabia investigated the sensitivity of Large Language Model (LLM) leaderboards to minor benchmark perturbations. They found that small changes, like choice order, can shift rankings by up to 8 positions. The study recommends hybrid scoring and warns against over-reliance on simple benchmark evaluations, providing code for further research.