This paper introduces GigaBERT, a customized bilingual BERT model pre-trained for Arabic NLP and English-to-Arabic zero-shot transfer learning. The study evaluates GigaBERT's performance on four information extraction tasks: named entity recognition, part-of-speech tagging, argument role labeling, and relation extraction. Results show that GigaBERT outperforms mBERT, XLM-RoBERTa, and AraBERT in both supervised and zero-shot transfer settings. Why it matters: GigaBERT advances Arabic NLP by providing a high-performing, publicly available model tailored for the complexities of the Arabic language and cross-lingual applications.
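For readers who want to try the zero-shot setup, here is a minimal sketch using Hugging Face Transformers: fine-tune a bilingual checkpoint on English NER data, then evaluate it directly on Arabic. The checkpoint identifier below is an assumption based on the authors' public release; verify it against the paper's repository before use.

```python
# Minimal sketch: loading a bilingual BERT checkpoint for Arabic NER
# with Hugging Face Transformers. The model name is an assumption;
# substitute the identifier published by the GigaBERT authors.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "lanwuwei/GigaBERT-v4-Arabic-and-English"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=9)

# Zero-shot transfer setting: fine-tune on English NER data only,
# then evaluate directly on Arabic test data with no Arabic labels.
tokens = ["Barack", "Obama", "visited", "Cairo", "."]
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, sequence_length, num_labels)
```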
Researchers at the American University of Beirut (AUB) have released AraBERT, a BERT model pre-trained specifically for Arabic language understanding. The model was trained on a large Arabic corpus and benchmarked against multilingual BERT and other state-of-the-art methods, achieving state-of-the-art results on Arabic NLP tasks including sentiment analysis, named entity recognition, and question answering. Why it matters: This release provides the Arabic NLP community with a high-performing, open-source language model, facilitating further research and development.
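A quick way to probe the released model is through the Transformers fill-mask pipeline. We believe "aubmindlab/bert-base-arabert" is the identifier AUB published on the Hugging Face Hub, but treat it as an assumption and confirm on the model card.

```python
# Probing AraBERT's masked-language-model head via the pipeline API.
# "aubmindlab/bert-base-arabert" is assumed to be the published repo id.
from transformers import pipeline

fill = pipeline("fill-mask", model="aubmindlab/bert-base-arabert")
# "The capital of Lebanon is [MASK]."
for pred in fill("عاصمة لبنان هي [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```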
G42 has launched Nanda 87B, an open-source Hindi-English LLM developed by MBZUAI in collaboration with Inception and Cerebras. Nanda 87B is built upon Llama-3.1 70B and trained on a dataset with over 65 billion Hindi tokens. The model is engineered for real-world use: it is fluent in formal Hindi, casual speech, and Hinglish, and supports translation, summarization, instruction following, and transliteration. Why it matters: This release marks a major advancement in creating inclusive AI technology tailored for one of the world's largest linguistic communities.
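A hedged sketch of prompting an instruction-tuned Hindi-English model through Transformers is below. The repo id "MBZUAI/Nanda-87B-Chat" is a placeholder assumption, not a confirmed identifier; check the official release page for the actual name.

```python
# Hypothetical usage sketch for a Hindi-English chat model; the repo id
# is a placeholder, not a confirmed identifier from the release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MBZUAI/Nanda-87B-Chat"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# "Please translate this sentence into English: The weather is very nice today."
prompt = "कृपया इस वाक्य का अंग्रेज़ी में अनुवाद करें: मौसम आज बहुत अच्छा है।"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```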
Technology Innovation Institute (TII) in the UAE has launched Falcon 180B, an open-access large language model with 180 billion parameters trained on 3.5 trillion tokens. Falcon 180B ranks first on the Hugging Face Leaderboard for pretrained LLMs, outperforming Meta's LLaMA 2 and nearing the performance of OpenAI's GPT-4 and Google's PaLM 2. The model is available for research and commercial use under the 'Falcon 180B TII License', based upon Apache 2.0. Why it matters: This release strengthens the UAE's position in AI development and promotes open access to advanced AI technology, fostering innovation and collaboration.
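The weights are distributed through the gated "tiiuae/falcon-180B" repository on the Hugging Face Hub, so you must accept the license terms and authenticate before downloading. The sketch below assumes a multi-GPU host: 180B parameters need roughly 360 GB of accelerator memory in bfloat16, hence the device_map="auto" sharding.

```python
# Loading Falcon 180B from its gated Hugging Face repo. Requires
# accepting the Falcon 180B TII License and logging in first
# (huggingface-cli login), plus enough GPUs to hold ~360 GB in bf16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-180B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("The United Arab Emirates is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```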
Dr. Mikhail Burtsev of the London Institute for Mathematical Sciences presented research on GENA-LM, a suite of transformer-based DNA language models. The talk addressed the challenge of scaling transformers to genomic sequences, proposing recurrent memory augmentation to handle long inputs efficiently. This approach improves language modeling performance and holds promise for memory-intensive applications in bioinformatics. Why it matters: This research can significantly advance AI's capabilities in genomics by enabling the processing of much larger DNA sequences, with potential breakthroughs in understanding and treating diseases.
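To make the mechanism concrete, here is a minimal PyTorch sketch of the recurrent-memory idea: a long sequence is split into segments, a small bank of learned memory embeddings is prepended to each segment, and the encoder's output at those positions is carried forward as the memory for the next segment. This illustrates the general technique only; it is not the GENA-LM or RMT implementation.

```python
# Minimal sketch of segment-level recurrence with memory tokens.
# Not the GENA-LM/RMT code; an illustration of the mechanism.
import torch
import torch.nn as nn

class RecurrentMemoryEncoder(nn.Module):
    def __init__(self, d_model=64, n_mem=4, seg_len=128):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(n_mem, d_model))  # learned initial memory
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.n_mem, self.seg_len = n_mem, seg_len

    def forward(self, x):  # x: (batch, total_len, d_model)
        mem = self.memory.expand(x.size(0), -1, -1)
        outs = []
        for seg in x.split(self.seg_len, dim=1):
            # Prepend memory tokens, encode, then split the output back
            # into next-step memory and the segment representations.
            h = self.encoder(torch.cat([mem, seg], dim=1))
            mem, seg_out = h[:, :self.n_mem], h[:, self.n_mem:]
            outs.append(seg_out)
        return torch.cat(outs, dim=1)

enc = RecurrentMemoryEncoder()
print(enc(torch.randn(2, 512, 64)).shape)  # torch.Size([2, 512, 64])
```

Because only the memory tokens cross segment boundaries, attention cost grows linearly with the number of segments rather than quadratically with total sequence length, which is what makes very long genomic inputs tractable.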
Researchers from MBZUAI have released MobiLlama, a fully transparent, open-source 0.5 billion parameter Small Language Model (SLM). MobiLlama is designed for resource-constrained devices, emphasizing strong performance with reduced resource demands. The full training data pipeline, code, model weights, and checkpoints are available on GitHub. Why it matters: Full transparency of the training pipeline and weights supports reproducible research and brings capable language models to on-device, resource-constrained settings.
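A 0.5B-parameter model is small enough to run on CPU, so a load-and-generate sketch is straightforward. We believe the published Hub id is "MBZUAI/MobiLlama-05B"; treat it as an assumption and confirm against the project's GitHub page.

```python
# Running MobiLlama on CPU via Transformers. The repo id is assumed;
# trust_remote_code is enabled in case the repo ships custom model code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MBZUAI/MobiLlama-05B"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Small language models are useful because", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```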
Technology Innovation Institute (TII) in Abu Dhabi, in collaboration with LightOn, has launched NOOR, a 10 billion parameter Arabic natural language processing (NLP) model. The model was trained on a large, high-quality cross-domain Arabic dataset including web data, books, poetry, news, and technical information. It enables applications in automated summarization, chatbots, and personalized marketing. Why it matters: NOOR represents a significant advancement in Arabic NLP, potentially enabling more sophisticated AI applications tailored to the Arabic language and regional needs.