Researchers at MBZUAI have released SlimPajama-DC, an empirical analysis of data combinations for pretraining LLMs using the SlimPajama dataset. The study examines the impact of global (across sources) vs. local (within a single source) deduplication, and of the proportions of highly deduplicated multi-source datasets in the training mix. Results show that increased data diversity after global deduplication is crucial, with the best configuration outperforming models trained on RedPajama.
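The difference between local and global deduplication can be sketched in a few lines of Python. This is a toy illustration, not the SlimPajama-DC pipeline: the source names, the documents, and the use of exact-match hashing are all assumptions made for the example, and the summary above does not say which matching method the study uses.

```python
from hashlib import sha256


def _key(doc):
    # Exact-match fingerprint; an assumption chosen for simplicity.
    return sha256(doc.encode("utf-8")).hexdigest()


def dedup_local(sources):
    """Remove duplicates within each source independently."""
    result = {}
    for name, docs in sources.items():
        seen = set()  # reset for every source
        kept = []
        for doc in docs:
            k = _key(doc)
            if k not in seen:
                seen.add(k)
                kept.append(doc)
        result[name] = kept
    return result


def dedup_global(sources):
    """Remove duplicates across all sources, keeping the first occurrence."""
    seen = set()  # shared by all sources
    result = {}
    for name, docs in sources.items():
        kept = []
        for doc in docs:
            k = _key(doc)
            if k not in seen:
                seen.add(k)
                kept.append(doc)
        result[name] = kept
    return result


# Hypothetical two-source corpus with one cross-source duplicate ("d e f").
sources = {
    "web": ["a b c", "a b c", "d e f"],
    "code": ["d e f", "g h i"],
}

local = dedup_local(sources)
global_ = dedup_global(sources)

print(sum(len(d) for d in local.values()))    # 4: cross-source duplicate survives
print(sum(len(d) for d in global_.values()))  # 3: cross-source duplicate removed
```

Local deduplication leaves the document shared by both sources in place, while global deduplication removes it, which changes how much each source contributes to the final mix.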
KAUST's Computational Bioscience Research Center (CBRC) held a Research Conference on Big Data Analyses in Evolutionary Biology. The conference focused on how large "omics" datasets, which require big data approaches to analyze, are changing evolutionary biology. Researchers discussed how computer science can contribute to biology and vice versa. Why it matters: Such interdisciplinary events at KAUST can foster innovation at the intersection of computational science and biology, advancing research in both fields.
KAUST held a research conference on Computational and Statistical Interface to Big Data from March 19 to 21. The conference covered topics such as data representation, visualization, parallel algorithms, and large-scale machine learning. Participants from institutions including the American University of Sharjah and Aalborg University came to exchange ideas. Why it matters: The conference highlights KAUST's focus on promoting big data research and collaboration to address challenges and opportunities in various scientific fields within the Kingdom and globally.
MBZUAI Professor Fakhri Karray and co-authors from the University of Waterloo have published "Elements of Dimensionality Reduction and Manifold Learning," a textbook on methods for extracting useful components from large datasets. The book addresses the "curse of dimensionality," where growth in the number of features (dimensions) in a dataset complicates its use in machine learning. Karray developed the material from a popular course he taught at Waterloo. Why it matters: The textbook provides a unified resource for students and researchers in machine learning and AI, addressing a foundational challenge in processing high-dimensional data, relevant to diverse applications in the region.
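One well-known symptom of the curse of dimensionality is that distances between random points become nearly indistinguishable as the number of dimensions grows. The sketch below illustrates this with the Python standard library; it is not taken from the textbook, and the function name `relative_contrast`, the uniform random data, and the sample sizes are assumptions chosen for the example.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (max - min) / min over distances from one query point
    to n_points uniform random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


low_dim = relative_contrast(2)
high_dim = relative_contrast(1000)

# In 2 dimensions the nearest point is far closer than the farthest;
# in 1000 dimensions the nearest and farthest are almost equally far away.
print(f"contrast in 2 dims:    {low_dim:.2f}")
print(f"contrast in 1000 dims: {high_dim:.2f}")
```

When the contrast shrinks like this, nearest-neighbor style methods lose discriminative power, which is one motivation for reducing dimensionality before learning.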
Holger Pirk from Imperial College London is developing a novel approach to data management system composition called BOSS. The system uses a homoiconic representation of data and code and partial evaluation of queries by components, drawing inspiration from compiler-construction research. BOSS achieves a fully composable design that effectively combines different data models, hardware platforms, and processing engines, enabling features like GPU acceleration and generative data cleaning with minimal overhead. Why it matters: This research on composable database systems can broaden the applicability of data management techniques in the GCC region, enabling more flexible and efficient data processing for various applications.
KAUST Professor Raul Tempone, an expert in Uncertainty Quantification (UQ), has been appointed as an Alexander von Humboldt Professor at RWTH Aachen University in Germany. This professorship will enable him to further his research on mathematics for uncertainty quantification with new collaborators. Tempone believes the KAUST Strategic Research Initiative for Uncertainty Quantification (SRI-UQ) contributed to this award. Why it matters: This appointment enhances KAUST's visibility and facilitates cross-fertilization between European and KAUST research groups, benefiting both institutions and attracting talent.
A new paper from MBZUAI researchers explores using ChatGPT to combat the spread of fake news. The researchers, including Preslav Nakov and Liangming Pan, demonstrate that ChatGPT can be used to fact-check published information. Their paper, "Fact-Checking Complex Claims with Program-Guided Reasoning," was accepted at ACL 2023. Why it matters: This research highlights the potential of large language models to address the growing challenge of misinformation, with implications for maintaining information integrity in the digital age.
Machine learning (ML) algorithms use data to make decisions or predictions, improving over time as more data is provided. ML is a subset of AI focused on models that learn from data, in contrast to rule-based systems. ML is superior in scenarios where rules cannot be made exhaustive, such as interpreting medical scans, but rule-based systems and ML often complement each other. Why it matters: This overview clarifies the role of machine learning within the broader field of AI, highlighting its data-driven approach and its advantages over traditional rule-based systems in complex decision-making scenarios.
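The contrast between a hand-written rule and a model learned from data can be shown with a deliberately tiny example. Everything here is hypothetical: the scores, the labels, the fixed cutoff of 38.0, and the one-parameter "model" (a learned threshold) are assumptions chosen to keep the sketch short, not a description of any real system.

```python
def rule_based(score):
    """Hand-written rule: a fixed cutoff chosen by a person."""
    return score >= 38.0


def fit_threshold(samples):
    """Learn the cutoff that makes the fewest mistakes on labeled examples."""
    best_t, best_errors = None, None
    for t in sorted(x for x, _ in samples):
        errors = sum((x >= t) != label for x, label in samples)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


# Hypothetical labeled data: (score, is_positive).
data = [
    (36.5, False), (36.9, False), (37.2, False),
    (37.6, True), (38.1, True), (39.0, True),
]

learned_t = fit_threshold(data)
rule_errors = sum(rule_based(x) != label for x, label in data)
learned_errors = sum((x >= learned_t) != label for x, label in data)

print(learned_t)       # 37.6
print(rule_errors)     # 1: the fixed rule misses the 37.6 example
print(learned_errors)  # 0: the learned cutoff fits all six examples
```

The fixed rule stays the same no matter what data arrives, while the learned cutoff moves as labeled examples are added, which is the sense in which ML "improves as more data is provided."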