Search

Results for "data combinations"

SlimPajama-DC: Understanding Data Combinations for LLM Training

arXiv · Sep 19

Researchers at MBZUAI release SlimPajama-DC, an empirical analysis of data combinations for pretraining LLMs using the SlimPajama dataset. The study examines the impact of global vs. local deduplication and the proportions of highly-deduplicated multi-source datasets. Results show that increased data diversity after global deduplication is crucial, with the best configuration outperforming models trained on RedPajama.
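To illustrate the global vs. local deduplication distinction mentioned above, here is a minimal sketch. It assumes exact string matching as a stand-in for whatever similarity method the study uses (the summary does not say), and the function and source names are hypothetical.

```python
from typing import Dict, List


def local_dedup(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop duplicates within each source independently (assumed exact match)."""
    result = {}
    for name, docs in sources.items():
        seen = set()  # reset per source: duplicates across sources survive
        kept = []
        for doc in docs:
            if doc not in seen:
                seen.add(doc)
                kept.append(doc)
        result[name] = kept
    return result


def global_dedup(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop duplicates across all sources; the first occurrence wins."""
    seen = set()  # shared across sources
    result = {}
    for name, docs in sources.items():
        kept = []
        for doc in docs:
            if doc not in seen:
                seen.add(doc)
                kept.append(doc)
        result[name] = kept
    return result


# Hypothetical toy corpus: "b" appears in both sources.
corpus = {"web": ["a", "b", "a"], "code": ["b", "c"]}
local_result = local_dedup(corpus)
global_result = global_dedup(corpus)
```

In this toy corpus, local deduplication keeps "b" in both sources, while global deduplication keeps it only in the first source that contains it.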

Interpretable Crisis Behavior Analysis Using Mobility and Social Media Data

arXiv · Jun 8

This paper introduces an interpretable pipeline that integrates mobility and social media data to analyze human behavior during crises. The framework was evaluated through two case studies, including a longitudinal analysis of UAE COVID-19 behavior from March 2020 to December 2021. The pipeline aligns heterogeneous daily signals, transforms them into binary behavioral states, applies Formal Concept Analysis (FCA) to extract co-occurrence structures, and mines association rules. Results demonstrate clear cross-domain behavioral structures in crises, yielding both scientifically credible and policy-actionable intelligence. Why it matters: This work provides a novel methodological approach for developing actionable crisis management strategies by fusing multimodal data, directly applicable to public health and emergency response in the UAE and the broader region.
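The pipeline steps described above (binarize daily signals, extract formal concepts, score association rules) can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: fixed-threshold binarization, the signal names, and all function names are assumptions.

```python
from typing import Dict, FrozenSet, List, Tuple

Context = Dict[str, FrozenSet[str]]  # day -> set of active binary states


def binarize(signals: Dict[str, Dict[str, float]],
             thresholds: Dict[str, float]) -> Context:
    """Turn daily numeric signals into binary states (assumed: value >= threshold)."""
    return {
        day: frozenset(name for name, value in values.items()
                       if value >= thresholds[name])
        for day, values in signals.items()
    }


def formal_concepts(context: Context) -> List[Tuple[FrozenSet[str], FrozenSet[str]]]:
    """Enumerate (extent, intent) pairs by closing day intents under intersection."""
    all_attrs = frozenset().union(*context.values())
    intents = {all_attrs}
    for attrs in context.values():
        intents |= {attrs & intent for intent in intents}
    return [
        (frozenset(day for day, attrs in context.items() if intent <= attrs), intent)
        for intent in intents
    ]


def confidence(context: Context, lhs: FrozenSet[str], rhs: FrozenSet[str]) -> float:
    """Confidence of the rule lhs -> rhs: support(lhs and rhs) / support(lhs)."""
    lhs_days = [attrs for attrs in context.values() if lhs <= attrs]
    if not lhs_days:
        return 0.0
    return sum(1 for attrs in lhs_days if rhs <= attrs) / len(lhs_days)


# Hypothetical daily signals from two domains.
signals = {
    "d1": {"mobility_drop": 0.6, "tweets": 120},
    "d2": {"mobility_drop": 0.7, "tweets": 150},
    "d3": {"mobility_drop": 0.1, "tweets": 130},
}
context = binarize(signals, {"mobility_drop": 0.5, "tweets": 100})
concepts = formal_concepts(context)
rule_conf = confidence(context, frozenset({"mobility_drop"}), frozenset({"tweets"}))
```

Each concept pairs a set of days with the states those days all share, which is the co-occurrence structure the rules are mined from.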

Examining how technology informs science

KAUST · Jan 23

KAUST's Computational Bioscience Research Center (CBRC) held a Research Conference on Big Data Analyses in Evolutionary Biology. The conference focused on the impact of large "omics" datasets on evolutionary biology, requiring big data approaches for analysis. Researchers discussed how computer science can contribute to biology and vice versa. Why it matters: Such interdisciplinary events at KAUST can foster innovation at the intersection of computational science and biology, advancing research in both fields.

Exploring science's fourth paradigm

KAUST · May 7

KAUST held a research conference on Computational and Statistical Interface to Big Data from March 19-21. The conference covered topics such as data representation, visualization, parallel algorithms, and large-scale machine learning. Participants from institutions including the American University of Sharjah and Aalborg University came to exchange ideas. Why it matters: The conference highlights KAUST's focus on promoting big data research and collaboration to address challenges and opportunities in scientific fields within the Kingdom and globally.

Overcoming the curse of dimensionality

MBZUAI

MBZUAI Professor Fakhri Karray and co-authors from the University of Waterloo have published "Elements of Dimensionality Reduction and Manifold Learning," a textbook on methods for extracting useful components from large datasets. The book addresses the "curse of dimensionality," where a growing number of dimensions (features) in a dataset makes it harder to use in machine learning. Karray developed the material from a popular course he taught at Waterloo. Why it matters: The textbook provides a unified resource for students and researchers in machine learning and AI, addressing a foundational challenge in processing high-dimensional data, relevant to diverse applications in the region.

Bring Your Own Kernel! Constructing High-Performance Data Management Systems from Components

MBZUAI

Holger Pirk from Imperial College London is developing a novel approach to data management system composition called BOSS. The system uses a homoiconic representation of data and code and partial evaluation of queries by components, drawing inspiration from compiler-construction research. BOSS achieves a fully composable design that effectively combines different data models, hardware platforms, and processing engines, enabling features like GPU acceleration and generative data cleaning with minimal overhead. Why it matters: This research on composable database systems can broaden the applicability of data management techniques in the GCC region, enabling more flexible and efficient data processing for various applications.

The role of data-driven models in quantifying uncertainty

KAUST · Jul 15

KAUST Professor Raul Tempone, an expert in Uncertainty Quantification (UQ), has been appointed as an Alexander von Humboldt Professor at RWTH Aachen University in Germany. This professorship will enable him to further his research on mathematics for uncertainty quantification with new collaborators. Tempone believes the KAUST Strategic Research Initiative for Uncertainty Quantification (SRI-UQ) contributed to this award. Why it matters: This appointment enhances KAUST's visibility and facilitates cross-fertilization between European and KAUST research groups, benefiting both institutions and attracting talent.
