GCC AI Research

Results for "data heterogeneity"

Data diagnostics: AI and statistics in computational biology and smart health

MBZUAI ·

MBZUAI's AI Quorum workshop featured Yale biostatistics professor Heping Zhang discussing the challenges of using AI and statistics to analyze noisy biological data for health insights. Zhang highlighted the need to develop methods to extract meaningful stories from noisy data to understand brain function and genetic roles in disease regulation. Harvard's Xihong Lin presented recommendations for building an ecosystem using AI and statistics to improve understanding of the relationship between genome sequences and biological functions. Why it matters: This discussion underscores the importance of AI and statistical methods in addressing the complexities of biological data, particularly in understanding neurological diseases like Alzheimer's, and highlights the need for centralized data infrastructure.

Building Planetary-Scale Collaborative Intelligence

MBZUAI ·

Sai Praneeth Karimireddy from UC Berkeley presented a talk on building planetary-scale collaborative intelligence, highlighting the challenges of using distributed data in machine learning due to data silos and ethical-legal restrictions. He proposed collaborative systems like federated learning as a solution to bring together distributed data while respecting privacy. The talk addressed the need for efficiency, reliability, and management of divergent goals in these systems, suggesting the use of tools from optimization, statistics, and economics. Why it matters: Collaborative AI systems can unlock valuable distributed data in the region, especially in sensitive sectors like healthcare, while ensuring privacy and addressing ethical concerns.
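The core mechanic of federated learning mentioned in the talk can be sketched in a few lines: clients train locally on private data and only model parameters travel to the server, which averages them. The sketch below is a minimal FedAvg-style loop on a least-squares problem, illustrative only (the function names and hyperparameters are our own, not from the talk):

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: a few gradient-descent steps on a
    least-squares loss, starting from the current global weights."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_averaging(clients, w, rounds=20):
    """FedAvg sketch: each round, every client trains locally and the
    server averages the resulting models, weighted by data size.
    Raw data never leaves the client."""
    for _ in range(rounds):
        updates, sizes = [], []
        for X, y in clients:
            updates.append(local_update(w, X, y))
            sizes.append(float(len(y)))
        w = np.average(updates, axis=0, weights=sizes)
    return w
```

Real systems add secure aggregation, compression, and handling of non-IID client data on top of this basic loop.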

Bring Your Own Kernel! Constructing High-Performance Data Management Systems from Components

MBZUAI ·

Holger Pirk from Imperial College London is developing a novel approach to data management system composition called BOSS. The system uses a homoiconic representation of data and code and partial evaluation of queries by components, drawing inspiration from compiler-construction research. BOSS achieves a fully composable design that effectively combines different data models, hardware platforms, and processing engines, enabling features like GPU acceleration and generative data cleaning with minimal overhead. Why it matters: This research on composable database systems can broaden the applicability of data management techniques in the GCC region, enabling more flexible and efficient data processing for various applications.
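The two ideas named above, a homoiconic representation (queries are plain data that components can rewrite) and partial evaluation (each engine reduces only the sub-expressions it understands), can be illustrated with a toy sketch. This is not BOSS's actual API; the expression heads and engines below are invented for illustration:

```python
# Queries are plain data: nested tuples such as ("Select", rows, predicate),
# so code and data share one representation and engines can rewrite expressions.

def eval_arith(expr):
    """A tiny 'engine' that partially evaluates arithmetic sub-expressions
    and leaves everything else untouched for the next component."""
    if isinstance(expr, tuple):
        head, *args = expr
        args = [eval_arith(a) for a in args]
        if head == "Plus" and all(isinstance(a, int) for a in args):
            return sum(args)
        return (head, *args)
    return expr

def eval_select(expr):
    """A second engine: executes Select over an in-memory relation once its
    predicate has been reduced to a constant threshold."""
    if isinstance(expr, tuple) and expr[0] == "Select":
        _, rows, (op, col, threshold) = expr
        if op == "Gt" and isinstance(threshold, int):
            return [r for r in rows if r[col] > threshold]
    return expr

query = ("Select", [{"x": 1}, {"x": 5}, {"x": 9}], ("Gt", "x", ("Plus", 2, 2)))
result = eval_select(eval_arith(query))  # engines compose via the shared representation
```

Because every engine consumes and produces the same expression form, swapping in a GPU engine or a data-cleaning component is just another rewrite stage, which is the composability the summary describes.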

PDNS-Net: A Large Heterogeneous Graph Benchmark Dataset of Network Resolutions for Graph Learning

arXiv ·

The Qatar Computing Research Institute (QCRI) has introduced PDNS-Net, a large heterogeneous graph dataset for malicious domain classification, containing 447K nodes and 897K edges. It is significantly larger than existing heterogeneous graph datasets like IMDB and DBLP. Preliminary evaluations using graph neural networks indicate that further research is needed to improve model performance on large heterogeneous graphs. Why it matters: This dataset will enable researchers to develop and benchmark graph learning algorithms on a scale relevant to real-world cybersecurity applications, particularly for identifying and mitigating malicious online activity.
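A heterogeneous graph like PDNS-Net has typed nodes (e.g. domains and IPs) and typed edges, and graph learning on it must respect those types. The sketch below shows one relation-aware message-passing step over a toy graph; the node types, edge type, and features are illustrative stand-ins, not PDNS-Net's actual schema:

```python
import numpy as np

# Toy heterogeneous graph in PDNS-Net's spirit: typed nodes and a typed
# ("domain", "resolves", "ip") edge set given as (src_index, dst_index) pairs.
features = {
    "domain": np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    "ip":     np.array([[0.5, 0.5], [2.0, 0.0]]),
}
edges = {("domain", "resolves", "ip"): [(0, 0), (1, 0), (2, 1)]}

def relation_mean_pass(features, edges):
    """One relation-aware message-passing step: each destination node
    averages the features of its neighbours under each edge type."""
    out = {t: f.copy() for t, f in features.items()}
    for (src_t, _, dst_t), pairs in edges.items():
        sums = np.zeros_like(features[dst_t])
        counts = np.zeros(len(features[dst_t]))
        for s, d in pairs:
            sums[d] += features[src_t][s]
            counts[d] += 1
        mask = counts > 0
        out[dst_t][mask] = sums[mask] / counts[mask][:, None]
    return out

updated = relation_mean_pass(features, edges)
```

At PDNS-Net's scale (447K nodes, 897K edges) the same per-relation aggregation is what heterogeneous GNN layers compute, just with learned transformations and sparse-matrix kernels instead of Python loops.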

On Transferability of Machine Learning Models

MBZUAI ·

This article discusses domain shift in machine learning, where the test data distribution differs from the training distribution, and methods to mitigate it through domain adaptation and domain generalization. Domain adaptation uses labeled source data together with unlabeled target data; domain generalization uses labeled data from one or more source domains to generalize to entirely unseen target domains. Why it matters: Research in mitigating domain shift enhances the robustness and applicability of AI models in diverse real-world scenarios.
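One classic domain-adaptation baseline in this space is correlation alignment (CORAL): transform source features so their second-order statistics match the target's. The sketch below (a generic baseline, not necessarily a method from the article; the mean-matching step is our addition to the covariance alignment) implements it with numpy:

```python
import numpy as np

def coral_align(source, target, eps=1e-6):
    """CORAL-style alignment: whiten the source features, re-colour them
    with the target covariance, then match the target mean."""
    cs = np.cov(source, rowvar=False) + eps * np.eye(source.shape[1])
    ct = np.cov(target, rowvar=False) + eps * np.eye(target.shape[1])

    def sqrtm(m, inv=False):
        # matrix square root via eigendecomposition (covariances are symmetric PSD)
        vals, vecs = np.linalg.eigh(m)
        vals = np.clip(vals, eps, None)
        return vecs @ np.diag(vals ** (-0.5 if inv else 0.5)) @ vecs.T

    centered = source - source.mean(axis=0)
    aligned = centered @ sqrtm(cs, inv=True) @ sqrtm(ct)
    return aligned + target.mean(axis=0)
```

A classifier trained on the aligned source features then sees inputs whose first- and second-order statistics match the target domain, which is often enough to recover much of the accuracy lost to covariate shift.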

Duet: efficient and scalable hybriD neUral rElation undersTanding

arXiv ·

The paper introduces Duet, a hybrid neural relation understanding method for cardinality estimation. Duet addresses limitations of existing learned methods, such as high costs and scalability issues, by incorporating predicate information into an autoregressive model. Experiments demonstrate Duet's efficiency, accuracy, and scalability, even outperforming GPU-based methods on CPU.
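The autoregressive idea behind such estimators is to factorize the joint distribution column by column, P(a, b) = P(a) · P(b | a), and sum those factors over values satisfying the predicate. The toy sketch below illustrates the factorization only; it uses exact conditionals from counts as a stand-in for Duet's learned model, and the predicate-skipping is a simplification of how predicate information is actually incorporated:

```python
from collections import Counter

# Toy two-column table; each row is a tuple (a, b).
table = [(1, "x"), (1, "y"), (2, "x"), (2, "x"), (3, "y")]

def estimate_cardinality(table, pred_a, pred_b):
    """Autoregressive cardinality estimate: sum P(a) * P(b | a) over
    value combinations matching the predicate, then scale by table size."""
    n = len(table)
    count_a = Counter(a for a, _ in table)
    joint = Counter(table)
    total = 0.0
    for a in count_a:
        if not pred_a(a):
            continue  # predicate-aware: skip column values ruled out early
        for (a2, b), c in joint.items():
            if a2 == a and pred_b(b):
                # P(a) * P(b | a) = count(a)/n * count(a,b)/count(a)
                total += (count_a[a] / n) * (c / count_a[a])
    return total * n

est = estimate_cardinality(table, lambda a: a <= 2, lambda b: b == "x")
```

With exact conditionals the estimate is exact; the point of learned estimators is to replace these counts with a compact neural model that generalizes to unseen predicates.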

A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation

arXiv ·

This paper introduces a unified deep autoregressive model (UAE) for cardinality estimation that learns joint data distributions from both data and query workloads. It uses differentiable progressive sampling with the Gumbel-Softmax trick to incorporate supervised query information into the deep autoregressive model. Experiments show UAE achieves better accuracy and efficiency compared to state-of-the-art methods.
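The Gumbel-Softmax trick mentioned here replaces a hard categorical sample with a differentiable relaxation, so gradients from query-based supervision can flow through sampled column values. A minimal numpy sketch of the relaxation itself (the sampling primitive, not UAE's full progressive-sampling pipeline):

```python
import numpy as np

def gumbel_softmax(logits, temperature=0.5, rng=None):
    """Differentiable relaxation of categorical sampling: add Gumbel noise
    to the logits, then take a temperature-scaled softmax. As the
    temperature approaches 0, the output approaches a one-hot sample."""
    rng = rng or np.random.default_rng()
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / temperature
    y = np.exp(y - y.max())       # numerically stable softmax
    return y / y.sum()

sample = gumbel_softmax(np.array([2.0, 0.5, 0.1]))
```

Because the output is a smooth function of the logits (the noise is independent of them), the sampling step no longer blocks backpropagation, which is what lets UAE train on query workloads end to end.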

Enabling Fast, Robust, and Personalized Federated Learning

MBZUAI ·

A talk at MBZUAI discussed federated learning, a distributed machine learning approach that trains models across devices while keeping data local. The presentation covered a straggler-resilient federated learning scheme using adaptive node participation to tackle system heterogeneity. It also presented a robust optimization formulation for addressing data heterogeneity and a new algorithm for personalizing learned models. Why it matters: Federated learning is crucial for AI applications involving decentralized data sources, and research on improving its robustness and personalization is essential for real-world deployment in the region.
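Personalization in federated learning addresses the fact that, under data heterogeneity, a single global model fits no client particularly well. One common baseline (illustrative only, not necessarily the talk's algorithm) is for each client to fine-tune the shared global model on its own data:

```python
import numpy as np

def sgd_steps(w, X, y, lr=0.1, steps=50):
    """Gradient descent on a local least-squares loss."""
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

def personalize(global_w, client_data, steps=50):
    """Simple personalization baseline: each client fine-tunes the shared
    global model on its own data, trading some global knowledge for a
    better local fit under heterogeneous data."""
    return [sgd_steps(global_w.copy(), X, y, steps=steps) for X, y in client_data]
```

More sophisticated schemes interpolate between local and global objectives or share only some layers, but the tension is the same: the more a model adapts to one client, the less it benefits from the others' data.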