This paper introduces Cross-Document Topic-Aligned (CDTA) chunking to address knowledge fragmentation in Retrieval-Augmented Generation (RAG) systems. CDTA identifies topics across documents, maps segments to topics, and synthesizes them into unified chunks. Experiments on HotpotQA and UAE legal texts show that CDTA improves faithfulness and citation accuracy compared to existing chunking methods, especially for complex queries requiring multi-hop reasoning.
Researchers at MBZUAI have introduced QRAFT, an LLM-based framework designed to automate the generation of fact-checking articles. The system mimics the writing workflow of human fact-checkers, aiming to bridge the gap between automated fact-checking systems and public dissemination. While QRAFT outperforms existing text-generation methods, it still falls short of expert-written articles, highlighting areas for further research.
This paper introduces BRIQA, a new method for automated assessment of artifact severity in pediatric brain MRI, which is important for diagnostic accuracy. BRIQA uses gradient-based loss reweighting and a rotating batching scheme to handle class imbalance in artifact severity levels. Experiments show BRIQA improves average macro F1 score from 0.659 to 0.706, especially for Noise, Zipper, Positioning and Contrast artifacts.
The paper introduces Duet, a hybrid neural relation understanding method for cardinality estimation. Duet addresses limitations of existing learned methods, such as high costs and scalability issues, by incorporating predicate information into an autoregressive model. Experiments demonstrate Duet's efficiency, accuracy, and scalability, even outperforming GPU-based methods on CPU.
The paper introduces the Prism Hypothesis, which posits a correspondence between an encoder's feature spectrum and its functional role, with semantic encoders capturing low-frequency components and pixel encoders retaining high-frequency information. Based on this, the authors propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details using a frequency-band modulator. Experiments on ImageNet and MS-COCO demonstrate that UAE effectively unifies semantic abstraction and pixel-level fidelity, achieving state-of-the-art performance.
This paper introduces a unified deep autoregressive model (UAE) for cardinality estimation that learns joint data distributions from both data and query workloads. It uses differentiable progressive sampling with the Gumbel-Softmax trick to incorporate supervised query information into the deep autoregressive model. Experiments show UAE achieves better accuracy and efficiency compared to state-of-the-art methods.
This paper introduces a novel fuzzy clustering method for circular time series based on a new dependence measure that considers circular arcs. The algorithm groups series generated from similar stochastic processes and demonstrates computational efficiency. The method is applied to time series of wind direction in Saudi Arabia, showcasing its practical potential.
The ArabJobs dataset is a new corpus of over 8,500 Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the UAE. The dataset contains over 550,000 words and captures linguistic, regional, and socio-economic variation in the Arab labor market. It is available on GitHub and can be used for fairness-aware Arabic NLP and labor market research.