This study analyzes how data science vocabulary has evolved, using 16,018 abstracts that contain the phrase "data science" and span 13 years. It tracks when new vocabulary is introduced and how it is integrated into the scientific literature, using exploratory data analysis (EDA), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and N-grams. The research compares scientific publications overall with those specific to Saudi Arabia and identifies representative articles based on vocabulary usage. Why it matters: The work provides insights into the development of data science terminology and its adoption within the Saudi Arabian research landscape.
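As a rough illustration of the N-gram part of such an analysis, the sketch below records the first year in which each word bigram appears, which is one simple way to flag newly introduced vocabulary. The function names and the two toy abstracts are invented for this example; the study's actual preprocessing and corpus are not reproduced here.

```python
from collections import Counter  # handy if per-year frequencies are also wanted

def ngrams(text, n):
    """Return the word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def first_appearance(abstracts_by_year, n=2):
    """Map each n-gram to the earliest year in which it appears."""
    first_seen = {}
    for year in sorted(abstracts_by_year):
        for abstract in abstracts_by_year[year]:
            for gram in ngrams(abstract, n):
                # setdefault keeps the earliest year because years are sorted
                first_seen.setdefault(gram, year)
    return first_seen

# Toy abstracts, made up for illustration only
corpus = {
    2012: ["data science uses statistical models"],
    2018: ["data science uses deep learning"],
}

first_seen = first_appearance(corpus)
new_in_2018 = sorted(g for g, y in first_seen.items() if y == 2018)
print(new_in_2018)
```

A real pipeline would likely add stop-word removal and a minimum frequency threshold so that one-off phrases are not counted as new vocabulary.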
Researchers developed a semantic search tool for the Quran using Arabic NLP techniques. The tool was trained on a dataset of over 30 tafsirs (interpretations) of the Quran. Using the SNxLM model and cosine similarity, the tool identifies Quranic verses most relevant to a user's query, achieving a similarity score of up to 0.97. Why it matters: This tool could significantly improve access to the Quran's teachings for Arabic speakers and researchers, providing a valuable resource for religious study and understanding.
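The retrieval step described above can be sketched as ranking verse embeddings by cosine similarity to a query embedding. This is a minimal sketch under stated assumptions: the three-dimensional vectors and the verse identifiers are placeholders invented for the example, and the embedding model itself is not shown.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, verse_vecs, top_k=3):
    """Return the top_k (verse_id, score) pairs, highest similarity first."""
    scored = [(vid, cosine_similarity(query_vec, vec)) for vid, vec in verse_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Placeholder embeddings; a real system would obtain these from a language model
verse_vecs = {
    "verse_a": [1.0, 0.0, 0.0],
    "verse_b": [0.6, 0.8, 0.0],
    "verse_c": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.1, 0.0]

results = rank_by_similarity(query_vec, verse_vecs, top_k=2)
for verse_id, score in results:
    print(verse_id, round(score, 3))
```

Because cosine similarity depends only on direction, not vector length, a score near 1.0 means the query and verse embeddings point almost the same way.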
This paper introduces Cross-Document Topic-Aligned (CDTA) chunking to address knowledge fragmentation in Retrieval-Augmented Generation (RAG) systems. CDTA identifies topics across documents, maps segments to topics, and synthesizes them into unified chunks. Experiments on HotpotQA and UAE legal texts show that CDTA improves faithfulness and citation accuracy compared to existing chunking methods, especially for complex queries requiring multi-hop reasoning.
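The three stages named above (identify topics, map segments to topics, merge per topic) can be illustrated with a deliberately simplified sketch. It is not the paper's method: topics are hard-coded keyword sets rather than discovered, and "synthesis" is plain concatenation. `TOPIC_KEYWORDS`, the function names, and the sample segments are all hypothetical.

```python
from collections import defaultdict

# Hypothetical topics; stand-in for whatever topic identification the paper uses
TOPIC_KEYWORDS = {
    "termination": {"termination", "dismissal"},
    "leave": {"leave", "vacation"},
}

def assign_topic(segment):
    """Pick the topic whose keywords overlap most with the segment, or None."""
    words = set(segment.lower().replace(".", " ").replace(",", " ").split())
    best_topic, best_overlap = None, 0
    for topic, keywords in TOPIC_KEYWORDS.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best_topic, best_overlap = topic, overlap
    return best_topic

def build_topic_chunks(documents):
    """Group segments from all documents by topic and merge each group into one chunk."""
    grouped = defaultdict(list)
    for doc_id, segments in documents.items():
        for segment in segments:
            topic = assign_topic(segment)
            if topic is not None:
                grouped[topic].append((doc_id, segment))
    return {
        topic: {
            "text": " ".join(seg for _, seg in items),     # naive stand-in for synthesis
            "sources": sorted({doc for doc, _ in items}),  # kept so answers can cite documents
        }
        for topic, items in grouped.items()
    }

# Invented sample text, not real legal content
documents = {
    "doc_1": ["Annual leave is 30 days.", "Termination requires written notice."],
    "doc_2": ["Dismissal without notice is limited to listed cases.",
              "Sick leave requires a medical report."],
}

chunks = build_topic_chunks(documents)
print(chunks["termination"]["sources"])
```

Keeping the source document identifiers on each chunk is one plausible way to support citation accuracy, since a retrieved chunk can then point back to every document it draws from.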
Hassan Sajjad from Dalhousie University presented research on exploring the latent space of AI models to assess their safety and trustworthiness. He discussed use cases where analyzing the latent space helps explain the robustness-generalization tradeoff in adversarial training and evaluate language comprehension. By examining model internals, Sajjad's work aims to build better AI models and increase trust in their capabilities. Why it matters: Intrinsic evaluation of model internals is likely to become important for improving AI safety and robustness.
This paper introduces a mutually-regularized dual collaborative variational auto-encoder (MD-CVAE) for recommendation systems, addressing the limitations of user-oriented auto-encoders (UAEs) in handling sparse ratings and new items. MD-CVAE integrates item content and user ratings within a variational framework, regularizing UAE weights with item content to avoid non-optimal convergence. A symmetric inference strategy eliminates the need for retraining when introducing new items, enhancing efficiency in dynamic recommendation scenarios. Why it matters: The MD-CVAE approach offers a practical solution for improving recommendation accuracy and efficiency, especially in scenarios with data sparsity and frequent item updates, relevant to e-commerce and content platforms in the Middle East.
The Inception Team presented a system for Semantic Question Similarity in Arabic as part of NSURL 2019 Task 8. The system explores several methods for determining whether two Arabic questions are semantically similar. The team's best result came from an ensemble built on a pre-trained multilingual BERT model, which achieved an F1 score of 95.924% and ranked first among nine participating teams. Why it matters: This demonstrates strong performance on a key Arabic NLP task, advancing the state of the art in semantic understanding for the language.