GCC AI Research

Results for "Dataset"

TII-SSRC-23 Dataset: Typological Exploration of Diverse Traffic Patterns for Intrusion Detection

arXiv

Researchers introduce TII-SSRC-23, a new network intrusion detection dataset designed to better represent the diversity of modern network traffic when training machine learning models. The dataset covers a range of traffic types and subtypes to address the limitations of existing datasets. The authors also provide a feature importance analysis and baseline experiments for supervised and unsupervised intrusion detection.

Universal Adversarial Examples in Remote Sensing: Methodology and Benchmark

arXiv

This paper introduces a novel black-box adversarial attack method, Mixup-Attack, to generate universal adversarial examples for remote sensing data. The method identifies common vulnerabilities in neural networks by attacking features in the shallow layer of a surrogate model. The authors also present UAE-RS, the first dataset of black-box adversarial samples in remote sensing, to benchmark the robustness of deep learning models against adversarial attacks.

SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs

arXiv

The Qatar Computing Research Institute (QCRI) has released SpokenNativQA, a multilingual spoken question-answering dataset for evaluating LLMs in conversational settings. The dataset contains 33,000 naturally spoken questions and answers across multiple languages, including low-resource and dialect-rich languages. It aims to address the limitations of text-based QA datasets by incorporating speech variability, accents, and linguistic diversity. Why it matters: This benchmark enables more robust evaluation of LLMs in speech-based interactions, particularly for Arabic dialects and other low-resource languages.

Continuous Saudi Sign Language Recognition: A Vision Transformer Approach

arXiv

Researchers introduce KAU-CSSL, the first continuous Saudi Sign Language (SSL) dataset focusing on complete sentences. They propose a transformer-based model that uses ResNet-18 for spatial feature extraction and a Transformer Encoder with a Bidirectional LSTM to capture temporal dependencies. The model achieved 99.02% accuracy in signer-dependent mode, where test signers also appear in training, and 77.71% in signer-independent mode, where test signers are unseen during training. The work aims to advance communication tools for the SSL community.

JobArabi: An Arabic Corpus and Analysis of Job Announcements from Social Media

arXiv

Researchers have introduced JobArabi, a large-scale corpus of 20,528 Arabic job announcements collected from X between January 2024 and October 2025. The dataset was compiled using a linguistically informed query framework covering a variety of Arabic recruitment expressions, and it includes metadata such as timestamps and geolocation for detailed analysis. Quantitative analysis of JobArabi reveals sociolinguistic patterns, including persistent gendered hiring language, regional variation in occupational demand, and emotional framing in recruitment messages. Why it matters: This corpus is a resource for research in Arabic NLP, computational social science, and digital labor studies, offering insight into labor market communication and linguistic change in the Arab world.