Skip to content
GCC AI Research

Search

Results for "dataset"

TII-SSRC-23 Dataset: Typological Exploration of Diverse Traffic Patterns for Intrusion Detection

arXiv ·

Researchers introduce TII-SSRC-23, a new network intrusion detection dataset designed to improve the diversity and representation of modern network traffic for machine learning models. The dataset includes a range of traffic types and subtypes to address the limitations of existing datasets. Feature importance analysis and baseline experiments for supervised and unsupervised intrusion detection are also provided.

Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

arXiv ·

A new culturally inclusive and linguistically diverse dataset called Palm for Arabic LLMs is introduced, covering 22 Arab countries and featuring instructions in both Modern Standard Arabic (MSA) and dialectal Arabic (DA) across 20 topics. The dataset was built through a year-long community-driven project involving 44 researchers from across the Arab world. Evaluation of frontier LLMs using the dataset reveals limitations in cultural and dialectal understanding, with some countries being better represented than others.

ArabJobs: A Multinational Corpus of Arabic Job Ads

arXiv ·

The ArabJobs dataset is a new corpus of over 8,500 Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the UAE. The dataset contains over 550,000 words and captures linguistic, regional, and socio-economic variation in the Arab labor market. It is available on GitHub and can be used for fairness-aware Arabic NLP and labor market research.

Universal Adversarial Examples in Remote Sensing: Methodology and Benchmark

arXiv ·

This paper introduces a novel black-box adversarial attack method, Mixup-Attack, to generate universal adversarial examples for remote sensing data. The method identifies common vulnerabilities in neural networks by attacking features in the shallow layer of a surrogate model. The authors also present UAE-RS, the first dataset of black-box adversarial samples in remote sensing, to benchmark the robustness of deep learning models against adversarial attacks.

AlexU-Word: A New Dataset for Isolated-Word Closed-Vocabulary Offline Arabic Handwriting Recognition

arXiv ·

Researchers from Alexandria University introduce AlexU-Word, a new dataset for offline Arabic handwriting recognition. The dataset contains 25,114 samples of 109 unique Arabic words, covering all letter shapes, collected from 907 writers. The dataset is designed for closed-vocabulary word recognition and to support segmented letter recognition-based systems. Why it matters: This dataset can help advance Arabic handwriting recognition systems, addressing a need for high-quality Arabic datasets in NLP research.

Race Against the Machine: a Fully-annotated, Open-design Dataset of Autonomous and Piloted High-speed Flight

arXiv ·

Researchers at the Technology Innovation Institute (TII) have released a fully-annotated dataset for autonomous drone racing, called "Race Against the Machine." The dataset includes high-resolution visual, inertial, and motion capture data from both autonomous and piloted flights, along with commands, control inputs, and corner-level labeling of drone racing gates. The specifications to recreate their flight platform using commercial off-the-shelf components and the Betaflight controller are also released. Why it matters: This comprehensive resource aims to support the development of new methods and establish quantitative comparisons for approaches in robotics and AI, democratizing drone racing research.

PDNS-Net: A Large Heterogeneous Graph Benchmark Dataset of Network Resolutions for Graph Learning

arXiv ·

The Qatar Computing Research Institute (QCRI) has introduced PDNS-Net, a large heterogeneous graph dataset for malicious domain classification, containing 447K nodes and 897K edges. It is significantly larger than existing heterogeneous graph datasets like IMDB and DBLP. Preliminary evaluations using graph neural networks indicate that further research is needed to improve model performance on large heterogeneous graphs. Why it matters: This dataset will enable researchers to develop and benchmark graph learning algorithms on a scale relevant to real-world cybersecurity applications, particularly for identifying and mitigating malicious online activity.

Commonsense Reasoning in Arab Culture

arXiv ·

A new dataset called ArabCulture is introduced to address the lack of culturally relevant commonsense reasoning resources in Arabic AI. The dataset covers 13 countries across the Gulf, Levant, North Africa, and the Nile Valley, spanning 12 daily life domains with 54 fine-grained subtopics. It was built from scratch by native speakers writing and validating culturally relevant questions. Why it matters: The dataset highlights the need for more culturally aware models and benchmarks tailored to the Arabic-speaking world, moving beyond machine-translated resources.