GCC AI Research

Results for "Spider dataset"

AraSpider: Democratizing Arabic-to-SQL

arXiv

The study introduces AraSpider, the first Arabic version of the Spider dataset, to advance Arabic NLP. Four multilingual translation models and two text-to-SQL models (ChatGPT 3.5 and SQLCoder) were evaluated. Back translation significantly improved the performance of both ChatGPT 3.5 and SQLCoder on the AraSpider dataset. Why it matters: This work democratizes access to text-to-SQL resources for Arabic speakers and provides a methodology for translating datasets to other languages.

The Saudi Privacy Policy Dataset

arXiv

The Saudi Privacy Policy Dataset is a new collection of Arabic privacy policies from various sectors in Saudi Arabia. It is annotated against the 10 principles of the Personal Data Protection Law (PDPL) and covers 1,000 websites, 4,638 lines of text, and 775,370 tokens. The dataset aims to facilitate research and development in privacy policy analysis, NLP, and machine learning applications related to data protection.

Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

arXiv

MBZUAI researchers introduce Web2Code, a new large-scale dataset and evaluation framework for training and benchmarking multimodal LLMs on webpage understanding and HTML code generation. The dataset includes webpage images, HTML code, and QA pairs about webpage content. Experiments demonstrate the dataset's utility in webpage understanding, code generation, and general visual domain tasks, with code and data available on GitHub.

SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs

arXiv

The Qatar Computing Research Institute (QCRI) has released SpokenNativQA, a multilingual spoken question-answering dataset for evaluating LLMs in conversational settings. The dataset contains 33,000 naturally spoken questions and answers across multiple languages, including low-resource and dialect-rich languages. It aims to address the limitations of text-based QA datasets by incorporating speech variability, accents, and linguistic diversity. Why it matters: This benchmark enables more robust evaluation of LLMs in speech-based interactions, particularly for Arabic dialects and other low-resource languages.

PDNS-Net: A Large Heterogeneous Graph Benchmark Dataset of Network Resolutions for Graph Learning

arXiv

The Qatar Computing Research Institute (QCRI) has introduced PDNS-Net, a large heterogeneous graph dataset for malicious domain classification, containing 447K nodes and 897K edges. It is significantly larger than existing heterogeneous graph datasets like IMDB and DBLP. Preliminary evaluations using graph neural networks indicate that further research is needed to improve model performance on large heterogeneous graphs. Why it matters: This dataset will enable researchers to develop and benchmark graph learning algorithms on a scale relevant to real-world cybersecurity applications, particularly for identifying and mitigating malicious online activity.

101 Billion Arabic Words Dataset

arXiv

Researchers compiled a 101 Billion Arabic Words Dataset by mining text from Common Crawl WET files and rigorously cleaning and deduplicating the extracted content. The dataset aims to address the scarcity of original, high-quality Arabic linguistic data, which often leads to bias in Arabic LLMs that rely on translated English data. The authors describe it as the largest Arabic dataset available to date. Why it matters: The new dataset can significantly contribute to the development of authentic Arabic LLMs that are more linguistically and culturally accurate.

Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

arXiv

Palm is a new culturally inclusive and linguistically diverse dataset for Arabic LLMs, covering 22 Arab countries and featuring instructions in both Modern Standard Arabic (MSA) and dialectal Arabic (DA) across 20 topics. The dataset was built through a year-long community-driven project involving 44 researchers from across the Arab world. Evaluation of frontier LLMs using the dataset reveals limitations in cultural and dialectal understanding, with some countries being better represented than others.

SlimPajama-DC: Understanding Data Combinations for LLM Training

arXiv

Researchers at MBZUAI release SlimPajama-DC, an empirical analysis of data combinations for pretraining LLMs using the SlimPajama dataset. The study examines the impact of global vs. local deduplication and the proportions of highly-deduplicated multi-source datasets. Results show that increased data diversity after global deduplication is crucial, with the best configuration outperforming models trained on RedPajama.