GCC AI Research

AraSpider: Democratizing Arabic-to-SQL

arXiv · Notable

Summary

The study introduces AraSpider, the first Arabic version of the Spider text-to-SQL dataset, to advance Arabic NLP. The authors evaluated four multilingual translation models and two text-to-SQL models (ChatGPT 3.5 and SQLCoder). Back-translation significantly improved the performance of both ChatGPT 3.5 and SQLCoder on AraSpider. Why it matters: This work democratizes access to text-to-SQL resources for Arabic speakers and provides a methodology for translating datasets into other languages.

Related

NativQA: Multilingual Culturally-Aligned Natural Query for LLMs

arXiv

The paper introduces NativQA, a language-independent framework for constructing culturally and regionally aligned QA datasets in native languages. Using the framework, the authors created MultiNativQA, a multilingual natural QA dataset of ~64k manually annotated QA pairs in seven languages. The dataset contains queries from native speakers in 9 regions across 18 topics, and is designed for evaluating and tuning LLMs. Why it matters: The framework and dataset enable the creation of more culturally relevant and effective LLMs for diverse linguistic communities, including those in the Middle East.

ALARB: An Arabic Legal Argument Reasoning Benchmark

arXiv

Researchers introduce ALARB, a new benchmark for evaluating reasoning in Arabic LLMs using 13K Saudi commercial court cases. The benchmark includes tasks such as verdict prediction, reasoning chain completion, and identification of relevant regulations. Instruction-tuning a 12B-parameter model on ALARB achieves performance comparable to GPT-4o in verdict prediction and generation.

ArabicaQA: A Comprehensive Dataset for Arabic Question Answering

arXiv

Researchers introduce ArabicaQA, a large-scale dataset for Arabic question answering, comprising 89,095 answerable and 3,701 unanswerable questions. They also present AraDPR, a dense passage retrieval model trained on the Arabic Wikipedia. The paper includes benchmarking of large language models (LLMs) for Arabic question answering. Why it matters: This work addresses a significant gap in Arabic NLP resources and provides valuable tools and benchmarks for advancing research in the field.

A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation

arXiv

This paper introduces a unified deep autoregressive model (UAE) for cardinality estimation that learns joint data distributions from both data and query workloads. It uses differentiable progressive sampling with the Gumbel-Softmax trick to incorporate supervised query information into the deep autoregressive model. Experiments show UAE achieves better accuracy and efficiency than state-of-the-art methods.