Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NL2SQL for BI dataset
geetu040/nl2sql-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
JasperHaozhe/NL2SQL-Queries dataset hosted on Hugging Face and contributed by the HF Datasets community
BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) is a pioneering, cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing. BIRD contains over 12,751 unique question-SQL pairs and 95 large databases with a total size of 33.4 GB. It covers more than 37 professional domains, such as blockchain, hockey, healthcare, and education.
SEDE is a dataset of 12,023 complex and diverse SQL queries paired with their natural language titles and descriptions, written by real users of the Stack Exchange Data Explorer in the course of natural interaction. These pairs contain a variety of real-world challenges that have rarely been reflected in other semantic parsing datasets. The goal of this dataset is to take a significant step towards evaluating Text-to-SQL models in a real-world setting. Compared to other Text-to-SQL datasets, SEDE contains at least 10 times more SQL query templates (queries after canonization and anonymization of values), and has the most diverse set of utterances and SQL queries (in terms of 3-grams) of all single-domain datasets. SEDE introduces real-world challenges such as under-specification, use of parameters in queries, and date manipulation.
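The two measures mentioned in the SEDE description, query templates and 3-gram diversity, can be illustrated with a small sketch. This is not SEDE's actual canonization procedure, which the description does not spell out; the helper names `to_template` and `distinct_trigrams`, the `<STR>`/`<NUM>` placeholders, and the regex-based masking are all assumptions made for illustration.

```python
import re


def to_template(sql: str) -> str:
    """Rough template: lowercase, mask literal values, collapse whitespace.

    Hypothetical helper; a real canonizer would parse the SQL instead of
    relying on regular expressions.
    """
    s = sql.lower()
    s = re.sub(r"'(?:[^']|'')*'", "<STR>", s)      # single-quoted string literals
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "<NUM>", s)   # standalone integer/decimal literals
    return re.sub(r"\s+", " ", s).strip()


def distinct_trigrams(texts):
    """Set of distinct whitespace-token 3-grams across all texts."""
    grams = set()
    for text in texts:
        toks = text.split()
        grams.update(zip(toks, toks[1:], toks[2:]))
    return grams


queries = [
    "SELECT name FROM users WHERE age > 30 AND city = 'Paris'",
    "select name  from users where age > 45 and city = 'Rome'",
]
templates = {to_template(q) for q in queries}
print(len(templates))  # both queries collapse to the same template
print(len(distinct_trigrams(to_template(q) for q in queries)))
```

Under these assumptions, queries that differ only in literal values map to one template, and the size of the 3-gram set gives a crude diversity count that can be compared across datasets.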
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for Spider
Dataset Summary
Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases.
Supported Tasks and Leaderboards
The leaderboard can be seen at https://yale-lily.github.io/spider
Languages
The text in the dataset is in English.
Dataset Structure
Data… See the full description on the dataset page: https://huggingface.co/datasets/xlangai/spider.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning
Peixian Ma (1,2)
Xialie Zhuang (1,3)
Chengjin Xu (1,4)
Xuhui Jiang (1,4)
Ran Chen (1)
Jian Guo (1)
(1) IDEA Research, International Digital Economy Academy
(2) The Hong Kong University of Science and Technology (Guangzhou)
(3) University of Chinese Academy of Sciences
(4) DataArc Tech Ltd.
📖 Overview
Natural Language to SQL (NL2SQL) enables intuitive interactions… See the full description on the dataset page: https://huggingface.co/datasets/MPX0222forHF/SynSQL-Complex-5K.
https://choosealicense.com/licenses/other/
Dataset Summary
NSText2SQL is the dataset used to train NSQL models. The data is curated from more than 20 different public sources across the web with permissive licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques, including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 samples of text-to-SQL pairs. For more… See the full description on the dataset page: https://huggingface.co/datasets/NumbersStation/NSText2SQL.