Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for Spider
Dataset Summary
Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases.
Supported Tasks and Leaderboards
The leaderboard can be seen at https://yale-lily.github.io/spider
Languages
The text in the dataset is in English.
Dataset Structure
Data… See the full description on the dataset page: https://huggingface.co/datasets/xlangai/spider.
Spider dataset is used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Spider Schema
Dataset Summary
Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases. This dataset contains the 166 databases used in the Spider dataset.
Yale Lily Spider Leaderboards
The leaderboard can be seen at https://yale-lily.github.io/spider
Languages
The text in… See the full description on the dataset page: https://huggingface.co/datasets/richardr1126/spider-schema.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Spider Skeleton Context Instruct
Dataset Summary
Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases. This dataset was created to finetune LLMs in a ### Instruction: and ### Response: format with database context.
Yale Lily Spider Leaderboards
The leaderboard can be seen at… See the full description on the dataset page: https://huggingface.co/datasets/richardr1126/spider-skeleton-context-instruct.
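The ### Instruction: and ### Response: layout mentioned in the Spider Skeleton Context Instruct summary could be assembled roughly as follows. This is a minimal sketch: only the two markers come from the dataset card, while the function name `build_prompt`, the instruction wording, and the example schema are hypothetical and may differ from what the dataset actually uses.

```python
def build_prompt(question: str, schema: str, sql: str = "") -> str:
    """Format one example in an instruction/response layout.

    Only the '### Instruction:' and '### Response:' markers are taken from
    the dataset card; the instruction wording below is an assumption.
    """
    instruction = (
        "Convert the question to a SQL query using the database schema.\n"
        f"Schema:\n{schema}\n"
        f"Question: {question}"
    )
    # Leave the response empty at inference time so the model completes it.
    return f"### Instruction:\n{instruction}\n\n### Response:\n{sql}"


# Hypothetical training example with a one-table schema as database context.
example = build_prompt(
    question="How many singers are there?",
    schema="CREATE TABLE singer (singer_id INTEGER, name TEXT)",
    sql="SELECT count(*) FROM singer",
)
print(example)
```

Calling `build_prompt` without the `sql` argument yields a prompt that ends right after the response marker, which is the usual shape for generation.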
Spider 2.0 is a comprehensive code generation agent task that includes 632 examples. The agent has to interactively explore various types of databases, such as BigQuery, Snowflake, Postgres, ClickHouse, DuckDB, and SQLite. It is required to engage with complex SQL workflows, process extensive contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines across multiple interactions.
TURSpider is a novel Turkish Text-to-SQL dataset that includes complex queries, akin to those in the original Spider dataset. The TURSpider dataset comprises two main subsets, a dev set and a training set, aligned with the structure and scale of the popular Spider dataset. The dev set contains 1034 data rows with 1023 unique questions and 584 distinct SQL queries. The training set contains 8659 data rows with 8506 unique questions and their corresponding SQL queries.
https://www.apache.org/licenses/LICENSE-2.0.html
SpiderDec is an extension of the Spider Dataset. The original Spider dataset split the data into training, development, and a hidden test set. For this new dataset, we manually decomposed the questions and corresponding queries within the development set of the Spider dataset, focusing on those with hard and extra hard SQL queries. The result of this effort is the creation of SpiderDec.
radhikachapaneri/spider-sql-prompts dataset hosted on Hugging Face and contributed by the HF Datasets community
MultiSpider is a large multilingual text-to-SQL dataset which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese).
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset comprises 8,034 entries designed to assess the performance of text-to-SQL models. Each entry includes a natural language text query and its corresponding SQL command. It is a subset derived from the Spider dataset, focusing on diverse and complex queries to challenge machine learning models' understanding and generation capabilities. This is a free dataset, ideal for data science and analytics, particularly in natural language processing and deep learning applications.
The dataset consists of 8,034 entries. The text_query column features 7,990 unique values, whilst the sql_command column contains 4,525 unique values. Data files are typically provided in CSV format, and a sample file will be uploaded to the platform separately. The dataset is currently at version 1.0.
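Row and unique-value counts like the ones reported for text_query and sql_command can be recomputed from a CSV file with the standard library. The sketch below is illustrative only: the helper name `unique_counts` and the three inline rows are made up, and a real file would be opened from disk instead of an in-memory string.

```python
import csv
import io


def unique_counts(csv_text: str) -> dict:
    """Return the row count and the number of distinct values per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {name: set() for name in reader.fieldnames}
    rows = 0
    for row in reader:
        rows += 1
        for name, value in row.items():
            seen[name].add(value)
    return {"rows": rows, **{name: len(values) for name, values in seen.items()}}


# Tiny invented sample: two differently worded questions share one SQL command,
# which is why sql_command has fewer unique values than text_query.
sample = """text_query,sql_command
How many singers are there?,SELECT count(*) FROM singer
Count the singers.,SELECT count(*) FROM singer
List all singer names.,SELECT name FROM singer
"""
counts = unique_counts(sample)
print(counts)  # {'rows': 3, 'text_query': 3, 'sql_command': 2}
```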
This dataset is ideal for evaluating the performance of text-to-SQL models. It can be utilised to challenge and enhance the understanding and generation capabilities of various machine learning models, especially within the domains of natural language processing and deep learning research and development.
The dataset's regional scope is global. The listing date for this dataset is noted as 08/06/2025.
CC-BY-SA
This dataset is intended for:
* Machine learning researchers and developers who aim to train, test, or validate text-to-SQL models.
* Professionals and academics working in natural language processing (NLP) and deep learning.
* Data scientists and analysts focused on building and evaluating artificial intelligence models for natural language understanding and SQL generation.
Original Data Source: Text to SQL dataset
This dataset is a Portuguese-translated version of the b-mc2/sql-create-context dataset, constructed from the WikiSQL and Spider datasets. It contains examples of questions in Portuguese, SQL CREATE TABLE statements, and SQL queries that answer the questions using the CREATE TABLE statement as context. The main goal of this dataset is to assist Portuguese natural language models in generating precise and contextualised SQL queries, preventing the hallucination of column and table names, a common issue for models trained on text-to-SQL datasets. By providing only the CREATE TABLE statement as context, the dataset aims to better ground the models without the need to provide actual data rows, limiting token use and exposure to private, sensitive, or proprietary data.
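One way to see why a CREATE TABLE statement alone is useful context: it is enough to check whether a generated query refers only to tables and columns that exist, with no data rows involved. The sketch below assumes SQLite as the SQL dialect (the dataset does not specify one), and the helper name `query_fits_schema` and the example schema are hypothetical.

```python
import sqlite3


def query_fits_schema(create_statements: str, query: str) -> bool:
    """Check a query against an empty in-memory database built from the schema.

    EXPLAIN makes SQLite compile the query without reading any rows, so an
    unknown table or column surfaces as an error.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(create_statements)
        conn.execute(f"EXPLAIN {query}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Hypothetical schema context and two candidate queries.
contexto = "CREATE TABLE head (age INTEGER, name TEXT);"
print(query_fits_schema(contexto, "SELECT count(*) FROM head WHERE age > 56"))    # True
print(query_fits_schema(contexto, "SELECT count(*) FROM head WHERE years > 56"))  # False
```

The second query is rejected because `years` is a hallucinated column name that does not appear in the schema.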
This dataset consists of 78,577 entries. Each entry represents a question about a database, the context of the database schema, and the corresponding SQL query. Data files are typically in CSV format. The 'pergunta' column contains 78,220 unique values, 'contexto' has 72,947 unique values, and 'resposta' has 78,577 unique values.
This dataset is ideal for:
* Training natural language models for SQL query generation, especially in scenarios where accuracy in naming columns and tables is crucial.
* Enhancing model performance in text-to-SQL tasks.
* Supporting natural language processing and machine learning tasks related to generating structured queries from natural language.
The dataset has a global region scope and focuses on the Portuguese language. The questions were translated into Portuguese using the facebook/nllb-200-distilled-1.3B model. It was listed on 22/06/2025.
CC-BY-NC
This dataset is suitable for:
* Data scientists and analysts focused on developing and refining natural language processing models.
* Researchers and developers working on text-to-SQL solutions.
* Anyone aiming to build or improve AI models that translate natural language queries into SQL, particularly for Portuguese.
Original Data Source: Portuguese Text2SQL database
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Spider Context Validation
Dataset Summary
Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases. This dataset was created to validate spider-fine-tuned LLMs with database context.
Yale Lily Spider Leaderboards
The leaderboard can be seen at https://yale-lily.github.io/spider… See the full description on the dataset page: https://huggingface.co/datasets/richardr1126/spider-context-validation.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
mjerome89/ORPRO-Spider-SQL-Filtered dataset hosted on Hugging Face and contributed by the HF Datasets community
SQL-Eval is an open-source PostgreSQL evaluation dataset released by Defog, constructed based on Spider. The original link can be found at https://github.com/defog-ai/sql-eval. Our evaluation methodology is more stringent, as it compares the execution accuracy of the predicted SQL queries against the sole ground truth SQL query.
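Comparing a predicted query with a single ground-truth query by execution can be sketched as below. This is not the SQL-Eval implementation: it uses SQLite from the standard library as a stand-in for PostgreSQL, compares result rows as an unordered multiset (an assumption), and the function name `execution_match` and the sample table are hypothetical.

```python
import sqlite3
from collections import Counter


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """Return True when both queries produce the same rows, ignoring row order."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A prediction that fails to execute counts as a miss.
        return False
    gold = conn.execute(gold_sql).fetchall()
    return Counter(predicted) == Counter(gold)


# Small invented database to run both queries against.
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE singer (name TEXT, age INTEGER);"
    "INSERT INTO singer VALUES ('Ann', 30), ('Bob', 45);"
)
gold = "SELECT name FROM singer WHERE age >= 45"
print(execution_match(conn, "SELECT name FROM singer WHERE age > 40", gold))  # True
print(execution_match(conn, "SELECT name FROM singer", gold))                 # False
```

With only one ground-truth query per question, a prediction that is semantically reasonable but returns different columns or extra rows is scored as wrong, which is what makes this style of evaluation strict.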
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Cleaned Spider Dataset for Text2SQL
Dataset Summary
The Cleaned Spider Dataset for Text2SQL is an improved version of the original Spider dataset, which is a large-scale, complex, and cross-domain semantic parsing and text-to-SQL dataset. This enhanced version addresses several critical issues found in the original dataset, ensuring higher quality and reliability for training text-to-SQL models. The enhancements were made possible through Turbular's advanced data… See the full description on the dataset page: https://huggingface.co/datasets/Turbular/fixed_spider.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Link to original dataset: https://yale-lily.github.io/spider
Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., Ma, J., Li, I., Yao, Q., Roman, S. and Zhang, Z., 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. arXiv preprint arXiv:1809.08887.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fork of b-mc2/sql-create-context
Overview
This dataset builds on WikiSQL and Spider. There are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and SQL queries answering the question using the CREATE statement as context. This dataset was built with text-to-SQL LLMs in mind, intending to prevent the hallucination of column and table names often seen in models trained on text-to-SQL datasets. The CREATE TABLE statement can often be copied and pasted from… See the full description on the dataset page: https://huggingface.co/datasets/philschmid/sql-create-context-copy.
VictorDCh/spider-clean-text-to-sql-4 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Distributed under Creative Commons BY-SA 4.0, respecting the ShareAlike terms of the Spider dataset. Code explanations and links to the model checkpoints and datasets are on GitHub (mRAT-SQL). The model checkpoints and datasets can be downloaded from the Hugging Face collection, but the GitHub repository (mRAT-SQL) is the better place to understand them.
mRAT-SQL-FIT
A Multilingual Translator to SQL with Database Schema Pruning to Improve Self-Attention
Marcelo Archanjo Jose, Fabio… See the full description on the dataset page: https://huggingface.co/datasets/Marchanjo/spider-en-pt-es-fr.