This dataset is a merged collection of multiple text-to-SQL datasets, designed to provide a comprehensive resource for training and evaluating text-to-SQL models. It combines data from several popular benchmarks, including Spider, CoSQL, SparC, and others, to create a diverse and robust dataset for natural language to SQL query generation tasks. Dataset Details Dataset Description Curated by: Mudasir Ahmad Mir Language(s) (NLP): English License: Apache 2.0 This dataset is ideal for researchers… See the full description on the dataset page: https://huggingface.co/datasets/Mudasir692/text-to-sql.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Image generated by DALL-E. See prompt for more details
synthetic_text_to_sql
gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:
105,851 records partitioned into 100,000 train and 5,851 test records ~23M total tokens, including ~12M SQL tokens Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
NSText2SQL dataset used to train NSQL models. The data is curated from more than 20 different public sources across the web with permissable licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 samples of text-to-SQL pairs.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Fork of gretelai/synthetic_text_to_sql
The gretelai/synthetic_text_to_sql dataset is a large, Apache 2.0 licensed, synthetic Text-to-SQL dataset consisting of 105,851 high-quality records across 100 diverse domains, designed for training language models. It includes comprehensive SQL tasks with varying complexities, database contexts, natural language explanations, and contextual tags, outperforming existing datasets in SQL correctness and standards compliance.
https://choosealicense.com/licenses/openrail++/https://choosealicense.com/licenses/openrail++/
BiomedSQL GitHub Dataset Summary BiomedSQL is a text-to-SQL benchmark designed to evaluate Large Language Models (LLMs) on scientific tabular reasoning tasks. It consists of curated question-SQL query-answer triples covering a variety of biomedical and SQL reasoning types. The benchmark challenges models to apply implicit scientific criteria rather than simply translating syntax. benchmark_data: contains the question-SQL query-answer triples. db_data: contains the parquet files needed to… See the full description on the dataset page: https://huggingface.co/datasets/NIH-CARD/BiomedSQL.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Cleaned Spider Dataset for Text2SQL
Dataset Summary
The Cleaned Spider Dataset for Text2SQL is an improved version of the original Spider dataset, which is a large-scale, complex, and cross-domain semantic parsing and text-to-SQL dataset. This enhanced version addresses several critical issues found in the original dataset, ensuring higher quality and reliability for training text-to-SQL models. The enhancements were made possible through Turbular's advanced data… See the full description on the dataset page: https://huggingface.co/datasets/Turbular/fixed_spider.
https://choosealicense.com/licenses/bsd-3-clause/https://choosealicense.com/licenses/bsd-3-clause/
WikiSQL Dataset (Reformatted for Generative Models)
This is the exact same dataset as WikiSQL: https://huggingface.co/datasets/wikisql, but with the data reformatted to allow direct use with text generation LLMs. The original license and credits for the original dataset remain in place. Specifically, the changes from standard WikiSQL are:
The table details in WikiSQL were included as dictionaries but tools like LangChain and LlamaIndex build their prompts using a SQL DESCRIBE of… See the full description on the dataset page: https://huggingface.co/datasets/tjaffri/wikisql-generate.
Dataset Card for "Llama-2-SQL-and-Code-Dataset"
This dataset is intended to provide LLaMA 2 improved coding and instruction following capabilities, with a specific focus on SQL generation. The dataset is in Alpaca Instruct format. Please be sure to provide the instruction and input in the prompt to the model, along with any prompt text you would like to place around those inputs. In the train split, please ignore the table column. The eval split provides example tables so that the… See the full description on the dataset page: https://huggingface.co/datasets/ChrisHayduk/Llama-2-SQL-and-Code-Dataset.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
SLM-SQL: An Exploration of Small Language Models for Text-to-SQL
Important Links
📖Arxiv Paper | 🤗HuggingFace | 🤖ModelScope |
News
July 31, 2025: Upload model to modelscope and huggingface. July 30, 2025: Publish the paper to arxiv
Introduction
Large language models (LLMs) have demonstrated strong performance in translating natural language questions into SQL queries (Text-to-SQL). In contrast, small language models (SLMs) ranging from 0.5B to… See the full description on the dataset page: https://huggingface.co/datasets/cycloneboy/SynsQL-Think-916k.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Dataset Summary
OGText2SQL dataset was utilized in training the OGSQL model, this dataset comprises over 350,000 rows of text-to-SQL pairs. Through a series of data refining steps, including schema expansion, SQL refinement, and instruction generation using existing Language Models (LLMs), the dataset was meticulously processed to ensure quality and relevance.
How to use it
Python
from datasets import load_dataset
dataset = load_dataset("OneGate/OGText2SQL")
API… See the full description on the dataset page: https://huggingface.co/datasets/OneGate/OGText2SQL.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Summary
This dataset has been created by Re:cast AI to extend the existing dataset b-mc2/sql-create-context into a chatml friendly format for use in SFT tasks with pretrained models.
Dataset Structure
messages = [ {'content': "You are a powerful text-to-SQL AI assistant that helps users ... etc.", 'role': 'system'}, {'content': '(Optional) Context information is below ... etc.', 'role': 'user'}, {'content': 'SELECT COUNT(*) FROM head WHERE age > 56'… See the full description on the dataset page: https://huggingface.co/datasets/recastai/sql-create-context-chatml.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for SynQL-KaggleDBQA-Train-Topics
Developed by: Semiotic Labs Model type: [Text to SQL] License: [Apache-2.0]
Dataset Details
Example view of data: { "StudentMathScore": { "1": "Federal Revenue Data (Questions about federal revenue information related to different states and school districts)", "2": "Math Score Data (Questions about the average math scores in different states for grade 8 students)", "3": "Revenue Key Data… See the full description on the dataset page: https://huggingface.co/datasets/semiotic/SynQL-KaggleDBQA-Topics.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for SynQL-Spider-Train-Source-Templates
Developed by: Semiotic Labs Model type: [Text to SQL] License: [Apache-2.0]
Dataset Details
Example view of data: { "0": "SELECT ? FROM ? JOIN ? ON ? = ? WHERE ? IN (SELECT ? FROM ? JOIN ? ON ? = ? WHERE ? = ?)", "1": "SELECT COUNT(?) FROM ? JOIN ? ON ? = ? WHERE ? != ?", "2": "SELECT ? FROM ? WHERE ? = ? OR ? = ?", "3": "SELECT ? FROM ? WHERE ? LIKE ? ORDER BY ?", ... "911": "SELECT ? FROM ?… See the full description on the dataset page: https://huggingface.co/datasets/semiotic/SynQL-Spider-Train-Source-Templates.
BirdBench Dataset in DuckDB format
BirdBench is a benchmark for text-to-SQL capabilities, now available in DuckDB format for improved performance and usability.
About BirdBench
BirdBench is a comprehensive benchmark dataset for evaluating text-to-SQL capabilities of language models. It features a diverse collection of databases spanning various domains including:
Business and finance Entertainment and media Sports and recreation Health and medicine Education Travel and… See the full description on the dataset page: https://huggingface.co/datasets/ucalyptus/birdbench-duckdb.
Spider-1C: A Text-to-SQL Dataset Adapted for 1C:Enterprise
Spider-1C is a specialized adaptation of the well-known Spider dataset designed specifically for the task of Text-to-SQL within the context of 1C:Enterprise, a widely used ERP platform in Russia and other countries. This dataset facilitates the training and evaluation of large language models (LLMs) aimed at converting natural language queries (primarily in Russian) into 1C-specific SQL queries (1C Query Language).… See the full description on the dataset page: https://huggingface.co/datasets/kavlab/Spider-1C.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
This dataset is a merged collection of multiple text-to-SQL datasets, designed to provide a comprehensive resource for training and evaluating text-to-SQL models. It combines data from several popular benchmarks, including Spider, CoSQL, SparC, and others, to create a diverse and robust dataset for natural language to SQL query generation tasks. Dataset Details Dataset Description Curated by: Mudasir Ahmad Mir Language(s) (NLP): English License: Apache 2.0 This dataset is ideal for researchers… See the full description on the dataset page: https://huggingface.co/datasets/Mudasir692/text-to-sql.