15 datasets found

h
Data from: text-to-sql
huggingface.co
Updated Jan 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mir Mudasir (2025). text-to-sql [Dataset]. https://huggingface.co/datasets/Mudasir692/text-to-sql
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 20, 2025
Authors
Mir Mudasir
Description
This dataset is a merged collection of multiple text-to-SQL datasets, designed to provide a comprehensive resource for training and evaluating text-to-SQL models. It combines data from several popular benchmarks, including Spider, CoSQL, SparC, and others, to create a diverse and robust dataset for natural language to SQL query generation tasks. Dataset Details Dataset Description Curated by: Mudasir Ahmad Mir Language(s) (NLP): English License: Apache 2.0 This dataset is ideal for researchers… See the full description on the dataset page: https://huggingface.co/datasets/Mudasir692/text-to-sql.
h
synthetic_text_to_sql
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gretel.ai, synthetic_text_to_sql [Dataset]. https://huggingface.co/datasets/gretelai/synthetic_text_to_sql
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset provided by
Gretel.ai
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Image generated by DALL-E. See prompt for more details

synthetic_text_to_sql

gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:

105,851 records partitioned into 100,000 train and 5,851 test records ~23M total tokens, including ~12M SQL tokens Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
O
NSText2SQL
opendatalab.com
huggingface.co
zip
Updated Jul 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). NSText2SQL [Dataset]. https://opendatalab.com/OpenDataLab/NSText2SQL
Explore at:
zipAvailable download formats
Dataset updated
Jul 1, 2024
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
NSText2SQL dataset used to train NSQL models. The data is curated from more than 20 different public sources across the web with permissable licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 samples of text-to-SQL pairs.
h
gretel-synthetic-text-to-sql
huggingface.co
Updated Jul 8, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Philipp Schmid (2025). gretel-synthetic-text-to-sql [Dataset]. https://huggingface.co/datasets/philschmid/gretel-synthetic-text-to-sql
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 8, 2025
Authors
Philipp Schmid
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Fork of gretelai/synthetic_text_to_sql

The gretelai/synthetic_text_to_sql dataset is a large, Apache 2.0 licensed, synthetic Text-to-SQL dataset consisting of 105,851 high-quality records across 100 diverse domains, designed for training language models. It includes comprehensive SQL tasks with varying complexities, database contexts, natural language explanations, and contextual tags, outperforming existing datasets in SQL correctness and standards compliance.
h
BiomedSQL
huggingface.co
ollama.hf-mirror.com
Updated May 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Center for Alzheimer’s and Related Dementias (CARD) (2025). BiomedSQL [Dataset]. https://huggingface.co/datasets/NIH-CARD/BiomedSQL
Explore at:
Dataset updated
May 28, 2025
Dataset authored and provided by
Center for Alzheimer’s and Related Dementias (CARD)
License
https://choosealicense.com/licenses/openrail++/https://choosealicense.com/licenses/openrail++/
Description
BiomedSQL GitHub Dataset Summary BiomedSQL is a text-to-SQL benchmark designed to evaluate Large Language Models (LLMs) on scientific tabular reasoning tasks. It consists of curated question-SQL query-answer triples covering a variety of biomedical and SQL reasoning types. The benchmark challenges models to apply implicit scientific criteria rather than simply translating syntax. benchmark_data: contains the question-SQL query-answer triples. db_data: contains the parquet files needed to… See the full description on the dataset page: https://huggingface.co/datasets/NIH-CARD/BiomedSQL.
h
fixed_spider
huggingface.co
Updated Jun 14, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Turbular (2024). fixed_spider [Dataset]. https://huggingface.co/datasets/Turbular/fixed_spider
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 14, 2024
Dataset authored and provided by
Turbular
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Cleaned Spider Dataset for Text2SQL

Dataset Summary

The Cleaned Spider Dataset for Text2SQL is an improved version of the original Spider dataset, which is a large-scale, complex, and cross-domain semantic parsing and text-to-SQL dataset. This enhanced version addresses several critical issues found in the original dataset, ensuring higher quality and reliability for training text-to-SQL models. The enhancements were made possible through Turbular's advanced data… See the full description on the dataset page: https://huggingface.co/datasets/Turbular/fixed_spider.
h
wikisql-generate
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Taqi Jaffri, wikisql-generate [Dataset]. https://huggingface.co/datasets/tjaffri/wikisql-generate
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Taqi Jaffri
License
https://choosealicense.com/licenses/bsd-3-clause/https://choosealicense.com/licenses/bsd-3-clause/
Description
WikiSQL Dataset (Reformatted for Generative Models)

This is the exact same dataset as WikiSQL: https://huggingface.co/datasets/wikisql, but with the data reformatted to allow direct use with text generation LLMs. The original license and credits for the original dataset remain in place. Specifically, the changes from standard WikiSQL are:

The table details in WikiSQL were included as dictionaries but tools like LangChain and LlamaIndex build their prompts using a SQL DESCRIBE of… See the full description on the dataset page: https://huggingface.co/datasets/tjaffri/wikisql-generate.
h
Llama-2-SQL-and-Code-Dataset
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chris Hayduk, Llama-2-SQL-and-Code-Dataset [Dataset]. https://huggingface.co/datasets/ChrisHayduk/Llama-2-SQL-and-Code-Dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Chris Hayduk
Description
Dataset Card for "Llama-2-SQL-and-Code-Dataset"

This dataset is intended to provide LLaMA 2 improved coding and instruction following capabilities, with a specific focus on SQL generation. The dataset is in Alpaca Instruct format. Please be sure to provide the instruction and input in the prompt to the model, along with any prompt text you would like to place around those inputs. In the train split, please ignore the table column. The eval split provides example tables so that the… See the full description on the dataset page: https://huggingface.co/datasets/ChrisHayduk/Llama-2-SQL-and-Code-Dataset.
h
SynsQL-Think-916k
huggingface.co
Updated Jul 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
cycloneboy (2025). SynsQL-Think-916k [Dataset]. https://huggingface.co/datasets/cycloneboy/SynsQL-Think-916k
Explore at:
Dataset updated
Jul 31, 2025
Authors
cycloneboy
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
SLM-SQL: An Exploration of Small Language Models for Text-to-SQL

Important Links

📖Arxiv Paper | 🤗HuggingFace | 🤖ModelScope |

News

July 31, 2025: Upload model to modelscope and huggingface. July 30, 2025: Publish the paper to arxiv

Introduction

Large language models (LLMs) have demonstrated strong performance in translating natural language questions into SQL queries (Text-to-SQL). In contrast, small language models (SLMs) ranging from 0.5B to… See the full description on the dataset page: https://huggingface.co/datasets/cycloneboy/SynsQL-Think-916k.
h
OGText2SQL
huggingface.co
Updated Apr 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
OneGate (2024). OGText2SQL [Dataset]. https://huggingface.co/datasets/OneGate/OGText2SQL
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 24, 2024
Dataset authored and provided by
OneGate
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Summary

OGText2SQL dataset was utilized in training the OGSQL model, this dataset comprises over 350,000 rows of text-to-SQL pairs. Through a series of data refining steps, including schema expansion, SQL refinement, and instruction generation using existing Language Models (LLMs), the dataset was meticulously processed to ensure quality and relevance.

How to use it

Python

from datasets import load_dataset

dataset = load_dataset("OneGate/OGText2SQL")

API… See the full description on the dataset page: https://huggingface.co/datasets/OneGate/OGText2SQL.
h
sql-create-context-chatml
huggingface.co
Updated Apr 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Re:cast AI (2024). sql-create-context-chatml [Dataset]. https://huggingface.co/datasets/recastai/sql-create-context-chatml
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 2, 2024
Dataset authored and provided by
Re:cast AI
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Summary

This dataset has been created by Re:cast AI to extend the existing dataset b-mc2/sql-create-context into a chatml friendly format for use in SFT tasks with pretrained models.

Dataset Structure

messages = [ {'content': "You are a powerful text-to-SQL AI assistant that helps users ... etc.", 'role': 'system'}, {'content': '(Optional) Context information is below ... etc.', 'role': 'user'}, {'content': 'SELECT COUNT(*) FROM head WHERE age > 56'… See the full description on the dataset page: https://huggingface.co/datasets/recastai/sql-create-context-chatml.
h
SynQL-KaggleDBQA-Topics
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Semiotic Labs, SynQL-KaggleDBQA-Topics [Dataset]. https://huggingface.co/datasets/semiotic/SynQL-KaggleDBQA-Topics
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset authored and provided by
Semiotic Labs
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for SynQL-KaggleDBQA-Train-Topics

Developed by: Semiotic Labs Model type: [Text to SQL] License: [Apache-2.0]

Dataset Details

Example view of data: { "StudentMathScore": { "1": "Federal Revenue Data (Questions about federal revenue information related to different states and school districts)", "2": "Math Score Data (Questions about the average math scores in different states for grade 8 students)", "3": "Revenue Key Data… See the full description on the dataset page: https://huggingface.co/datasets/semiotic/SynQL-KaggleDBQA-Topics.
h
SynQL-Spider-Train-Source-Templates
huggingface.co
Updated Oct 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Semiotic Labs (2024). SynQL-Spider-Train-Source-Templates [Dataset]. https://huggingface.co/datasets/semiotic/SynQL-Spider-Train-Source-Templates
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 26, 2024
Dataset authored and provided by
Semiotic Labs
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for SynQL-Spider-Train-Source-Templates

Developed by: Semiotic Labs Model type: [Text to SQL] License: [Apache-2.0]

Dataset Details

Example view of data: { "0": "SELECT ? FROM ? JOIN ? ON ? = ? WHERE ? IN (SELECT ? FROM ? JOIN ? ON ? = ? WHERE ? = ?)", "1": "SELECT COUNT(?) FROM ? JOIN ? ON ? = ? WHERE ? != ?", "2": "SELECT ? FROM ? WHERE ? = ? OR ? = ?", "3": "SELECT ? FROM ? WHERE ? LIKE ? ORDER BY ?", ... "911": "SELECT ? FROM ?… See the full description on the dataset page: https://huggingface.co/datasets/semiotic/SynQL-Spider-Train-Source-Templates.
h
birdbench-duckdb
huggingface.co
Updated Mar 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sayantan Das (2025). birdbench-duckdb [Dataset]. https://huggingface.co/datasets/ucalyptus/birdbench-duckdb
Explore at:
Dataset updated
Mar 27, 2025
Authors
Sayantan Das
Description
BirdBench Dataset in DuckDB format

BirdBench is a benchmark for text-to-SQL capabilities, now available in DuckDB format for improved performance and usability.

About BirdBench

BirdBench is a comprehensive benchmark dataset for evaluating text-to-SQL capabilities of language models. It features a diverse collection of databases spanning various domains including:

Business and finance Entertainment and media Sports and recreation Health and medicine Education Travel and… See the full description on the dataset page: https://huggingface.co/datasets/ucalyptus/birdbench-duckdb.
h
Spider-1C
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexander, Spider-1C [Dataset]. https://huggingface.co/datasets/kavlab/Spider-1C
Explore at:
Authors
Alexander
Description
Spider-1C: A Text-to-SQL Dataset Adapted for 1C:Enterprise

Spider-1C is a specialized adaptation of the well-known Spider dataset designed specifically for the task of Text-to-SQL within the context of 1C:Enterprise, a widely used ERP platform in Russia and other countries. This dataset facilitates the training and evaluation of large language models (LLMs) aimed at converting natural language queries (primarily in Russian) into 1C-specific SQL queries (1C Query Language).… See the full description on the dataset page: https://huggingface.co/datasets/kavlab/Spider-1C.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Mir Mudasir (2025). text-to-sql [Dataset]. https://huggingface.co/datasets/Mudasir692/text-to-sql

Data from: text-to-sql

Mudasir692/text-to-sql

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jan 20, 2025

Authors

Mir Mudasir

Description

This dataset is a merged collection of multiple text-to-SQL datasets, designed to provide a comprehensive resource for training and evaluating text-to-SQL models. It combines data from several popular benchmarks, including Spider, CoSQL, SparC, and others, to create a diverse and robust dataset for natural language to SQL query generation tasks. Dataset Details Dataset Description Curated by: Mudasir Ahmad Mir Language(s) (NLP): English License: Apache 2.0 This dataset is ideal for researchers… See the full description on the dataset page: https://huggingface.co/datasets/Mudasir692/text-to-sql.

Clear search

Close search

Google apps

Main menu

Data from: text-to-sql

synthetic_text_to_sql

NSText2SQL

gretel-synthetic-text-to-sql

BiomedSQL

fixed_spider

wikisql-generate

Llama-2-SQL-and-Code-Dataset

SynsQL-Think-916k

OGText2SQL

sql-create-context-chatml

SynQL-KaggleDBQA-Topics

SynQL-Spider-Train-Source-Templates

birdbench-duckdb

Spider-1C

Data from: text-to-sql

Mudasir692/text-to-sql