15 datasets found
  1. h

    Data from: text-to-sql

    • huggingface.co
    Updated Jan 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mir Mudasir (2025). text-to-sql [Dataset]. https://huggingface.co/datasets/Mudasir692/text-to-sql
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 20, 2025
    Authors
    Mir Mudasir
    Description

    This dataset is a merged collection of multiple text-to-SQL datasets, designed to provide a comprehensive resource for training and evaluating text-to-SQL models. It combines data from several popular benchmarks, including Spider, CoSQL, SparC, and others, to create a diverse and robust dataset for natural language to SQL query generation tasks. Dataset Details Dataset Description Curated by: Mudasir Ahmad Mir Language(s) (NLP): English License: Apache 2.0 This dataset is ideal for researchers… See the full description on the dataset page: https://huggingface.co/datasets/Mudasir692/text-to-sql.

  2. h

    synthetic_text_to_sql

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gretel.ai, synthetic_text_to_sql [Dataset]. https://huggingface.co/datasets/gretelai/synthetic_text_to_sql
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset provided by
    Gretel.ai
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Image generated by DALL-E. See prompt for more details

      synthetic_text_to_sql
    

    gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:

    105,851 records partitioned into 100,000 train and 5,851 test records ~23M total tokens, including ~12M SQL tokens Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.

  3. O

    NSText2SQL

    • opendatalab.com
    • huggingface.co
    zip
    Updated Jul 1, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). NSText2SQL [Dataset]. https://opendatalab.com/OpenDataLab/NSText2SQL
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jul 1, 2024
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    NSText2SQL dataset used to train NSQL models. The data is curated from more than 20 different public sources across the web with permissable licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 samples of text-to-SQL pairs.

  4. h

    gretel-synthetic-text-to-sql

    • huggingface.co
    Updated Jul 8, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Philipp Schmid (2025). gretel-synthetic-text-to-sql [Dataset]. https://huggingface.co/datasets/philschmid/gretel-synthetic-text-to-sql
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 8, 2025
    Authors
    Philipp Schmid
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Fork of gretelai/synthetic_text_to_sql

    The gretelai/synthetic_text_to_sql dataset is a large, Apache 2.0 licensed, synthetic Text-to-SQL dataset consisting of 105,851 high-quality records across 100 diverse domains, designed for training language models. It includes comprehensive SQL tasks with varying complexities, database contexts, natural language explanations, and contextual tags, outperforming existing datasets in SQL correctness and standards compliance.

  5. h

    BiomedSQL

    • huggingface.co
    • ollama.hf-mirror.com
    Updated May 28, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Center for Alzheimer’s and Related Dementias (CARD) (2025). BiomedSQL [Dataset]. https://huggingface.co/datasets/NIH-CARD/BiomedSQL
    Explore at:
    Dataset updated
    May 28, 2025
    Dataset authored and provided by
    Center for Alzheimer’s and Related Dementias (CARD)
    License

    https://choosealicense.com/licenses/openrail++/https://choosealicense.com/licenses/openrail++/

    Description

    BiomedSQL GitHub Dataset Summary BiomedSQL is a text-to-SQL benchmark designed to evaluate Large Language Models (LLMs) on scientific tabular reasoning tasks. It consists of curated question-SQL query-answer triples covering a variety of biomedical and SQL reasoning types. The benchmark challenges models to apply implicit scientific criteria rather than simply translating syntax. benchmark_data: contains the question-SQL query-answer triples. db_data: contains the parquet files needed to… See the full description on the dataset page: https://huggingface.co/datasets/NIH-CARD/BiomedSQL.

  6. h

    fixed_spider

    • huggingface.co
    Updated Jun 14, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Turbular (2024). fixed_spider [Dataset]. https://huggingface.co/datasets/Turbular/fixed_spider
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 14, 2024
    Dataset authored and provided by
    Turbular
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Cleaned Spider Dataset for Text2SQL

      Dataset Summary
    

    The Cleaned Spider Dataset for Text2SQL is an improved version of the original Spider dataset, which is a large-scale, complex, and cross-domain semantic parsing and text-to-SQL dataset. This enhanced version addresses several critical issues found in the original dataset, ensuring higher quality and reliability for training text-to-SQL models. The enhancements were made possible through Turbular's advanced data… See the full description on the dataset page: https://huggingface.co/datasets/Turbular/fixed_spider.

  7. h

    wikisql-generate

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Taqi Jaffri, wikisql-generate [Dataset]. https://huggingface.co/datasets/tjaffri/wikisql-generate
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Taqi Jaffri
    License

    https://choosealicense.com/licenses/bsd-3-clause/https://choosealicense.com/licenses/bsd-3-clause/

    Description

    WikiSQL Dataset (Reformatted for Generative Models)

    This is the exact same dataset as WikiSQL: https://huggingface.co/datasets/wikisql, but with the data reformatted to allow direct use with text generation LLMs. The original license and credits for the original dataset remain in place. Specifically, the changes from standard WikiSQL are:

    The table details in WikiSQL were included as dictionaries but tools like LangChain and LlamaIndex build their prompts using a SQL DESCRIBE of… See the full description on the dataset page: https://huggingface.co/datasets/tjaffri/wikisql-generate.

  8. h

    Llama-2-SQL-and-Code-Dataset

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Chris Hayduk, Llama-2-SQL-and-Code-Dataset [Dataset]. https://huggingface.co/datasets/ChrisHayduk/Llama-2-SQL-and-Code-Dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Chris Hayduk
    Description

    Dataset Card for "Llama-2-SQL-and-Code-Dataset"

    This dataset is intended to provide LLaMA 2 improved coding and instruction following capabilities, with a specific focus on SQL generation. The dataset is in Alpaca Instruct format. Please be sure to provide the instruction and input in the prompt to the model, along with any prompt text you would like to place around those inputs. In the train split, please ignore the table column. The eval split provides example tables so that the… See the full description on the dataset page: https://huggingface.co/datasets/ChrisHayduk/Llama-2-SQL-and-Code-Dataset.

  9. h

    SynsQL-Think-916k

    • huggingface.co
    Updated Jul 31, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    cycloneboy (2025). SynsQL-Think-916k [Dataset]. https://huggingface.co/datasets/cycloneboy/SynsQL-Think-916k
    Explore at:
    Dataset updated
    Jul 31, 2025
    Authors
    cycloneboy
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    SLM-SQL: An Exploration of Small Language Models for Text-to-SQL

      Important Links
    

    📖Arxiv Paper | 🤗HuggingFace | 🤖ModelScope |

      News
    

    July 31, 2025: Upload model to modelscope and huggingface. July 30, 2025: Publish the paper to arxiv

      Introduction
    

    Large language models (LLMs) have demonstrated strong performance in translating natural language questions into SQL queries (Text-to-SQL). In contrast, small language models (SLMs) ranging from 0.5B to… See the full description on the dataset page: https://huggingface.co/datasets/cycloneboy/SynsQL-Think-916k.

  10. h

    OGText2SQL

    • huggingface.co
    Updated Apr 24, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OneGate (2024). OGText2SQL [Dataset]. https://huggingface.co/datasets/OneGate/OGText2SQL
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 24, 2024
    Dataset authored and provided by
    OneGate
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Summary

    OGText2SQL dataset was utilized in training the OGSQL model, this dataset comprises over 350,000 rows of text-to-SQL pairs. Through a series of data refining steps, including schema expansion, SQL refinement, and instruction generation using existing Language Models (LLMs), the dataset was meticulously processed to ensure quality and relevance.

      How to use it
    

    Python

    from datasets import load_dataset

    dataset = load_dataset("OneGate/OGText2SQL")

    API… See the full description on the dataset page: https://huggingface.co/datasets/OneGate/OGText2SQL.

  11. h

    sql-create-context-chatml

    • huggingface.co
    Updated Apr 2, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Re:cast AI (2024). sql-create-context-chatml [Dataset]. https://huggingface.co/datasets/recastai/sql-create-context-chatml
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 2, 2024
    Dataset authored and provided by
    Re:cast AI
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Summary

    This dataset has been created by Re:cast AI to extend the existing dataset b-mc2/sql-create-context into a chatml friendly format for use in SFT tasks with pretrained models.

      Dataset Structure
    

    messages = [ {'content': "You are a powerful text-to-SQL AI assistant that helps users ... etc.", 'role': 'system'}, {'content': '(Optional) Context information is below ... etc.', 'role': 'user'}, {'content': 'SELECT COUNT(*) FROM head WHERE age > 56'… See the full description on the dataset page: https://huggingface.co/datasets/recastai/sql-create-context-chatml.

  12. h

    SynQL-KaggleDBQA-Topics

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Semiotic Labs, SynQL-KaggleDBQA-Topics [Dataset]. https://huggingface.co/datasets/semiotic/SynQL-KaggleDBQA-Topics
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    Semiotic Labs
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for SynQL-KaggleDBQA-Train-Topics

    Developed by: Semiotic Labs Model type: [Text to SQL] License: [Apache-2.0]

      Dataset Details
    

    Example view of data: { "StudentMathScore": { "1": "Federal Revenue Data (Questions about federal revenue information related to different states and school districts)", "2": "Math Score Data (Questions about the average math scores in different states for grade 8 students)", "3": "Revenue Key Data… See the full description on the dataset page: https://huggingface.co/datasets/semiotic/SynQL-KaggleDBQA-Topics.

  13. h

    SynQL-Spider-Train-Source-Templates

    • huggingface.co
    Updated Oct 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Semiotic Labs (2024). SynQL-Spider-Train-Source-Templates [Dataset]. https://huggingface.co/datasets/semiotic/SynQL-Spider-Train-Source-Templates
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 26, 2024
    Dataset authored and provided by
    Semiotic Labs
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for SynQL-Spider-Train-Source-Templates

    Developed by: Semiotic Labs Model type: [Text to SQL] License: [Apache-2.0]

      Dataset Details
    

    Example view of data: { "0": "SELECT ? FROM ? JOIN ? ON ? = ? WHERE ? IN (SELECT ? FROM ? JOIN ? ON ? = ? WHERE ? = ?)", "1": "SELECT COUNT(?) FROM ? JOIN ? ON ? = ? WHERE ? != ?", "2": "SELECT ? FROM ? WHERE ? = ? OR ? = ?", "3": "SELECT ? FROM ? WHERE ? LIKE ? ORDER BY ?", ... "911": "SELECT ? FROM ?… See the full description on the dataset page: https://huggingface.co/datasets/semiotic/SynQL-Spider-Train-Source-Templates.

  14. h

    birdbench-duckdb

    • huggingface.co
    Updated Mar 27, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sayantan Das (2025). birdbench-duckdb [Dataset]. https://huggingface.co/datasets/ucalyptus/birdbench-duckdb
    Explore at:
    Dataset updated
    Mar 27, 2025
    Authors
    Sayantan Das
    Description

    BirdBench Dataset in DuckDB format

    BirdBench is a benchmark for text-to-SQL capabilities, now available in DuckDB format for improved performance and usability.

      About BirdBench
    

    BirdBench is a comprehensive benchmark dataset for evaluating text-to-SQL capabilities of language models. It features a diverse collection of databases spanning various domains including:

    Business and finance Entertainment and media Sports and recreation Health and medicine Education Travel and… See the full description on the dataset page: https://huggingface.co/datasets/ucalyptus/birdbench-duckdb.

  15. h

    Spider-1C

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alexander, Spider-1C [Dataset]. https://huggingface.co/datasets/kavlab/Spider-1C
    Explore at:
    Authors
    Alexander
    Description

    Spider-1C: A Text-to-SQL Dataset Adapted for 1C:Enterprise

    Spider-1C is a specialized adaptation of the well-known Spider dataset designed specifically for the task of Text-to-SQL within the context of 1C:Enterprise, a widely used ERP platform in Russia and other countries. This dataset facilitates the training and evaluation of large language models (LLMs) aimed at converting natural language queries (primarily in Russian) into 1C-specific SQL queries (1C Query Language).… See the full description on the dataset page: https://huggingface.co/datasets/kavlab/Spider-1C.

  16. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Mir Mudasir (2025). text-to-sql [Dataset]. https://huggingface.co/datasets/Mudasir692/text-to-sql

Data from: text-to-sql

Mudasir692/text-to-sql

Related Article
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 20, 2025
Authors
Mir Mudasir
Description

This dataset is a merged collection of multiple text-to-SQL datasets, designed to provide a comprehensive resource for training and evaluating text-to-SQL models. It combines data from several popular benchmarks, including Spider, CoSQL, SparC, and others, to create a diverse and robust dataset for natural language to SQL query generation tasks. Dataset Details Dataset Description Curated by: Mudasir Ahmad Mir Language(s) (NLP): English License: Apache 2.0 This dataset is ideal for researchers… See the full description on the dataset page: https://huggingface.co/datasets/Mudasir692/text-to-sql.

Search
Clear search
Close search
Google apps
Main menu