Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset consists of 8,034 entries designed to evaluate the performance of text-to-SQL models. Each entry contains a natural language text query and its corresponding SQL command. The dataset is a subset derived from the Spider dataset, focusing on diverse and complex queries to challenge the understanding and generation capabilities of machine learning models.
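One common way to score a text-to-SQL model on pairs like these is to run the predicted query and the reference query against the same database and compare the returned rows. The sketch below assumes an in-memory SQLite database; the `singer` table, its rows, and the `execution_match` helper are hypothetical and are not part of the dataset.

```python
import sqlite3


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """Return True when both queries yield the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A predicted query that fails to run counts as a miss.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)


# Hypothetical toy schema, for illustration only (not taken from the dataset).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [("Ann", 31), ("Bo", 45), ("Cy", 27)],
)

gold = "SELECT name FROM singer WHERE age > 30"
same_result = execution_match(conn, "SELECT name FROM singer WHERE NOT age <= 30", gold)
different_result = execution_match(conn, "SELECT name FROM singer", gold)
broken_query = execution_match(conn, "SELECT nme FROM singer", gold)
```

Comparing results rather than query text lets differently written but equivalent queries count as correct, although two queries can also agree by coincidence on a small database.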
This is the sample database from sqlservertutorial.net. This is a great dataset for learning SQL and practicing querying relational databases.
Database Diagram:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4146319%2Fc5838eb006bab3938ad94de02f58c6c1%2FSQL-Server-Sample-Database.png?generation=1692609884383007&alt=media
The sample database is copyrighted and cannot be used for commercial purposes, including but not limited to:
- Selling
- Including in paid courses
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is designed for training models to convert natural language prompts into SQL queries, specifically focusing on SELECT statements. The dataset comprises 14,815 examples where each prompt is associated with the corresponding SQL query that would retrieve the desired information from a specific table.
Columns:
- Prompt: The natural language text representing a query request.
- SQL Query: The corresponding SQL query generated to fulfill the request.
This dataset contains information about housing sales in Nashville, TN such as property, owner, sales, and tax information. The SQL queries I created for Data Cleaning can be found here.
https://creativecommons.org/publicdomain/zero/1.0/
I've always wanted to explore Kaggle's Meta Kaggle dataset, but I am more comfortable using T-SQL when it comes to writing (very) complex queries. I also tend to write queries faster in SQL Server Management Studio, like 100x faster. So I ported Kaggle's Meta Kaggle dataset into MS SQL Server 2022 database format, created a backup file, and uploaded it here.
Explore Kaggle's public data on competitions, datasets, kernels (code/notebooks) and more.
Meta Kaggle may not be the Rosetta Stone of data science, but they think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1842206%2F2ad97bce7839d6e57674e7a82981ed23%2F2Egeb8R.png?generation=1688912953875842&alt=media
This SQL injection dataset, sourced from multiple Kaggle datasets, has been cleaned and split into training, validation, and testing subsets at a 6:2:2 ratio. It is intended for research on detecting SQL injection attacks.
Label: This column represents the label for the SQL injection binary classification, where 1 indicates an SQL injection and 0 indicates a non-SQL injection.
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F20832724%2Fc0d648e947cec02e35beb665b24b5bdb%2Fsql-injection-datasets.png?generation=1718710914956418&alt=media
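A 6:2:2 split like the one described for this dataset can be sketched in a few lines. The description does not say how the split was shuffled or whether it was stratified by label, so the seeded shuffle, the `split_6_2_2` helper, and the sample rows below are assumptions for illustration only.

```python
import random


def split_6_2_2(rows, seed=0):
    """Shuffle rows and split them into train/validation/test at a 6:2:2 ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding on the cut points.
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


# Hypothetical (text, label) rows: 1 = SQL injection, 0 = non-SQL injection.
rows = [(f"sample {i}", i % 2) for i in range(10)]
train, validation, test = split_6_2_2(rows)
```

Any remainder from the integer division lands in the test subset, so every row is assigned to exactly one split.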
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.
The data fields are the same among all splits, and each file contains the phase, question, table, and SQL for each example.
- This dataset can be used to develop natural language interfaces for relational databases.
- This dataset can be used to develop a knowledge base of common SQL queries.
- This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
File: validation.csv

| Column name | Description |
|:------------|:------------|
| phase | The phase of the data collection. (String) |
| question | The question asked by the user. (String) |
| table | The table containing the data for the question. (String) |
| sql | The SQL query corresponding to the question. (String) |

File: train.csv

| Column name | Description |
|:------------|:------------|
| phase | The phase of the data collection. (String) |
| question | The question asked by the user. (String) |
| table | The table containing the data for the question. (String) |
| sql | The SQL query corresponding to the question. (String) |

File: test.csv

| Column name | Description |
|:------------|:------------|
| phase | The phase of the data collection. (String) |
| question | The question asked by the user. (String) |
| table | The table containing the data for the question. (String) |
| sql | The SQL query corresponding to the question. (String) |
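Given the four columns listed for each file, question/SQL pairs could be read with Python's standard `csv` module. This is a minimal sketch: the in-memory sample row is made up for illustration, and the real files may store `table` and `sql` in a richer serialized form, so it is worth inspecting a few rows first.

```python
import csv
import io

# Hypothetical stand-in for train.csv; the row below is invented for illustration.
sample_csv = io.StringIO(
    "phase,question,table,sql\n"
    "1,How many players are listed?,players_table,SELECT COUNT(*) FROM players_table\n"
)

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(sample_csv))
pairs = [(row["question"], row["sql"]) for row in rows]
```

To read an actual file, `sample_csv` would be replaced with something like `open("train.csv", newline="", encoding="utf-8")`.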
If you use this dataset in your research, please credit Huggingface Hub.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
I imported the two Olist Kaggle datasets into an SQLite database. I modified the original table names to make them shorter and easier to understand. Here's the Entity-Relationship Diagram of the resulting SQLite database:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2473556%2F23a7d4d8cd99e36e32e57303eb804fff%2Fdb-schema.png?generation=1714391550829633&alt=media
Data sources:
https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
https://www.kaggle.com/datasets/olistbr/marketing-funnel-olist
I used this database as a data source for my notebook:
This dataset is a practical SQL case study designed for learners who are looking to enhance their SQL skills in analyzing sales, products, and marketing data. It contains several SQL queries related to a simulated business database for product sales, marketing expenses, and location data. The database consists of three main tables: Fact, Product, and Location.
Objective of the Case Study: The purpose of this case study is to provide learners with a variety of practical SQL exercises that involve real-world business problems. The queries explore topics such as:
RSVP Movies is an Indian film production company which has produced many super-hit movies. They have usually released movies for the Indian audience but for their next project, they are planning to release a movie for the global audience in 2022.
The production company wants to plan their every move analytically based on data. We have taken the last three years of IMDb movie data and carried out the analysis using SQL. We have analysed the data set and drawn meaningful insights that could help them start their new project.
For our convenience, the entire analytics process has been divided into four segments, where each segment leads to significant insights from different combinations of tables. The questions in each segment with business objectives are written in the script given below. We have written the solution code below every question.
Famous paintings and their artists. This data set is published to give students interesting data for practicing SQL.
Photo by Steve Johnson on Unsplash
This dataset was created by Rodrigo Fontanella
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Andrew Dolcimascolo-Garrett
Released under MIT
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was created by Michael Nowell
Released under Database: Open Database, Contents: Database Contents
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Yentür
Released under MIT
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is built for Text-to-SQL (NL → SQL) tasks, helping train models to convert natural language into SQL queries. It is ideal for fine-tuning LLMs, developing AI-powered database assistants, and improving SQL query generation accuracy.
Each row contains the following fields:
- 📝 Instruction – A natural language query (e.g., "Find all customers who placed an order in the last 30 days.")
- 📊 Query – The corresponding SQL statement (e.g., SELECT * FROM orders WHERE order_date >= DATE_SUB(NOW(), INTERVAL 30 DAY);)
- 🗄️ Database – Contains metadata such as:
  - Table Names – The relevant tables for the query (e.g., orders, customers)
  - Column Names – The specific fields used in the query (e.g., order_date, customer_id)
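For fine-tuning, rows with these fields are often flattened into prompt/completion pairs. The sketch below shows one possible layout; the `TextToSqlRow` class, its attribute names, and the prompt template are assumptions for illustration rather than the dataset's actual schema or a prescribed format. The example values are the ones quoted in the field list above.

```python
from dataclasses import dataclass, field


@dataclass
class TextToSqlRow:
    """Hypothetical in-memory view of one dataset row."""
    instruction: str
    query: str
    table_names: list = field(default_factory=list)
    column_names: list = field(default_factory=list)


def to_training_pair(row: TextToSqlRow) -> dict:
    """Build a prompt/completion pair that exposes the schema hints to the model."""
    prompt = (
        f"Tables: {', '.join(row.table_names)}\n"
        f"Columns: {', '.join(row.column_names)}\n"
        f"Question: {row.instruction}\n"
        "SQL:"
    )
    return {"prompt": prompt, "completion": " " + row.query}


row = TextToSqlRow(
    instruction="Find all customers who placed an order in the last 30 days.",
    query="SELECT * FROM orders WHERE order_date >= DATE_SUB(NOW(), INTERVAL 30 DAY);",
    table_names=["orders", "customers"],
    column_names=["order_date", "customer_id"],
)
pair = to_training_pair(row)
```

Note that `DATE_SUB(NOW(), INTERVAL 30 DAY)` in the quoted query is MySQL syntax, so generated queries may need adapting before they run on another database engine.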
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Ada Luo daa
Released under Apache 2.0
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by vishnusan
Released under Apache 2.0
This dataset was created by Parth Mistry 20
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by rayten
Released under Apache 2.0