13 datasets found

SQL Queries
kaggle.com
Updated Apr 19, 2024
Cite
Michael Nowell (2024). SQL Queries [Dataset]. https://www.kaggle.com/datasets/michaelnowell/sql-queries/discussion
Dataset updated
Apr 19, 2024
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Michael Nowell
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Dataset

This dataset was created by Michael Nowell

Released under Database: Open Database, Contents: Database Contents

WikiSQL Dataset
paperswithcode.com
opendatalab.com
Cite
Victor Zhong; Caiming Xiong; Richard Socher, WikiSQL Dataset [Dataset]. https://paperswithcode.com/dataset/wikisql
Authors
Victor Zhong; Caiming Xiong; Richard Socher
Description
WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further split into training (61,297 examples), development (9,145 examples) and test sets (17,284 examples). It can be used for natural language inference tasks related to relational databases.
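As a quick sanity check on the figures quoted above, the following sketch sums the three split sizes and reports each split's share of the corpus. The numbers are copied from the description; the variable names are illustrative only.

```python
# Split sizes exactly as quoted in the WikiSQL description above.
splits = {"train": 61_297, "dev": 9_145, "test": 17_284}

# The three splits should add up to the stated corpus size.
total = sum(splits.values())

# Percentage of the corpus in each split, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(total)
print(shares)
```

The total matches the 87,726 pairs stated in the description, so the three splits account for the whole corpus.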
Turkish Query and SQL_Türkçe Sorular ve SQL
kaggle.com
Updated Jun 9, 2023
Cite
Sitem BARIŞ (2023). Turkish Query and SQL_Türkçe Sorular ve SQL [Dataset]. https://www.kaggle.com/datasets/sitembari/turkish-query-and-sql-trke-sorular-ve-sql/discussion
Dataset updated
Jun 9, 2023
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Sitem BARIŞ
License
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
[EN] This is a new dataset for Text-to-SQL tasks. It contains 1300 SQL queries frequently encountered in daily use, together with their natural language descriptions. Of these, 966 are SELECT queries and 337 are DELETE and UPDATE queries.

This dataset is shared publicly and has a flexible structure, so anyone who wants to contribute can make additions. You can also add your own SQL queries or edit existing queries.

I hope this resource will support your SQL learning process and help you. We are waiting for your contributions!

[TR]

This dataset was created for Text-to-SQL tasks and contains 1300 SQL queries frequently encountered in daily use, along with their natural language descriptions. Of these, 966 are SELECT queries and 337 are DELETE and UPDATE queries.

This resource is shared publicly and has a flexible structure so that anyone who wants to contribute can make additions. You can also add your own SQL queries or edit existing ones.

I hope this resource supports your SQL learning process and helps you. We look forward to your contributions!
Text to SQL dataset
opendatabay.com
Updated Jun 14, 2025
Cite
Datasimple (2025). Text to SQL dataset [Dataset]. https://www.opendatabay.com/data/science-research/03ab3b68-bf0d-44b9-a1cb-cab6072881bb
Dataset updated
Jun 14, 2025
Dataset authored and provided by
Datasimple
Area covered
Data Science and Analytics
Description
This dataset consists of 8,034 entries designed to evaluate the performance of text-to-SQL models. Each entry contains a natural language text query and its corresponding SQL command. The dataset is a subset derived from the Spider dataset, focusing on diverse and complex queries to challenge the understanding and generation capabilities of machine learning models.

Original Data Source: Text to SQL dataset
World Development Indicators (WDI) Data
kaggle.com
zip
Updated Aug 27, 2018
Cite
Google BigQuery (2018). World Development Indicators (WDI) Data [Dataset]. https://www.kaggle.com/datasets/bigquery/worldbank-wdi
Dataset updated
Aug 27, 2018
Dataset provided by
BigQuery: https://cloud.google.com/bigquery
Authors
Google BigQuery
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Fork this notebook to get started on accessing data in the BigQuery dataset by writing SQL queries using the BQhelper module.

Context

World Development Indicators (WDI) by World Bank includes data spanning up to 56 years—from 1960 to 2016. WDI frames global trends with indicators on population, population density, urbanization, GNI, and GDP. These indicators measure the world’s economy and progress toward improving lives, achieving sustainable development, providing support for vulnerable populations, and reducing gender disparities.

Content

World Development Indicators Data is the primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.

Acknowledgements

“World Development Indicators” by the World Bank, used under CC BY 3.0 IGO.

Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:worldbank_wdi
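The Data Origin URL above ends in a `<project>:<dataset>` pair, which is the information a SQL query needs to address the dataset. The sketch below shows one way to split that pair out and build a fully qualified table reference. The helper names and the table name `some_table` are hypothetical, not taken from the listing, and the snippet only builds the query string; it does not contact BigQuery or use the BQhelper module.

```python
from typing import Tuple
from urllib.parse import urlparse


def dataset_ref_from_origin(url: str) -> Tuple[str, str]:
    """Split a 'Data Origin' URL into (project, dataset)."""
    # The last path segment has the form '<project>:<dataset>'.
    ref = urlparse(url).path.rsplit("/", 1)[-1]
    project, sep, dataset = ref.partition(":")
    if not (project and sep and dataset):
        raise ValueError(f"no '<project>:<dataset>' segment in {url!r}")
    return project, dataset


def count_rows_sql(project: str, dataset: str, table: str) -> str:
    """Build a standard-SQL row count for one table (string only, not executed)."""
    return f"SELECT COUNT(*) AS n FROM `{project}.{dataset}.{table}`"


origin = "https://bigquery.cloud.google.com/dataset/patents-public-data:worldbank_wdi"
project, dataset = dataset_ref_from_origin(origin)

# 'some_table' is a placeholder, not a table name taken from the listing.
print(count_rows_sql(project, dataset, "some_table"))
```

The same helper should apply to the other Data Origin links in this listing, since they appear to follow the same URL pattern.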

Banner photo by Joshua Rawson-Harris on Unsplash
WikiSQL (Questions and SQL Queries)
opendatabay.com
Updated Jun 23, 2025
Cite
Datasimple (2025). WikiSQL (Questions and SQL Queries) [Dataset]. https://www.opendatabay.com/data/ai-ml/5a0fa182-be98-46d5-96e4-60ac97c14760
Dataset updated
Jun 23, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Data Science and Analytics
Description
About this dataset

A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80,654 hand-annotated examples of questions and SQL queries distributed across 24,241 tables from Wikipedia.

How to use the dataset

This dataset can be used to develop natural language interfaces for relational databases. The data fields are the same among all splits, and the file contains information on the phase, question, table, and SQL for each example.

Research Ideas

- Develop natural language interfaces for relational databases.
- Build a knowledge base of common SQL queries.
- Generate a training set for a neural network that translates natural language into SQL queries.

Acknowledgements

If you use this dataset in your research, please credit the original authors.

License

CC0

Original Data Source: WikiSQL (Questions and SQL Queries)
SECs Compiled Financial Statements & Notes Dataset
kaggle.com
Updated Jul 31, 2024
Cite
Deny Tran (2024). SECs Compiled Financial Statements & Notes Dataset [Dataset]. https://www.kaggle.com/datasets/denytran/im-a-dataset
Dataset updated
Jul 31, 2024
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Deny Tran
License
https://www.usa.gov/government-works/
Description
This dataset is from the SEC's Financial Statements and Notes Data Set.
It was a personal project to see if I could make the queries efficient.
It's just been collecting dust ever since, maybe someone will make good use of it.
Data is up to about early-2024.
It doesn't differ from the source, other than it's compiled - so maybe you can try it out, then compile your own (with the link below).
Dataset was created using SEC Files and SQL Server on Docker.
For details on the SQL Server database this came from, see the "dataset-previous-life-info" folder, which contains:
- Row counts
- Primary/foreign keys
- SQL statements to recreate the database tables
- Example queries showing how to join the data tables
- A pretty picture of the table associations

Source: https://www.sec.gov/data-research/financial-statement-notes-data-sets

Happy coding!
USPTO Cancer Moonshot Patent Data
kaggle.com
zip
Updated Feb 12, 2019
Cite
Google BigQuery (2019). USPTO Cancer Moonshot Patent Data [Dataset]. https://www.kaggle.com/datasets/bigquery/uspto-oce-cancer
Dataset updated
Feb 12, 2019
Dataset provided by
BigQuery: https://cloud.google.com/bigquery
Authors
Google BigQuery
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Fork this notebook to get started on accessing data in the BigQuery dataset by writing SQL queries using the BQhelper module.

Context

This curated dataset consists of 269,353 patent documents (published patent applications and granted patents) spanning the 1976 to 2016 period and is intended to help identify promising R&D on the horizon in diagnostics, therapeutics, data analytics, and model biological systems.

Content

USPTO Cancer Moonshot Patent Data was generated using USPTO examiner tools to execute a series of queries designed to identify cancer-specific patents and patent applications. This includes drugs, diagnostics, cell lines, mouse models, radiation-based devices, surgical devices, image analytics, data analytics, and genomic-based inventions.

Acknowledgements

“USPTO Cancer Moonshot Patent Data” by the USPTO, for public use. Frumkin, Jesse and Myers, Amanda F., Cancer Moonshot Patent Data (August, 2016).

Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:uspto_oce_cancer

Banner photo by Jaron Nix on Unsplash
Portuguese Text2SQL database
opendatabay.com
Updated Jun 22, 2025
Cite
Datasimple (2025). Portuguese Text2SQL database [Dataset]. https://www.opendatabay.com/data/ai-ml/e4213f60-3136-497b-a7ac-09504fbd0b79
Dataset updated
Jun 22, 2025
Dataset authored and provided by
Datasimple
Area covered
Data Science and Analytics
Description
Overview

This dataset is a Portuguese-translated version of the b-mc2/sql-create-context dataset, which was constructed from the WikiSQL and Spider datasets. It contains examples of questions in Portuguese, SQL CREATE TABLE statements, and SQL queries that answer the questions using the CREATE TABLE statement as context.

The main goal of this dataset is to assist Portuguese natural language models in generating precise and contextualized SQL queries, preventing the hallucination of column and table names, a common issue in text-to-SQL datasets. By providing only the CREATE TABLE statement as context, the dataset aims to better ground the models without the need to provide actual data rows, limiting token use and exposure to private, sensitive, or proprietary data.

Dataset Details

Total Examples: 78,577

Columns:
- pergunta: The question in natural language.
- contexto: The SQL CREATE TABLE statement that provides the necessary context to answer the question.
- resposta: The SQL query that answers the question using the provided context.

Translation Process

The questions were translated into Portuguese using the facebook/nllb-200-distilled-1.3B model, ensuring that the natural language queries maintain the same meaning and context as the original English questions.
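To make the role of the three columns (pergunta, contexto, resposta) concrete, here is a minimal sketch of how one record might be turned into a model prompt, and how the CREATE TABLE context can be used to catch a query that references a column the schema does not have. The record is invented for illustration and is not taken from the dataset; the function names are hypothetical; and SQLite is used only as a convenient local checker, on the assumption that the statements are simple enough for its dialect.

```python
import sqlite3


def build_prompt(example: dict) -> str:
    """Combine the schema (contexto) and question (pergunta) into one prompt."""
    return (
        "-- Schema\n"
        f"{example['contexto']}\n"
        "-- Question\n"
        f"-- {example['pergunta']}\n"
        "-- SQL\n"
    )


def query_matches_schema(contexto: str, resposta: str) -> bool:
    """Return True if the query runs against an empty database built from contexto."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(contexto)       # contexto may hold several CREATE TABLE statements
        conn.execute(resposta).fetchall()  # fails on unknown tables or columns
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Hypothetical record using the three column names from the description.
example = {
    "pergunta": "Quantos funcionários trabalham no departamento de vendas?",
    "contexto": "CREATE TABLE funcionarios (id INTEGER, nome VARCHAR, departamento VARCHAR)",
    "resposta": "SELECT COUNT(*) FROM funcionarios WHERE departamento = 'vendas'",
}

print(build_prompt(example))
print(query_matches_schema(example["contexto"], example["resposta"]))
# A query naming a column that is absent from the schema is rejected.
print(query_matches_schema(example["contexto"], "SELECT salario FROM funcionarios"))
```

Because the check needs only the CREATE TABLE statement, no data rows are required, which is in line with the stated goal of grounding models without exposing actual table contents.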

Objective and Applications

This dataset is ideal for training natural language models for SQL query generation, especially in scenarios where accuracy in naming columns and tables is crucial. It can be used to enhance model performance in text-to-SQL tasks, providing clear context and avoiding common hallucination errors.

Original Projects

@misc{b-mc2_2023_sql-create-context,
  title = {sql-create-context Dataset},
  author = {b-mc2},
  year = {2023},
  url = {https://huggingface.co/datasets/b-mc2/sql-create-context},
  note = {This dataset was created by modifying data from the following sources: \cite{zhongSeq2SQL2017, yu2018spider}.},
}

@article{zhongSeq2SQL2017,
  author = {Victor Zhong and Caiming Xiong and Richard Socher},
  title = {Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning},
  journal = {CoRR},
  volume = {abs/1709.00103},
  year = {2017}
}

@article{yu2018spider,
  title = {Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task},
  author = {Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and others},
  journal = {arXiv preprint arXiv:1809.08887},
  year = {2018}
}

License

CC-BY-NC

Original Data Source: Portuguese Text2SQL database
Google Merchandise Analytics (Large +3GB)
kaggle.com
Updated Nov 30, 2024
Cite
Andrew Bordei (2024). Google Merchandise Analytics (Large +3GB) [Dataset]. https://www.kaggle.com/datasets/andrewbordei/google-merchandise-analytics-large-3gb/versions/1
Dataset updated
Nov 30, 2024
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Andrew Bordei
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
The dataset was acquired via a SQL query against the Google BigQuery platform. I built a recommender app and performed EDA using this data, which can be viewed at https://bordei-recommender.streamlit.app/. Most of the data is categorical, which makes it good practice for those who have little experience working with categorical data.
The Office Action Research Dataset for Patents
kaggle.com
zip
Updated Feb 12, 2019
Cite
Google BigQuery (2019). The Office Action Research Dataset for Patents [Dataset]. https://www.kaggle.com/datasets/bigquery/uspto-oce-office-actions
Dataset updated
Feb 12, 2019
Dataset provided by
BigQuery: https://cloud.google.com/bigquery
Authors
Google BigQuery
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Fork this notebook to get started on accessing data in the BigQuery dataset by writing SQL queries using the BQhelper module.

Context

The Office Action Research Dataset for Patents contains detailed information derived from the Office actions issued by patent examiners to applicants during the patent examination process. The “Office action” is a written notification to the applicant of the examiner’s decision on patentability and generally discloses the grounds for a rejection, the claims affected, and the pertinent prior art.

Content

This initial release consists of three files derived from 4.4 million Office actions mailed during the 2008 to mid-2017 period from USPTO examiners to the applicants of 2.2 million unique patent applications.

Acknowledgements

A working paper describing this dataset is available and can be cited as Lu, Qiang and Myers, Amanda F. and Beliveau, Scott, USPTO Patent Prosecution Research Data: Unlocking Office Action Traits (November 20, 2017). USPTO Economic Working Paper No. 2017-10. Available at SSRN: https://ssrn.com/abstract=3024621.

This effort is made possible by the USPTO Digital Services & Big Data portfolio and collaboration with the USPTO Office of the Chief Economist (OCE). The OCE provides these data files for public use and encourages users to identify fixes and improvements. Please provide all feedback to: EconomicsData@uspto.gov.

Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:uspto_oce_office_actions

Banner photo by Trent Erwin on Unsplash
USPTO Patent Examiner Data System (PEDS) Data
kaggle.com
Updated Feb 12, 2019
Cite
Google BigQuery (2019). USPTO Patent Examiner Data System (PEDS) Data [Dataset]. https://www.kaggle.com/bigquery/uspto-peds
Dataset updated
Feb 12, 2019
Dataset provided by
BigQuery: https://cloud.google.com/bigquery
Authors
Google BigQuery
Description
Fork this notebook to get started on accessing data in the BigQuery dataset by writing SQL queries using the BQhelper module.

Context

Patent Examination Data System gives users access to multiple records of USPTO patent application or patent filing status at no cost. PEDS is updated daily and mirrors the data available in the Patent Application Location and Monitoring system (PALM). PEDS provides access to public applications including: published patent applications and patents. PCT applications that have not been published by WIPO. Any applications that have not been released by the USPTO will not be available in PEDS.

Content

USPTO Patent Examiner Data System (PEDS) API Data contains data from the examination process of USPTO patent applications. PEDS contains the bibliographic, published document and patent term extension data tabs in Public PAIR from 1981 to present. There is also some data dating back to 1935.

Acknowledgements

"Patent Examination Data System" by the USPTO, for public use.

Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:uspto_peds

Banner photo by Thought Catalog on Unsplash
USPTO OCE Patent Assignment Dataset
kaggle.com
zip
Updated Feb 12, 2019
Cite
Google BigQuery (2019). USPTO OCE Patent Assignment Dataset [Dataset]. https://www.kaggle.com/bigquery/uspto-oce-assignment
Dataset updated
Feb 12, 2019
Dataset provided by
BigQuery: https://cloud.google.com/bigquery
Authors
Google BigQuery
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Fork this notebook to get started on accessing data in the BigQuery dataset by writing SQL queries using the BQhelper module.

Context

The Office of the Chief Economist (OCE) is responsible for advising the Under Secretary of Commerce for Intellectual Property and Director of the USPTO on the economic implications of policies and programs affecting the U.S. intellectual property (IP) system. The office disseminates detailed patent and trademark data, undertakes research, and conducts economic analysis on a variety of IP issues. OCE works with policy makers, collaborates with academics, and engages the public more generally through conferences it organizes, the publicly accessible research datasets it provides, and its publications.

Content

The USPTO OCE Patent Assignment Dataset contains detailed data on patent assignments and other transactions recorded at the USPTO since 1970.

Acknowledgements

"USPTO OCE Patent Assignment Data" by the USPTO, for public use. Marco, Alan C., Graham, Stuart J.H., Myers, Amanda F., D'Agostino, Paul A and Apple, Kirsten, "The USPTO Patent Assignment Dataset: Descriptions and Analysis" (July 27, 2015).

Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:uspto_oce_assignment

Banner photo by Jeff Sheldon on Unsplash