http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was created by Michael Nowell
Released under Database: Open Database, Contents: Database Contents
WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further split into training (61,297 examples), development (9,145 examples) and test sets (17,284 examples). It can be used for natural language inference tasks related to relational databases.
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This new dataset for Text-to-SQL tasks contains 1300 SQL queries that are frequently encountered in daily use, together with their natural language descriptions. Of these, 966 are SELECT queries and 337 are DELETE and UPDATE queries.
This resource is shared publicly and has a flexible structure so that anyone who wants to contribute can make additions. You can add your own SQL queries or edit existing ones.
I hope this resource supports your SQL learning process. We are waiting for your contributions!
This dataset consists of 8,034 entries designed to evaluate the performance of text-to-SQL models. Each entry contains a natural language text query and its corresponding SQL command. The dataset is a subset derived from the Spider dataset, focusing on diverse and complex queries to challenge the understanding and generation capabilities of machine learning models.
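One common way to score a text-to-SQL model on pairs like these is to execute the predicted query and the reference query against the same database and compare the results. The following is a minimal sketch of that idea, not the evaluation procedure of this dataset: the `singer` table, the three example pairs, and the helper names `run_query` and `execution_match` are all made up for illustration.

```python
import sqlite3


def run_query(conn, sql):
    """Run a query and return its rows in a canonical order, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sorting ignores row order, so ORDER BY differences are not penalised here.
    return sorted(rows, key=repr)


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries run and return the same set of rows."""
    gold = run_query(conn, gold_sql)
    predicted = run_query(conn, predicted_sql)
    return gold is not None and predicted == gold


# Hypothetical toy database standing in for a real benchmark database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (name TEXT, age INTEGER);
INSERT INTO singer VALUES ('Ana', 31), ('Bo', 45), ('Cy', 27);
""")

# (question, gold SQL, predicted SQL); the predictions are invented.
examples = [
    ("How many singers are there?",
     "SELECT COUNT(*) FROM singer",
     "SELECT COUNT(name) FROM singer"),
    ("List singers older than 30.",
     "SELECT name FROM singer WHERE age > 30",
     "SELECT name FROM singer WHERE age >= 45"),
    ("What is the average age?",
     "SELECT AVG(age) FROM singer",
     "SELECT AVG(years) FROM singer"),  # hallucinated column name
]

results = [execution_match(conn, pred, gold) for _, gold, pred in examples]
accuracy = sum(results) / len(results)
print(results, accuracy)
```

Execution matching accepts differently written but equivalent queries (the first pair), which a plain string comparison would reject. It can also accept a wrong query that happens to return the same rows on a small database, so it is best read as an approximation.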
Original Data Source: Text to SQL dataset
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
World Development Indicators (WDI) by World Bank includes data spanning up to 56 years—from 1960 to 2016. WDI frames global trends with indicators on population, population density, urbanization, GNI, and GDP. These indicators measure the world’s economy and progress toward improving lives, achieving sustainable development, providing support for vulnerable populations, and reducing gender disparities.
World Development Indicators Data is the primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.
“World Development Indicators” by the World Bank, used under CC BY 3.0 IGO.
Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:worldbank_wdi
Banner photo by Joshua Rawson-Harris on Unsplash
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
About this dataset
A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80,654 hand-annotated examples of questions and SQL queries distributed across 24,241 tables from Wikipedia.
How to use the dataset
This dataset can be used to develop natural language interfaces for relational databases. The data fields are the same among all splits, and the file contains the phase, question, table, and SQL for each example.
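To make the phase, question, table, and SQL fields more concrete, here is a small sketch that renders one record as a readable query string. It assumes the structured encoding used in the original WikiSQL release (a selected column index, an aggregation index, and a list of `[column, operator, value]` conditions); the file in this listing may encode the SQL field differently. The sample record, the header, and the helper names are invented for illustration.

```python
# Operator tables as used in the original WikiSQL release (assumed to apply here).
AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
COND_OPS = ["=", ">", "<", "OP"]


def quote_value(value):
    """Render a condition value as a SQL literal."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def to_sql(record, header):
    """Turn a WikiSQL-style record into a human-readable SQL string."""
    sql = record["sql"]
    column = '"' + header[sql["sel"]] + '"'
    agg = AGG_OPS[sql["agg"]]
    select = agg + "(" + column + ")" if agg else column
    query = "SELECT " + select + ' FROM "' + record["table_id"] + '"'
    conds = [
        '"' + header[col] + '" ' + COND_OPS[op] + " " + quote_value(val)
        for col, op, val in sql["conds"]
    ]
    if conds:
        query += " WHERE " + " AND ".join(conds)
    return query


# Made-up table header and record, not taken from the dataset.
header = ["Player", "Team", "Points"]
record = {
    "phase": 1,
    "question": "How many Hawks players scored more than 20 points?",
    "table_id": "example-1",
    "sql": {"sel": 0, "agg": 3, "conds": [[1, 0, "Hawks"], [2, 1, 20]]},
}

print(to_sql(record, header))
```

Because the SQL is stored as indices into the table header, a model trained on this format only has to pick columns and operators rather than generate free-form text, which is part of why the header must travel with each question.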
Research Ideas
- Develop natural language interfaces for relational databases.
- Develop a knowledge base of common SQL queries.
- Generate a training set for a neural network that translates natural language into SQL queries.
Acknowledgements
If you use this dataset in your research, please credit the original authors.
CC0
Original Data Source: WikiSQL (Questions and SQL Queries)
https://www.usa.gov/government-works/
This dataset is from the SEC's Financial Statements and Notes Data Set.
It was a personal project to see if I could make the queries efficient.
It has been collecting dust ever since, so maybe someone will make good use of it.
The data runs up to about early 2024.
It doesn't differ from the source other than being compiled, so you can try it out and then compile your own (with the link below).
Dataset was created using SEC Files and SQL Server on Docker.
For details on the SQL Server database this came from, see the "dataset-previous-life-info" folder, which contains:
- Row Counts
- Primary/Foreign Keys
- SQL Statements to recreate database tables
- Example queries on how to join the data tables.
- A pretty picture of the table associations.
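As a rough idea of what joining the data tables can look like, here is a sketch that links a submissions table to a numeric-facts table through the accession number. The table and column names (`sub`, `num`, `adsh`, `tag`, `value`) follow the SEC's published layout as an assumption; the compiled dataset's actual schema is the one documented in the folder above. The rows are fabricated placeholders, and SQLite is used only so the example runs without a server, whereas the dataset itself came from SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed, heavily simplified schema; real tables have many more columns.
conn.executescript("""
CREATE TABLE sub (adsh TEXT PRIMARY KEY, name TEXT, form TEXT, fy INTEGER);
CREATE TABLE num (
    adsh TEXT REFERENCES sub(adsh),
    tag TEXT, ddate TEXT, uom TEXT, value REAL
);
INSERT INTO sub VALUES ('0000000000-24-000001', 'EXAMPLE CORP', '10-K', 2023);
INSERT INTO sub VALUES ('0000000000-24-000002', 'SAMPLE INC', '10-K', 2023);
INSERT INTO num VALUES ('0000000000-24-000001', 'Revenues', '20231231', 'USD', 1200.0);
INSERT INTO num VALUES ('0000000000-24-000001', 'NetIncomeLoss', '20231231', 'USD', 150.0);
INSERT INTO num VALUES ('0000000000-24-000002', 'Revenues', '20231231', 'USD', 800.0);
""")

# One filing (sub) has many numeric facts (num); adsh links the two.
rows = conn.execute(
    """
    SELECT s.name, n.tag, n.value
    FROM sub AS s
    JOIN num AS n ON n.adsh = s.adsh
    WHERE n.tag = ? AND s.form = ?
    ORDER BY n.value DESC
    """,
    ("Revenues", "10-K"),
).fetchall()

for name, tag, value in rows:
    print(name, tag, value)
```

Filtering on the tag before aggregating keeps the join small, which matters on the real tables where each filing carries many numeric rows.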
Source: https://www.sec.gov/data-research/financial-statement-notes-data-sets
Happy coding!
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This curated dataset consists of 269,353 patent documents (published patent applications and granted patents) spanning the 1976 to 2016 period and is intended to help identify promising R&D on the horizon in diagnostics, therapeutics, data analytics, and model biological systems.
USPTO Cancer Moonshot Patent Data was generated using USPTO examiner tools to execute a series of queries designed to identify cancer-specific patents and patent applications. This includes drugs, diagnostics, cell lines, mouse models, radiation-based devices, surgical devices, image analytics, data analytics, and genomic-based inventions.
“USPTO Cancer Moonshot Patent Data” by the USPTO, for public use. Frumkin, Jesse and Myers, Amanda F., Cancer Moonshot Patent Data (August, 2016).
Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:uspto_oce_cancer
Overview
This dataset is a Portuguese-translated version of the b-mc2/sql-create-context dataset, which was constructed from the WikiSQL and Spider datasets. It contains examples of questions in Portuguese, SQL CREATE TABLE statements, and SQL queries that answer the questions using the CREATE TABLE statement as context.
The main goal of this dataset is to assist Portuguese natural language models in generating precise and contextualized SQL queries, preventing the hallucination of column and table names, a common issue in text-to-SQL datasets. By providing only the CREATE TABLE statement as context, the dataset aims to better ground the models without the need to provide actual data rows, limiting token use and exposure to private, sensitive, or proprietary data.
Dataset Details
Total Examples: 78,577
Columns:
- pergunta: The question in natural language.
- contexto: The SQL CREATE TABLE statement that provides the necessary context to answer the question.
- resposta: The SQL query that answers the question using the provided context.
Translation Process
The questions were translated into Portuguese using the facebook/nllb-200-distilled-1.3B model, ensuring that the natural language queries maintain the same meaning and context as the original English questions.
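Because the context is a plain CREATE TABLE statement, a generated query can be checked for hallucinated table or column names without any data rows. The sketch below is one possible way to do that with SQLite, under the assumption that the statements are SQLite-compatible: it creates the empty table from `contexto` and asks SQLite to compile `resposta` with EXPLAIN. The sample row and the function name `query_fits_context` are made up for illustration and are not part of the dataset.

```python
import sqlite3


def query_fits_context(contexto, resposta):
    """Return True if the query compiles against the schema in contexto."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(contexto)       # build the empty table(s)
        conn.execute("EXPLAIN " + resposta)  # compile only, no data needed
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Illustrative row using the dataset's column names.
row = {
    "pergunta": "Quantos chefes de departamento têm mais de 56 anos?",
    "contexto": "CREATE TABLE head (age INTEGER)",
    "resposta": "SELECT COUNT(*) FROM head WHERE age > 56",
}

ok = query_fits_context(row["contexto"], row["resposta"])
# A query that invents a column name fails to compile.
bad = query_fits_context(row["contexto"], "SELECT COUNT(*) FROM head WHERE idade > 56")
print(ok, bad)
```

This only shows that the names resolve against the schema. It says nothing about whether the query actually answers the question.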
Objective and Applications
This dataset is ideal for training natural language models for SQL query generation, especially in scenarios where accuracy in naming columns and tables is crucial. It can be used to enhance model performance in text-to-SQL tasks, providing clear context and avoiding common hallucination errors.
Original Projects
@misc{b-mc2_2023_sql-create-context,
  title = {sql-create-context Dataset},
  author = {b-mc2},
  year = {2023},
  url = {https://huggingface.co/datasets/b-mc2/sql-create-context},
  note = {This dataset was created by modifying data from the following sources: \cite{zhongSeq2SQL2017, yu2018spider}.},
}
@article{zhongSeq2SQL2017,
  author = {Victor Zhong and Caiming Xiong and Richard Socher},
  title = {Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning},
  journal = {CoRR},
  volume = {abs/1709.00103},
  year = {2017}
}
@article{yu2018spider,
  title = {Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task},
  author = {Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and others},
  journal = {arXiv preprint arXiv:1809.08887},
  year = {2018}
}
CC-BY-NC
Original Data Source: Portuguese Text2SQL database
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset was acquired via an SQL query on Google's BigQuery platform. I built a recommender app and performed EDA using this data, which can be viewed here [https://bordei-recommender.streamlit.app/]. Most of the data is categorical, which makes it good practice for those who have little experience working with categorical data.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Office Action Research Dataset for Patents contains detailed information derived from the Office actions issued by patent examiners to applicants during the patent examination process. The “Office action” is a written notification to the applicant of the examiner’s decision on patentability and generally discloses the grounds for a rejection, the claims affected, and the pertinent prior art.
This initial release consists of three files derived from 4.4 million Office actions mailed during the 2008 to mid-2017 period from USPTO examiners to the applicants of 2.2 million unique patent applications.
A working paper describing this dataset is available and can be cited as Lu, Qiang and Myers, Amanda F. and Beliveau, Scott, USPTO Patent Prosecution Research Data: Unlocking Office Action Traits (November 20, 2017). USPTO Economic Working Paper No. 2017-10. Available at SSRN: https://ssrn.com/abstract=3024621.
This effort is made possible by the USPTO Digital Services & Big Data portfolio and collaboration with the USPTO Office of the Chief Economist (OCE). The OCE provides these data files for public use and encourages users to identify fixes and improvements. Please provide all feedback to: EconomicsData@uspto.gov.
Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:uspto_oce_office_actions
Banner photo by Trent Erwin on Unsplash
Patent Examination Data System gives users access to multiple records of USPTO patent application or patent filing status at no cost. PEDS is updated daily and mirrors the data available in the Patent Application Location and Monitoring system (PALM). PEDS provides access to public applications including: published patent applications and patents. PCT applications that have not been published by WIPO. Any applications that have not been released by the USPTO will not be available in PEDS.
USPTO Patent Examiner Data System (PEDS) API Data contains data from the examination process of USPTO patent applications. PEDS contains the bibliographic, published document and patent term extension data tabs in Public PAIR from 1981 to present. There is also some data dating back to 1935.
Fork this notebook to get started on accessing data in the BigQuery dataset using the BQhelper package to write SQL queries.
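Any SQL sent to BigQuery needs a table reference. The Data Origin links in these listings identify a dataset as `project:dataset`, while standard SQL expects a backtick-quoted `project.dataset.table`. The sketch below only builds that reference and a query string. It does not call BigQuery or the BQhelper package, and the table name `some_table` is a hypothetical placeholder, since the listing does not name the dataset's tables.

```python
from urllib.parse import urlparse


def dataset_from_origin(url):
    """Split the trailing 'project:dataset' part of a Data Origin URL."""
    last_segment = urlparse(url).path.rsplit("/", 1)[-1]
    project, dataset = last_segment.split(":", 1)
    return project, dataset


def table_reference(url, table):
    """Build a backtick-quoted standard SQL table reference."""
    project, dataset = dataset_from_origin(url)
    # Backticks are needed because the project id contains hyphens.
    return "`" + project + "." + dataset + "." + table + "`"


origin = "https://bigquery.cloud.google.com/dataset/patents-public-data:uspto_peds"

# 'some_table' is a placeholder; list the dataset's tables to find real names.
reference = table_reference(origin, "some_table")
query = "SELECT * FROM " + reference + " LIMIT 10"
print(query)
```

Keeping a LIMIT on exploratory queries is a sensible habit, although BigQuery bills by the data scanned rather than the rows returned, so selecting only the needed columns matters more.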
"Patent Examination Data System" by the USPTO, for public use.
Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:uspto_peds
Banner photo by Thought Catalog on Unsplash
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Office of the Chief Economist (OCE) is responsible for advising the Under Secretary of Commerce for Intellectual Property and Director of the USPTO on the economic implications of policies and programs affecting the U.S. intellectual property (IP) system. The office disseminates detailed patent and trademark data, undertakes research, and conducts economic analysis on a variety of IP issues. OCE works with policy makers, collaborates with academics, and engages the public more generally through conferences it organizes, the publicly accessible research datasets it provides, and its publications.
The USPTO OCE Patent Assignment Dataset contains detailed data on patent assignments and other transactions recorded at the USPTO since 1970.
"USPTO OCE Patent Assignment Data" by the USPTO, for public use. Marco, Alan C., Graham, Stuart J.H., Myers, Amanda F., D'Agostino, Paul A and Apple, Kirsten, "The USPTO Patent Assignment Dataset: Descriptions and Analysis" (July 27, 2015).
Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:uspto_oce_assignment
Banner photo by Jeff Sheldon on Unsplash