WikiSQL is a corpus of 87,726 hand-annotated pairs of SQL queries and natural language questions, split into training (61,297 examples), development (9,145 examples), and test (17,284 examples) sets. It can be used for natural language inference tasks related to relational databases.
The Spider dataset is used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". This evaluation set was created from the dev split of the Spider dataset (2020-06-07 version, from https://yale-lily.github.io/spider). We manually modified the original questions to remove explicit mentions of column names, while keeping the SQL queries unchanged, to better evaluate the model's capability to align the NL utterance with the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This data set contains the outcomes of a series of pull-out tests (and 3D scan analyses) investigating the bond behavior of naturally corroded plain reinforcement sourced from a decommissioned structure. The data set includes: 1) photos and test measurements for all tested rebars; 2) an SQL database containing all unprocessed pull-out data; and 3) an SQL database containing processed data. Data is provided in SQL format to permit querying of entries across the otherwise large and parametrically diverse data set. A "Read Me" file is also provided with additional descriptions of the contents.
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free, editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and has more than two million registered users, who can add data via manual surveys, GPS devices, aerial photography, and other free sources.
To aid researchers, data scientists, and analysts in the effort to combat COVID-19, Google is making a hosted repository of public datasets, including OpenStreetMap data, free to access. To make the BigQuery dataset easier for the Kaggle community to access, it has been onboarded to the Kaggle platform, which allows querying it without a linked GCP account. Please note that, due to the large size of the dataset, Kaggle applies a quota of 5 TB of data scanned per user per 30 days.
This is the OpenStreetMap (OSM) planet-wide dataset loaded to BigQuery.
Tables:
- history_* tables: full history of OSM objects.
- planet_* tables: snapshot of current OSM objects as of Nov 2019.
The history_* and planet_* table groups are composed of node, way, relation, and changeset tables. These contain the primary OSM data types, plus an additional changeset table corresponding to OSM edits, for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type so that they can be operated on with the built-in geography functions to perform geometry and feature selection and additional processing.
You can read more about OSM elements on the OSM Wiki. This dataset uses the BigQuery GEOGRAPHY data type, which supports a set of functions for analyzing geographical data, determining spatial relationships between geographical features, and constructing or manipulating GEOGRAPHYs.
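For example, a query of the following shape combines tag filtering with a geography predicate. This is a minimal sketch, assuming the nodes table lives at `bigquery-public-data.geo_openstreetmap.planet_nodes` and exposes a GEOGRAPHY column `geometry` plus a key/value tag array `all_tags`; these names are assumptions, not confirmed by this page.

```sql
-- Sketch: count hospital nodes within 10 km of central Madrid.
-- Table path and column names are assumptions about the BigQuery layout.
SELECT COUNT(*) AS hospitals_within_10km
FROM `bigquery-public-data.geo_openstreetmap.planet_nodes` AS n,
     UNNEST(n.all_tags) AS tag
WHERE tag.key = 'amenity'
  AND tag.value = 'hospital'
  AND ST_DWITHIN(n.geometry, ST_GEOGPOINT(-3.7038, 40.4168), 10000);
```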
Objective: To enhance the accuracy of information retrieval from pharmacovigilance (PV) databases by employing Large Language Models (LLMs) to convert natural language queries (NLQs) into Structured Query Language (SQL) queries, leveraging a business context document.

Materials and Methods: We utilized OpenAI's GPT-4 model within a retrieval-augmented generation (RAG) framework, enriched with a business context document, to transform NLQs into executable SQL queries. Each NLQ was presented to the LLM randomly and independently to prevent memorization. The study was conducted in three phases, varying query complexity and assessing the LLM's performance both with and without the business context document.

Results: Our approach significantly improved NLQ-to-SQL accuracy, increasing from 8.3% with the database schema alone to 78.3% with the business context document. This enhancement was consistent across low-, medium-, and high-complexity queries, indicating the critical role of contextual ...

This record contains the test set of NLQs used in the paper "Automating Pharmacovigilance Evidence Generation: Using Large Language Models to Produce Context-Aware SQL". Also included are the Python scripts for the LLM processing, the R code for the statistical analysis of results, and a copy of the business context document and essential tables.

# Automating Pharmacovigilance Evidence Generation: Using Large Language Models to Produce Context-Aware SQL
https://doi.org/10.5061/dryad.2280gb63n
NLQ_Queries.xls contains the set of test NLQs, the LLM's response in each phase of the experiment, and the complexity scores computed for each NLQ.
The business context document is supplied as a PDF, together with the Python and R code used to generate our results. The essential tables used in Phases 2 and 3 of the experiment are included in the text file.
Description: Contains all NLQ queries with the results of the LLM output and the pass/fail status of each.
Column Definitions:
Below are the column names, in order, each with a detailed description.
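To illustrate the NLQ-to-SQL task this test set targets, here is a hypothetical NLQ and the kind of SQL such a pipeline might emit. The schema (case_master, case_drug, and their columns) is invented for illustration and is not the study's actual PV database.

```sql
-- NLQ: "How many serious cases were reported for drug X in 2023?"
-- The tables and columns below are hypothetical stand-ins.
SELECT COUNT(DISTINCT cm.case_id) AS serious_cases
FROM case_master cm
JOIN case_drug cd ON cd.case_id = cm.case_id
WHERE cd.drug_name = 'DRUG X'
  AND cm.serious_flag = 'Y'
  AND cm.receipt_date BETWEEN DATE '2023-01-01' AND DATE '2023-12-31';
```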
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
The USPTO grants US patents to inventors and assignees all over the world. For researchers in particular, PatentsView is intended to encourage the study and understanding of the intellectual property (IP) and innovation system; to serve the government's fundamental function of creating "public good" platforms from these data; and to eliminate redundant cleaning, converting, and matching of these data by individual researchers, freeing up researcher time to do what they do best: study IP, innovation, and technological change.
PatentsView is a database that longitudinally links inventors, their organizations, locations, and overall patenting activity. It is derived from USPTO bulk data files.
Fork this notebook to get started on accessing data in the BigQuery dataset using the BQhelper package to write SQL queries.
“PatentsView” by the USPTO, US Department of Agriculture (USDA), the Center for the Science of Science and Innovation Policy, New York University, the University of California at Berkeley, Twin Arch Technologies, and Periscopic, used under CC BY 4.0.
Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:patentsview
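As a starting point, a query of the following shape can be run against the dataset. This is a minimal sketch assuming a `patent` table with a string `date` column; the actual table layout may differ.

```sql
-- Sketch: count granted patents per year in the PatentsView data.
-- Table path and column names are assumptions about the BigQuery layout.
SELECT EXTRACT(YEAR FROM CAST(`date` AS DATE)) AS grant_year,
       COUNT(*) AS patents_granted
FROM `patents-public-data.patentsview.patent`
GROUP BY grant_year
ORDER BY grant_year DESC;
```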
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
ChEMBL is maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL), based at the Wellcome Trust Genome Campus, Hinxton, UK.
ChEMBL is a manually curated database of bioactive molecules with drug-like properties used in drug discovery, including information about existing patented drugs.
Schema: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/chembl_23_schema.png
Documentation: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/schema_documentation.html
Fork this notebook to get started on accessing data in the BigQuery dataset using the BQhelper package to write SQL queries.
“ChEMBL” by the European Bioinformatics Institute (EMBL-EBI), used under CC BY-SA 3.0. Modifications have been made to add normalized publication numbers.
Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:ebi_chembl
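As a starting point, a query like the following can list approved drugs. This is a minimal sketch assuming a `molecule_dictionary` table matching the chembl_23 schema linked above; the exact BigQuery table name and column types may differ (e.g. numeric columns stored as strings).

```sql
-- Sketch: ten approved (max_phase = 4) molecules from ChEMBL.
-- Table path and column types are assumptions about the BigQuery copy.
SELECT chembl_id, pref_name
FROM `patents-public-data.ebi_chembl.molecule_dictionary`
WHERE SAFE_CAST(max_phase AS INT64) = 4
LIMIT 10;
```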
Knowledge Graph Construction Workshop 2023: challenge

Knowledge graph construction of heterogeneous data has seen a lot of uptake in the last decade, from compliance to performance optimizations with respect to execution time. However, beyond execution time, metrics such as CPU or memory usage are rarely considered when comparing knowledge graph construction systems. This challenge aims at benchmarking systems to find which RDF graph construction system optimizes for metrics such as execution time, CPU, memory usage, or a combination of these.

Task description

The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state of the art of existing tools and the baseline results provided by this challenge. The challenge is not limited to execution time (creating the fastest pipeline) but also covers computing resources (achieving the most efficient pipeline).

We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates the metrics needed for this challenge (execution time, CPU and memory usage) as CSV files. Information about the hardware used during the execution of the pipeline is recorded as well, to allow fair comparison of different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool has already been tested with existing systems, relational databases (e.g. MySQL and PostgreSQL), and triplestores (e.g. Apache Jena Fuseki and OpenLink Virtuoso), which can be combined in any configuration. You are strongly encouraged to use this tool for participating in this challenge. If you prefer a different tool, or our tool imposes technical requirements you cannot meet, please contact us directly.

Part 1: Knowledge Graph Construction Parameters

These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline.

Data
- Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).
- Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).
- Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).
- Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).
- Number of input files: scaling the number of datasets (1, 5, 10, 15).

Mappings
- Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).
- Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).
- Number and type of joins: scaling the number and type of joins (1-1, N-1, 1-N, N-M).

Part 2: GTFS-Madrid-Bench

The GTFS-Madrid-Bench provides insights into the pipeline with real data from the public transport domain in Madrid.

Scaling
- GTFS-1 SQL
- GTFS-10 SQL
- GTFS-100 SQL
- GTFS-1000 SQL

Heterogeneity
- GTFS-100 XML + JSON
- GTFS-100 CSV + XML
- GTFS-100 CSV + JSON
- GTFS-100 SQL + XML + JSON + CSV

Example pipeline

The ground truth dataset and baseline results are generated in different steps for each parameter:
1. The provided CSV files and SQL schema are loaded into a MySQL relational database (see the sketch after this list).
2. Mappings are executed against the MySQL relational database to construct a knowledge graph in N-Triples RDF format.
3. The constructed knowledge graph is loaded into a Virtuoso triplestore, tuned according to the Virtuoso documentation.
4. The provided SPARQL queries are executed on the SPARQL endpoint exposed by Virtuoso.

The pipeline is executed 5 times, from which the median execution time of each step is calculated; each step is then reported in the baseline results with the median execution time and all its measured metrics.
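As an illustration of the first pipeline step, here is a minimal MySQL sketch. The file name, table name, and columns are invented stand-ins, not the challenge's actual GTFS artifacts.

```sql
-- Hypothetical example: loading one GTFS CSV file (stops) into MySQL.
-- Table layout and file name are illustrative assumptions.
CREATE TABLE IF NOT EXISTS stops (
  stop_id   VARCHAR(64) PRIMARY KEY,
  stop_name VARCHAR(255),
  stop_lat  DECIMAL(9,6),
  stop_lon  DECIMAL(9,6)
);

LOAD DATA LOCAL INFILE 'stops.csv'
INTO TABLE stops
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(stop_id, stop_name, stop_lat, stop_lon);
```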
Query timeout is set to 1 hour, and the knowledge graph construction timeout to 24 hours. The execution is performed with the tool described above.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Code and data extracts from which Figures 1-4 were derived and upon which several comments in the main body of the text rely. Each Excel file contains a data extract and, where appropriate, the SQL code for Dimensions on BigQuery used to produce the data.