https://choosealicense.com/licenses/bsd-3-clause/
WikiSQL Dataset (Reformatted for Generative Models)
This is the exact same dataset as WikiSQL: https://huggingface.co/datasets/wikisql, but with the data reformatted to allow direct use with text generation LLMs. The original license and credits for the original dataset remain in place. Specifically, the changes from standard WikiSQL are:
The table details in WikiSQL were included as dictionaries but tools like LangChain and LlamaIndex build their prompts using a SQL DESCRIBE of… See the full description on the dataset page: https://huggingface.co/datasets/tjaffri/wikisql-generate.
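The reformatting described above can be illustrated with a short sketch: a WikiSQL-style table dictionary (the `header` and `types` fields follow the original WikiSQL format) is flattened into a DESCRIBE-like schema string that can be pasted into an LLM prompt. The output layout here is illustrative, not necessarily the exact format used by tjaffri/wikisql-generate.

```python
# Hypothetical sketch: turn a WikiSQL-style table dict into a
# DESCRIBE-like schema string for use in an LLM prompt.
# "header" and "types" are the field names used by the original
# WikiSQL dataset; the output layout is illustrative only.

def table_to_describe(table: dict) -> str:
    lines = [f"{col} {typ}" for col, typ in zip(table["header"], table["types"])]
    return "\n".join(lines)

table = {"header": ["Player", "No.", "Position"],
         "types": ["text", "real", "text"]}
print(table_to_describe(table))
```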
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The appendix of our ICSE 2018 paper "Search-Based Test Data Generation for SQL Queries: Appendix".
The appendix contains:
The queries from the three open source systems we used in the evaluation of our tool (the industry software system is not part of this appendix, due to privacy reasons)
The results of our evaluation.
The source code of the tool. Most recent version can be found at https://github.com/SERG-Delft/evosql.
The results of the tuning procedure we conducted before running the final evaluation.
This dataset is a merged collection of multiple text-to-SQL datasets, designed to provide a comprehensive resource for training and evaluating text-to-SQL models. It combines data from several popular benchmarks, including Spider, CoSQL, SparC, and others, to create a diverse and robust dataset for natural language to SQL query generation tasks. Dataset Details Dataset Description Curated by: Mudasir Ahmad Mir Language(s) (NLP): English License: Apache 2.0 This dataset is ideal for researchers… See the full description on the dataset page: https://huggingface.co/datasets/Mudasir692/text-to-sql.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This dataset contains SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are union-query SQL injection and blind SQL injection, performed using the SQLMAP tool.
The NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for collecting and monitoring network traffic flow data. A flow is defined as a unidirectional sequence of packets sharing common properties that pass through a network device.
Datasets
The first dataset was collected to train the detection models (D1); the other was collected using different attacks than those used in training, to test the models and ensure their generalization (D2).
The datasets contain both benign and malicious traffic. All collected datasets are balanced.
The version of NetFlow used to build the datasets is 5.
Dataset | Aim | Samples | Benign-malicious traffic ratio |
---|---|---|---|
D1 | Training | 400,003 | 50% |
D2 | Test | 57,239 | 50% |
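As a sanity check, the 50/50 benign-malicious balance stated in the table can be verified on a label column. The snippet below runs on a toy list of labels; the real datasets ship as NetFlow v5 records, and the label values used here are hypothetical.

```python
# Illustrative balance check for the 50/50 benign-malicious ratio
# described above, on a toy label list (label names are hypothetical).
from collections import Counter

def balance(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

labels = ["benign", "malicious"] * 5
print(balance(labels))  # {'benign': 0.5, 'malicious': 0.5}
```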
Infrastructure and implementation
Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has a sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows.
DOROTHEA is configured to use NetFlow v5 and to export a flow after it has been inactive for 15 seconds or active for 1800 seconds (30 minutes).
Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. These tasks run as Python scripts; users may customize them or incorporate their own. The network traffic is managed by a gateway that performs two main tasks: it routes packets to the Internet, and it forwards traffic to a NetFlow data generation node (packets received from the Internet are processed in the same way).
The malicious traffic collected (SQLI attacks) was performed using SQLMAP. SQLMAP is a penetration tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.
The attacks were executed from 16 nodes, each launching SQLMAP with the parameters listed in the following table.
Parameters | Description |
---|---|
'--banner', '--current-user', '--current-db', '--hostname', '--is-dba', '--users', '--passwords', '--privileges', '--roles', '--dbs', '--tables', '--columns', '--schema', '--count', '--dump', '--comments' | Enumerate users, password hashes, privileges, roles, databases, tables and columns |
--level=5 | Increase the probability of a false positive identification |
--risk=3 | Increase the probability of extracting data |
--random-agent | Select the User-Agent randomly |
--batch | Never ask for user input, use the default behavior |
--answers="follow=Y" | Predefined answers to yes |
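Assembled from the table above, the SQLMAP invocation might look like the sketch below. The target URL is hypothetical; the flags are the ones listed in the table, and the command is only built as a string here, not executed.

```python
# Sketch of the SQLMAP command line described in the table above.
# The victim URL is a placeholder; the flags come from the table.
enumeration_flags = [
    "--banner", "--current-user", "--current-db", "--hostname",
    "--is-dba", "--users", "--passwords", "--privileges", "--roles",
    "--dbs", "--tables", "--columns", "--schema", "--count",
    "--dump", "--comments",
]
cmd = (["sqlmap", "-u", "http://victim.example/form.php?id=1"]
       + enumeration_flags
       + ["--level=5", "--risk=3", "--random-agent", "--batch",
          "--answers=follow=Y"])
print(" ".join(cmd))
```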
Every node executed SQLIA against 200 victim nodes. Each victim node deployed a web form vulnerable to union-type injection attacks, connected to either a MySQL or a SQL Server database engine (50% of the victim nodes deployed MySQL and the other 50% SQL Server).
The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24.
The malicious traffic in the two datasets was collected under different conditions. For D1, SQLIA was performed using union attacks on the MySQL and SQL Server databases.
For D2, blind SQLIA was performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1: in D2, the IP address space was 152.148.48.1/24 for the benign and malicious traffic-generating nodes and 140.30.20.1/24 for the victim nodes.
The MySQL server was MariaDB version 10.4.12; Microsoft SQL Server 2017 Express and PostgreSQL version 13 were also used.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
synthetic_text_to_sql
gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:
105,851 records partitioned into 100,000 train and 5,851 test records
~23M total tokens, including ~12M SQL tokens
Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
https://www.archivemarketresearch.com/privacy-policy
The global database testing tool market is anticipated to experience substantial growth in the coming years, driven by factors such as the increasing adoption of cloud-based technologies, the rising demand for data quality and accuracy, and the growing complexity of database systems. The market is expected to reach a value of USD 1,542.4 million by 2033, expanding at a CAGR of 7.5% during the forecast period of 2023-2033. Key players in the market include Apache JMeter, DbFit, SQLMap, Mockup Data, SQL Test, NoSQLUnit, Orion, ApexSQL, QuerySurge, DBUnit, DataFactory, DTM Data Generator, Oracle, SeLite, SLOB, and others. The North American region is anticipated to hold a significant share of the database testing tool market, followed by Europe and Asia Pacific. The increasing adoption of cloud-based database testing services, the presence of key market players, and the growing demand for data testing and validation are driving the market growth in North America. Asia Pacific, on the other hand, is expected to experience the highest growth rate due to the rapidly increasing IT spending, the emergence of new technologies, and the growing number of businesses investing in data quality management solutions.
Objective: To enhance the accuracy of information retrieval from pharmacovigilance (PV) databases by employing Large Language Models (LLMs) to convert natural language queries (NLQs) into Structured Query Language (SQL) queries, leveraging a business context document.
Materials and Methods: We utilized OpenAI's GPT-4 model within a retrieval-augmented generation (RAG) framework, enriched with a business context document, to transform NLQs into executable SQL queries. Each NLQ was presented to the LLM randomly and independently to prevent memorization. The study was conducted in three phases, varying query complexity, and assessing the LLM's performance both with and without the business context document.
Results: Our approach significantly improved NLQ-to-SQL accuracy, increasing from 8.3% with the database schema alone to 78.3% with the business context document. This enhancement was consistent across low, medium, and high complexity queries, indicating the critical role of contextual …
This deposit contains the test set of NLQs used in the paper "Automating Pharmacovigilance Evidence Generation: Using Large Language Models to Produce Context-Aware SQL", together with the Python scripts for the LLM processing, the R code for the statistical analysis of results, and a copy of the business context document and essential tables.
# Automating Pharmacovigilance Evidence Generation: Using Large Language Models to Produce Context-Aware SQL
https://doi.org/10.5061/dryad.2280gb63n
NLQ_Queries.xls contains the set of test NLQs along with the LLM responses in each phase of the experiment, together with the complexity scores computed for each NLQ.
The business context document is supplied as a PDF, together with the Python and R code used to generate our results. The essential tables used in Phase 2 and 3 of the experiment are included in the text file.
Description: Contains all NLQ queries with the results of the LLM output and the pass/fail status of each.
Column Definitions:
Below are the column names in order with a detailed description.
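The schema-plus-context prompting described above can be sketched as follows: the database schema and the business context document are prepended to the natural-language question before the prompt is sent to the LLM. The prompt wording, function name, and example schema below are illustrative, not the exact code released with the paper.

```python
# Hedged sketch of NLQ-to-SQL prompting with a business context
# document, in the spirit of the study described above. All names
# and wording here are hypothetical.

def build_prompt(schema: str, business_context: str, nlq: str) -> str:
    return (
        "You translate natural-language questions into SQL.\n\n"
        f"Database schema:\n{schema}\n\n"
        f"Business context:\n{business_context}\n\n"
        f"Question: {nlq}\nSQL:"
    )

prompt = build_prompt(
    "CREATE TABLE case_report (case_id INT, drug TEXT, event TEXT);",
    "A 'case' is one adverse-event report; count distinct case_id values.",
    "How many cases mention headache?",
)
print(prompt)
```

The resulting string would then be passed to the LLM; the study found that including the business context section raised NLQ-to-SQL accuracy substantially.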
http://www.apache.org/licenses/LICENSE-2.0
This dataset contains 56,964 conversations between an AI assistant and a (fake, generated) "Human" in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.
Data Instances
{'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},
Data Fields
id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]
input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent. and has at least one occurrence of both [|AI|] and [|Human|].
topic: the topic description
Dataset Creation
Both the conversations and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.
The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.
Here are the requirements that you should adhere to:
1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other;
2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places;
3. ensure accurate translation and keep the correctness of the conversation;
4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
5. translate the human's text using informal, but standard, language;
6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
The prompt to translate the topic is:
TOPIC_TRANSLATION_PROMPT = (
    "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct, "
    "summarizing manner. Translate accurately and formally. Do not provide any explanation "
    "about the translation and do not include the original title."
)
The system message was:
You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
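Putting the pieces above together, each translation call could be assembled as in the sketch below. The model name, parameters, and system message are the ones stated; the request-building function and its template argument are illustrative, and the actual API call is left commented out so the sketch stays self-contained.

```python
# Minimal sketch of assembling one translation request from the
# system message, prompt template, and parameters described above.
# build_request and its arguments are hypothetical helper names.
SYSTEM = ("You are a helpful assistant that translates English to Dutch "
          "to the requirements that are given to you.")

def build_request(prompt_template: str, conversation: str,
                  src_lang: str = "English", tgt_lang: str = "Dutch") -> dict:
    user = prompt_template.format(src_lang=src_lang, tgt_lang=tgt_lang) + conversation
    return {
        "model": "gpt-3.5-turbo",
        "max_tokens": 1024,
        "temperature": 0,
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user},
        ],
    }

req = build_request("Translate from {src_lang} to {tgt_lang}:\n", "[|Human|] Hi!")
# client.chat.completions.create(**req)  # actual API call, omitted here
```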
Note that 82 items (0.1%) were not successfully translated: in each case the translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs of these missing items are the ones listed above under the id field.
The translation quality has not been verified. Use at your own risk!
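When loading the data, the 82 untranslated items can be skipped by filtering on the id field, as in this short sketch (only a few of the missing IDs are shown; the full list is given above).

```python
# Sketch: drop the untranslated items by id. Only a handful of the
# 82 missing ids are shown here; the full list appears above.
MISSING_IDS = {1713, 1937, 1960, 4326}  # ... plus the remaining ids

records = [{"id": 9, "input": "...", "topic": "..."},
           {"id": 1713, "input": "...", "topic": "..."}]
kept = [r for r in records if r["id"] not in MISSING_IDS]
print([r["id"] for r in kept])  # [9]
```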
Licensing Information
Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.
This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.
https://dataintelo.com/privacy-and-policy
The Non-Relational SQL market size is projected to grow from USD 4.7 billion in 2023 to USD 15.8 billion by 2032, at a compound annual growth rate (CAGR) of 14.5% during the forecast period. This significant growth can be attributed to the rising demand for scalable and flexible database management solutions that efficiently handle large volumes of unstructured data.
One of the primary growth factors driving the Non-Relational SQL market is the exponential increase in data generation from various sources such as social media, IoT devices, and enterprise applications. As businesses seek to leverage this data for gaining insights and making informed decisions, the need for databases that can manage and process unstructured data efficiently has become paramount. Non-Relational SQL databases, such as document stores and graph databases, provide the required flexibility and scalability, making them an ideal choice for modern data-driven enterprises.
Another significant growth factor is the increasing adoption of cloud-based solutions. Cloud deployment offers numerous advantages, including reduced infrastructure costs, scalability, and easier management. These benefits have led to a surge in the adoption of Non-Relational SQL databases hosted on cloud platforms. Major cloud service providers like Amazon Web Services, Microsoft Azure, and Google Cloud offer robust Non-Relational SQL database services, further fueling market growth. Additionally, the integration of AI and machine learning with Non-Relational SQL databases is expected to enhance their capabilities, driving further adoption.
The rapid advancement in technology and the growing need for real-time data processing and analytics are also propelling the market's growth. Non-Relational SQL databases are designed to handle high-velocity data and provide quick query responses, making them suitable for real-time applications such as fraud detection, recommendation engines, and personalized marketing. As organizations increasingly rely on real-time data to enhance customer experiences and optimize operations, the demand for Non-Relational SQL databases is set to rise.
Regional outlook indicates that North America holds the largest share of the Non-Relational SQL market, driven by the presence of major technology companies and early adoption of advanced database technologies. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period, fueled by the rapid digital transformation initiatives and increasing investments in cloud infrastructure. Europe and Latin America also present significant growth opportunities due to the rising adoption of big data and analytics solutions.
When analyzing the Non-Relational SQL market by database type, we observe that document stores hold a significant share of the market. Document stores, such as MongoDB and Couchbase, are particularly favored for their ability to store, retrieve, and manage document-oriented information. These databases are highly flexible, allowing for the storage of complex data structures and providing an intuitive query language. The increasing adoption of document stores can be ascribed to their ease of use and adaptability to various application requirements, making them a popular choice among developers and businesses.
Key-Value stores represent another crucial segment of the Non-Relational SQL market. These databases are known for their simplicity and high performance, making them ideal for caching, session management, and real-time data processing applications. Redis and Amazon DynamoDB are prominent examples of key-value stores that have gained widespread acceptance. The growing need for low-latency data access and the ability to handle massive volumes of data efficiently are key drivers for the adoption of key-value stores in various industries.
The market for column stores is also expanding as businesses require databases that can handle large-scale analytical queries efficiently. Wide-column stores such as Apache Cassandra and HBase optimize read and write performance for analytical processing, making them suitable for big data analytics and business intelligence applications. The ability to perform complex queries on large datasets quickly is a significant advantage of column stores, driving their adoption in industries that rely heavily on data analytics.
Graph databases, such as Neo4j and Amazon Neptune, are gaining traction due to their ability to model complex relationships between data entities.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset builds from WikiSQL and Spider. There are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and SQL Query answering the question using the CREATE statement as context. This dataset was built with text-to-sql LLMs in mind, intending to prevent hallucination of column and table names often seen when trained on text-to-sql datasets. The CREATE TABLE statement can often be copy and pasted from different DBMS and provides table names, column… See the full description on the dataset page: https://huggingface.co/datasets/b-mc2/sql-create-context.
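A record from this dataset can be turned into a grounded prompt by placing the CREATE TABLE statement before the question. The record fields below (question, context, answer) follow the published dataset schema, but the prompt wording is our own illustration, not a format prescribed by the dataset.

```python
# Illustrative use of the CREATE TABLE statement as grounding context,
# in the spirit described above. The prompt wording is hypothetical.
record = {
    "question": "How many heads of the departments are older than 56?",
    "context": "CREATE TABLE head (age INTEGER)",
    "answer": "SELECT COUNT(*) FROM head WHERE age > 56",
}

prompt = (f"Given the schema:\n{record['context']}\n\n"
          f"Write SQL for: {record['question']}\nSQL:")
print(prompt)
```

Because the table and column names are visible in the context, a model trained this way has less room to hallucinate identifiers that do not exist in the schema.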
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
It contains the following files:
- spider-realistic.json
# The spider-realistic evaluation set
# Examples: 508
# Databases: 19
- dev.json
# The original dev split of Spider
# Examples: 1034
# Databases: 20
- tables.json
# The original DB schemas from Spider
# Databases: 166
- README.txt
- license
The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
This dataset is distributed under the CC BY-SA 4.0 license.
If you use the dataset, please cite the following papers including the original Spider datasets, Finegan-Dollak et al., 2018 and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.
@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}
@inproceedings{Yu&al.18c,
year = 2018,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
booktitle = {EMNLP},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
}
@InProceedings{P18-1033,
author = "Finegan-Dollak, Catherine
and Kummerfeld, Jonathan K.
and Zhang, Li
and Ramanathan, Karthik
and Sadasivam, Sesh
and Zhang, Rui
and Radev, Dragomir",
title = "Improving Text-to-SQL Evaluation Methodology",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "351--360",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1033"
}
@InProceedings{data-sql-imdb-yelp,
dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
title = {SQLizer: Query Synthesis from Natural Language},
booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
month = {October},
year = {2017},
pages = {63:1--63:26},
url = {http://doi.org/10.1145/3133887},
}
@article{data-academic,
dataset = {Academic},
author = {Fei Li and H. V. Jagadish},
title = {Constructing an Interactive Natural Language Interface for Relational Databases},
journal = {Proceedings of the VLDB Endowment},
volume = {8},
number = {1},
month = {September},
year = {2014},
pages = {73--84},
url = {http://dx.doi.org/10.14778/2735461.2735468},
}
@InProceedings{data-atis-geography-scholar,
dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
title = {Learning a Neural Semantic Parser from User Feedback},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2017},
pages = {963--973},
location = {Vancouver, Canada},
url = {http://www.aclweb.org/anthology/P17-1089},
}
@inproceedings{data-geography-original,
dataset = {Geography, original},
author = {John M. Zelle and Raymond J. Mooney},
title = {Learning to Parse Database Queries Using Inductive Logic Programming},
booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
year = {1996},
pages = {1050--1055},
location = {Portland, Oregon},
url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}
@inproceedings{data-restaurants-logic,
author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
year = {2000},
pages = {133--141},
location = {Hong Kong, China},
url = {http://www.aclweb.org/anthology/W00-1317},
}
@inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
title = {Towards a Theory of Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
year = {2003},
location = {Miami, Florida, USA},
pages = {149--157},
url = {http://doi.acm.org/10.1145/604045.604070},
}
@inproceedings{data-restaurants,
author = {Alessandra Giordani and Alessandro Moschitti},
title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
year = {2012},
location = {Montpellier, France},
pages = {59--76},
url = {https://doi.org/10.1007/978-3-642-45260-4_5},
}
ckanext-sql

Due to the absence of a README file in the provided GitHub repository for ckanext-sql, a comprehensive understanding of its features, integration, and benefits is not available. An extension named 'sql' would typically bridge CKAN with SQL databases, potentially enabling users to query and interact with datasets stored in SQL-compatible databases directly from within CKAN. Lacking specific documentation, however, definitive claims about its capabilities cannot be made.

Potential key features (based on the name and typical use cases):
* SQL Query Interface: the extension might offer an interface within CKAN to run SQL queries against linked datasets.
* Data Visualization from SQL: it could allow generating visualizations directly from data retrieved via SQL queries.
* SQL Data Import: it could provide functionality to import data from SQL databases into CKAN datasets.
* Federated Queries: it may implement federated queries across datasets stored as CKAN resources and external databases.
* SQL Data Export: it may offer the ability to export CKAN data to a SQL database.
* SQL-based resource views: it could add resource views that display data retrieved via SQL.

Potential use cases (based on the name):
1. Direct data analysis: data analysts might use this to query and analyze data stored in SQL databases via CKAN, skipping manual data import.
2. Database integration: organizations that already maintain large databases could use this extension to provide easier access to that data through a CKAN portal.

Technical integration (hypothetical): given the name, the 'sql' extension likely integrates with CKAN by adding new API endpoints or UI elements that allow users to specify SQL connections and queries. It would probably require configuration settings to define database connection parameters. It might also integrate with CKAN's resource view system, enabling custom visualizations.

Potential benefits and impact (speculative): if the extension functions as its name suggests, it would offer direct access to SQL data within the CKAN environment, reduce the need for data duplication (by querying directly rather than importing), and potentially enhance data analysis and visualization capabilities. Due to the lack of a README.md, however, this analysis remains theoretical.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplement to: Enhancing Resource-based Test Case Generation For RESTful APIs with SQL Handling
In this repository, we provide the tests and their coverage reports (generated with IntelliJ's coverage tooling) that are used in the Discussion section of the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains resources that describe the DLR information model, database design and standard queries.
https://dataintelo.com/privacy-and-policy
The global database market size was valued at approximately USD 67 billion in 2023 and is projected to reach USD 138 billion by 2032, growing at a compound annual growth rate (CAGR) of 8.3%. The market is poised for significant growth due to the increasing demand for data storage solutions and the rapid digital transformation across various industries. As businesses continue to generate massive volumes of data, the need for efficient and scalable database solutions is becoming more critical than ever. This growth is further propelled by advancements in cloud computing and the increasing adoption of artificial intelligence and machine learning technologies, which require robust database management systems to handle complex data sets.
One of the primary growth factors for the database market is the exponential increase in data generation from various sources, including social media, IoT devices, and enterprise applications. As organizations strive to leverage data for competitive advantage, the demand for sophisticated database technologies that can manage, process, and analyze large volumes of data is on the rise. These technologies enable businesses to gain actionable insights, improve decision-making, and enhance customer experiences. Additionally, the proliferation of connected devices and the Internet of Things (IoT) are contributing to the surge in data volume, necessitating the deployment of advanced database systems to handle the influx of information efficiently.
The cloud computing revolution is another significant growth driver for the database market. With the increasing adoption of cloud-based services, organizations are shifting from traditional on-premises database solutions to cloud-based database management systems. This transition is driven by the need for scalability, flexibility, and cost-effectiveness, as cloud solutions offer the ability to scale resources up or down based on demand. Cloud databases also provide enhanced data security, disaster recovery, and backup solutions, making them an attractive option for businesses of all sizes. Moreover, cloud service providers continuously innovate by offering managed database services, reducing the burden on IT departments and allowing organizations to focus on core business activities.
The rise of artificial intelligence (AI) and machine learning (ML) technologies is also playing a crucial role in shaping the future of the database market. These technologies require robust and dynamic database systems capable of handling complex algorithms and large data sets. Databases optimized for AI and ML applications enable organizations to harness the power of predictive analytics, automation, and data-driven decision-making. The integration of AI and ML with database systems enhances the ability to identify patterns, detect anomalies, and predict future trends, further driving the demand for advanced database solutions.
From a regional perspective, North America is expected to dominate the database market, owing to the presence of established technology companies and the rapid adoption of advanced technologies. The region's mature IT infrastructure and the increasing need for data-driven insights in various industries contribute to the market's growth. Asia Pacific is anticipated to witness the highest growth rate during the forecast period, driven by the increasing digitization efforts, rising internet penetration, and the growing popularity of cloud-based solutions. Europe is also expected to experience significant growth due to the expanding IT sector and the increasing adoption of data analytics solutions across industries.
The database market can be segmented by type into relational, non-relational, cloud, and others. Relational databases are among the oldest and most established types of database systems, widely used across industries due to their ability to handle structured data efficiently. These databases rely on structured query language (SQL) for managing and manipulating data, making them suitable for applications that require complex querying and transaction processing. Despite their maturity, relational databases continue to evolve, with advancements such as NewSQL and distributed SQL databases enhancing their scalability and performance for modern applications.
Non-relational databases, also known as NoSQL databases, have gained popularity in recent years due to their flexibility and ability to handle unstructured data. These databases are designed to accommodate a diverse range of data types, making them ideal for applications involving large volumes of varied, rapidly changing data.
https://creativecommons.org/publicdomain/zero/1.0/
Northwind Database
The Northwind database is a sample database originally created by Microsoft and used as the basis for its tutorials in a variety of database products for decades. The Northwind database contains the sales data of a fictitious company called "Northwind Traders", which imports and exports specialty foods from around the world. Northwind is an excellent tutorial schema for a small-business ERP, with customers, orders, inventory, purchasing, suppliers, shipping, employees, and single-entry accounting. The Northwind database has since been ported to a variety of non-Microsoft databases, including PostgreSQL.
The Northwind dataset includes sample data for the following.
Northwind E-R diagram: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F13411583%2Fa52a5bbc3d8842abfdfcfe608b7a8d25%2FNorthwind_E-R_Diagram.png?generation=1718785485874540&alt=media
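As a small worked example of the tutorial schema described above, the sketch below builds two Northwind-style tables in an in-memory SQLite database and runs a typical orders-per-customer query. The table and column names follow the commonly published Northwind schema; the real database has many more tables and columns.

```python
import sqlite3

# Minimal in-memory sketch of two Northwind-style tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID TEXT PRIMARY KEY, CompanyName TEXT);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID TEXT,
                     OrderDate TEXT,
                     FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID));
INSERT INTO Customers VALUES ('ALFKI', 'Alfreds Futterkiste');
INSERT INTO Orders VALUES (10643, 'ALFKI', '1997-08-25');
""")

# A typical tutorial query: number of orders per customer.
rows = conn.execute("""
    SELECT c.CompanyName, COUNT(o.OrderID) AS order_count
    FROM Customers c LEFT JOIN Orders o ON o.CustomerID = c.CustomerID
    GROUP BY c.CustomerID
""").fetchall()
print(rows)  # [('Alfreds Futterkiste', 1)]
```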
Chinook Database
Chinook is a sample database available for SQL Server, Oracle, MySQL, and others. It can be created by running a single SQL script. The Chinook database is an alternative to the Northwind database, and is ideal for demos and for testing ORM tools against single or multiple database servers.
The Chinook data model represents a digital media store, including tables for artists, albums, media tracks, invoices, and customers.
The media-related data was created using real data from an iTunes library. Customer and employee information was created manually using fictitious names, addresses that can be located on Google Maps, and other well-formatted data (phone, fax, email, etc.). Sales information is auto-generated using random data over a four-year period.
Why the name Chinook? The name of this sample database was based on the Northwind database. Chinooks are winds in the interior West of North America, where the Canadian Prairies and Great Plains meet several mountain ranges. Chinooks are most prevalent over southern Alberta in Canada. Chinook is a fitting name for a database intended as an alternative to Northwind.
Chinook database diagram: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F13411583%2Fd856e0358e3a572d50f1aba5e171c1c6%2FChinook%20DataBase.png?generation=1718785749657445&alt=media
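The "single SQL script" creation described above can be illustrated in miniature: the real Chinook script is much larger, but the mechanism is the same, one `executescript()` call that builds the whole database. Table and column names follow the published Chinook schema; the data here is a two-row excerpt for illustration.

```python
import sqlite3

# A tiny excerpt standing in for the full Chinook creation script.
CHINOOK_SNIPPET = """
CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Album  (AlbumId INTEGER PRIMARY KEY, Title TEXT,
                     ArtistId INTEGER REFERENCES Artist(ArtistId));
INSERT INTO Artist VALUES (1, 'AC/DC');
INSERT INTO Album  VALUES (1, 'For Those About To Rock We Salute You', 1);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(CHINOOK_SNIPPET)  # one call builds the whole database

title, artist = conn.execute("""
    SELECT al.Title, ar.Name
    FROM Album al JOIN Artist ar ON ar.ArtistId = al.ArtistId
""").fetchone()
print(artist, "-", title)
```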
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
NSText2SQL is the dataset used to train the NSQL models. The data is curated from more than 20 different public sources across the web with permissible licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques, including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 text-to-SQL pairs.
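The "instruction generation" step mentioned above can be sketched as follows: each (schema, question, SQL) triple is rendered into a single instruction string that a generative model can be trained on. The prompt template and field names here are illustrative assumptions, not the dataset's actual format.

```python
# Hypothetical instruction template; NSText2SQL's real template may differ.
def to_instruction(schema: str, question: str, sql: str) -> str:
    return (
        "Using the schema below, write a SQL query that answers the question.\n"
        f"Schema:\n{schema}\n"
        f"Question: {question}\n"
        f"SQL: {sql}"
    )

example = to_instruction(
    "CREATE TABLE head (age INTEGER)",
    "How many heads of departments are older than 56?",
    "SELECT COUNT(*) FROM head WHERE age > 56",
)
print(example)
```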
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset consists of 8,034 entries designed to evaluate the performance of text-to-SQL models. Each entry contains a natural language text query and its corresponding SQL command. The dataset is a subset derived from the Spider dataset, focusing on diverse and complex queries to challenge the understanding and generation capabilities of machine learning models.
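One simple way to evaluate a model on such pairs is normalized exact-match accuracy, sketched below. This is a strict metric offered only as an illustration; the official Spider evaluation instead uses component matching and execution accuracy.

```python
import re

def normalize(sql: str) -> str:
    # Lowercase and collapse whitespace so formatting differences don't count.
    return re.sub(r"\s+", " ", sql.strip().lower())

def exact_match_accuracy(preds, golds):
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return hits / len(golds)

# Illustrative predictions and gold queries (not taken from the dataset).
preds = ["SELECT name FROM singer  WHERE age > 30", "SELECT count(*) FROM concert"]
golds = ["select name from singer where age > 30", "SELECT avg(age) FROM singer"]
print(exact_match_accuracy(preds, golds))  # 0.5
```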
https://dataintelo.com/privacy-and-policy
The global database servers market size is projected to grow from USD 14.8 billion in 2023 to USD 27.6 billion by 2032, exhibiting a compound annual growth rate (CAGR) of 7.2% during the forecast period. The primary growth factors contributing to this market include the ever-increasing volume of data generated across various sectors, the proliferation of cloud computing, and the growing need for effective data management solutions.
A significant driving force behind the growth of the database servers market is the exponential increase in data generation. With the advent of the Internet of Things (IoT), social media, e-commerce, and other digital platforms, the volume of data being generated has reached unprecedented levels. Organizations across different sectors are seeking robust database servers to manage, store, and analyze this data effectively. This surge in data generation is leading to a higher demand for scalable and efficient database servers, thereby propelling market growth.
Another crucial growth factor is the widespread adoption of cloud computing technologies. As businesses and enterprises transition from traditional on-premises infrastructure to cloud-based solutions, the demand for cloud database servers is on the rise. Cloud database servers offer several advantages, including cost efficiency, scalability, and flexibility, which make them an attractive option for organizations. The growing preference for cloud deployment models is further bolstering the database servers market, as more companies leverage cloud-based solutions to enhance their data management capabilities.
Additionally, the ongoing advancements in database server technologies are driving market growth. Innovations such as in-memory databases, NoSQL databases, and NewSQL databases are revolutionizing the way data is stored, managed, and retrieved. These advancements are enabling organizations to achieve faster data processing, improved performance, and enhanced reliability. As a result, businesses are increasingly adopting these advanced database server technologies to gain a competitive edge, thereby fueling the market's expansion.
The regional outlook for the database servers market reveals significant growth potential across various regions. North America, being a hub for technological advancements and home to several leading database server providers, is expected to dominate the market. The Asia Pacific region is also witnessing substantial growth, driven by the rapid digital transformation, increasing adoption of cloud computing, and the rising number of small and medium enterprises (SMEs). Europe, Latin America, and the Middle East & Africa are also anticipated to experience notable growth, supported by the increasing investments in IT infrastructure and the growing need for efficient data management solutions.
The database servers market is segmented into two primary types: relational database servers and non-relational database servers. Relational database servers, also known as SQL databases, have been the traditional choice for data storage and management. They use structured query language (SQL) to define and manipulate data, making them highly suitable for applications that require complex queries and transactions. The widespread adoption of relational database servers can be attributed to their robustness, reliability, and ability to handle structured data effectively. Many enterprises continue to rely on relational databases for mission-critical applications, ensuring steady growth in this segment.
On the other hand, non-relational database servers, also known as NoSQL databases, have gained significant traction in recent years. NoSQL databases are designed to handle unstructured and semi-structured data, offering flexibility and scalability that traditional relational databases often lack. With the rise of big data, IoT, and real-time analytics, non-relational databases have become increasingly popular. They are particularly well-suited for applications that require high-speed data processing, distributed computing, and horizontal scalability. As organizations strive to manage diverse and dynamic data types, the demand for non-relational database servers is expected to grow substantially.
The choice between relational and non-relational database servers depends on various factors, including the nature of the data, the complexity of queries, and the specific requirements of the application. While relational databases excel at handling structured data and complex transactions, non-relational databases offer the flexibility and horizontal scalability needed for unstructured and rapidly evolving workloads.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Summary
The OGText2SQL dataset was used to train the OGSQL model; it comprises over 350,000 rows of text-to-SQL pairs. Through a series of data-refinement steps, including schema expansion, SQL refinement, and instruction generation using existing language models (LLMs), the dataset was carefully processed to ensure quality and relevance.
How to use it
Python
from datasets import load_dataset
dataset = load_dataset("OneGate/OGText2SQL")
API… See the full description on the dataset page: https://huggingface.co/datasets/OneGate/OGText2SQL.
https://choosealicense.com/licenses/bsd-3-clause/
WikiSQL Dataset (Reformatted for Generative Models)
This is the exact same dataset as WikiSQL: https://huggingface.co/datasets/wikisql, but with the data reformatted to allow direct use with text generation LLMs. The original license and credits for the original dataset remain in place. Specifically, the changes from standard WikiSQL are:
The table details in WikiSQL were included as dictionaries but tools like LangChain and LlamaIndex build their prompts using a SQL DESCRIBE of… See the full description on the dataset page: https://huggingface.co/datasets/tjaffri/wikisql-generate.
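The reformatting described above, turning WikiSQL's table dictionaries into the DESCRIBE-style textual schema that prompt-building tools expect, can be sketched as below. The `header`/`types` field names mirror WikiSQL's published format, but the function itself is an illustrative assumption, not the wikisql-generate loader's actual code.

```python
# Hypothetical converter from a WikiSQL-style table dict to a
# CREATE TABLE-style schema string suitable for an LLM prompt.
def table_dict_to_schema(table: dict) -> str:
    cols = ", ".join(
        f'"{name}" {typ.upper()}'
        for name, typ in zip(table["header"], table["types"])
    )
    return f'CREATE TABLE "{table["name"]}" ({cols})'

table = {
    "name": "players",
    "header": ["Player", "No.", "Position"],
    "types": ["text", "real", "text"],
}
print(table_dict_to_schema(table))
# CREATE TABLE "players" ("Player" TEXT, "No." REAL, "Position" TEXT)
```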