100+ datasets found
  1. Search-Based Test Data Generation for SQL Queries: Appendix

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jeroen Castelein; Maurício Aniche; Maurício Aniche; Mozhan Soltani; Annibale Panichella; Arie van Deursen; Jeroen Castelein; Mozhan Soltani; Annibale Panichella; Arie van Deursen (2020). Search-Based Test Data Generation for SQL Queries: Appendix [Dataset]. http://doi.org/10.5281/zenodo.1166023
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Jeroen Castelein; Maurício Aniche; Maurício Aniche; Mozhan Soltani; Annibale Panichella; Arie van Deursen; Jeroen Castelein; Mozhan Soltani; Annibale Panichella; Arie van Deursen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The appendix of our ICSE 2018 paper "Search-Based Test Data Generation for SQL Queries: Appendix".

    The appendix contains:

    • The queries from the three open source systems we used in the evaluation of our tool (the industry software system is not part of this appendix, due to privacy reasons)
    • The results of our evaluation.
    • The source code of the tool. Most recent version can be found at https://github.com/SERG-Delft/evosql.
    • The results of the tuning procedure we conducted before running the final evaluation.
  2. Z

    Data from: SQL Injection Attack Netflow

    • data.niaid.nih.gov
    • portalcienciaytecnologia.jcyl.es
    • +3more
    Updated Sep 28, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ignacio Crespo; Adrián Campazas (2022). SQL Injection Attack Netflow [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6907251
    Explore at:
    Dataset updated
    Sep 28, 2022
    Authors
    Ignacio Crespo; Adrián Campazas
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This datasets have SQL injection attacks (SLQIA) as malicious Netflow data. The attacks carried out are SQL injection for Union Query and Blind SQL injection. To perform the attacks, the SQLMAP tool has been used.

    NetFlow traffic has generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for the collection and monitoring of network traffic flow data generated. A flow is defined as a unidirectional sequence of packets with some common properties that pass through a network device.

    Datasets

    The firts dataset was colleted to train the detection models (D1) and other collected using different attacks than those used in training to test the models and ensure their generalization (D2).

    The datasets contain both benign and malicious traffic. All collected datasets are balanced.

    The version of NetFlow used to build the datasets is 5.

        Dataset
        Aim
        Samples
        Benign-malicious
        traffic ratio
    
    
    
    
        D1
        Training
        400,003
        50%
    
    
        D2
        Test
        57,239
        50%
    

    Infrastructure and implementation

    Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection. It allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator that has a sensor ipt_netflow installed. The sensor consists of a module for the Linux kernel using Iptables, which processes the packets and converts them to NetFlow flows.

    DOROTHEA is configured to use Netflow V5 and export the flow after it is inactive for 15 seconds or after the flow is active for 1800 seconds (30 minutes)

    Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. Such tasks run as Python scripts. Users may customize them or even incorporate their own. The network traffic is managed by a gateway that performs two main tasks. On the one hand, it routes packets to the Internet. On the other hand, it sends it to a NetFlow data generation node (this process is carried out similarly to packets received from the Internet).

    The malicious traffic collected (SQLI attacks) was performed using SQLMAP. SQLMAP is a penetration tool used to automate the process of detecting and exploiting SQL injection vulnerabilities.

    The attacks were executed on 16 nodes and launch SQLMAP with the parameters of the following table.

        Parameters
        Description
    
    
    
    
        '--banner','--current-user','--current-db','--hostname','--is-dba','--users','--passwords','--privileges','--roles','--dbs','--tables','--columns','--schema','--count','--dump','--comments', --schema'
        Enumerate users, password hashes, privileges, roles, databases, tables and columns
    
    
        --level=5
        Increase the probability of a false positive identification
    
    
        --risk=3
        Increase the probability of extracting data
    
    
        --random-agent
        Select the User-Agent randomly
    
    
        --batch
        Never ask for user input, use the default behavior
    
    
        --answers="follow=Y"
        Predefined answers to yes
    

    Every node executed SQLIA on 200 victim nodes. The victim nodes had deployed a web form vulnerable to Union-type injection attacks, which was connected to the MYSQL or SQLServer database engines (50% of the victim nodes deployed MySQL and the other 50% deployed SQLServer).

    The web service was accessible from ports 443 and 80, which are the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes. For victim nodes, the address space was 126.52.30.0/24. The malicious traffic in the test sets was collected under different conditions. For D1, SQLIA was performed using Union attacks on the MySQL and SQLServer databases.

    However, for D2, BlindSQL SQLIAs were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks were also different from those of D1. In D2, the IP address space was 152.148.48.1/24 for benign and malicious traffic generating nodes and 140.30.20.1/24 for victim nodes.

    To run the MySQL server we ran MariaDB version 10.4.12. Microsoft SQL Server 2017 Express and PostgreSQL version 13 were used.

  3. Data from: Text to SQL dataset

    • kaggle.com
    Updated Jul 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mohammad Nour Alawad (2024). Text to SQL dataset [Dataset]. https://www.kaggle.com/datasets/mohammadnouralawad/spider-text-sql
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 21, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Mohammad Nour Alawad
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset consists of 8,034 entries designed to evaluate the performance of text-to-SQL models. Each entry contains a natural language text query and its corresponding SQL command. The dataset is a subset derived from the Spider dataset, focusing on diverse and complex queries to challenge the understanding and generation capabilities of machine learning models.

  4. WikiSQL (Questions and SQL Queries)

    • kaggle.com
    zip
    Updated Nov 25, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2022). WikiSQL (Questions and SQL Queries) [Dataset]. https://www.kaggle.com/datasets/thedevastator/dataset-for-developing-natural-language-interfac
    Explore at:
    zip(21491264 bytes)Available download formats
    Dataset updated
    Nov 25, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    WikiSQL (Questions and SQL Queries)

    80654 hand-annotated questions and SQL queries on 24241 Wikipedia tables

    By Huggingface Hub [source]

    About this dataset

    A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset can be used to develop natural language interfaces for relational databases. The data fields are the same among all splits, and the file contains information on the phase, question, table, and SQL for each interface

    Research Ideas

    • This dataset can be used to develop natural language interfaces for relational databases.
    • This dataset can be used to develop a knowledge base of common SQL queries.
    • This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv | Column name | Description | |:--------------|:---------------------------------------------------------| | phase | The phase of the data collection. (String) | | question | The question asked by the user. (String) | | table | The table containing the data for the question. (String) | | sql | The SQL query corresponding to the question. (String) |

    File: train.csv | Column name | Description | |:--------------|:---------------------------------------------------------| | phase | The phase of the data collection. (String) | | question | The question asked by the user. (String) | | table | The table containing the data for the question. (String) | | sql | The SQL query corresponding to the question. (String) |

    File: test.csv | Column name | Description | |:--------------|:---------------------------------------------------------| | phase | The phase of the data collection. (String) | | question | The question asked by the user. (String) | | table | The table containing the data for the question. (String) | | sql | The SQL query corresponding to the question. (String) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.

  5. h

    Data from: text-to-sql

    • huggingface.co
    Updated Jan 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mir Mudasir (2025). text-to-sql [Dataset]. https://huggingface.co/datasets/Mudasir692/text-to-sql
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 20, 2025
    Authors
    Mir Mudasir
    Description

    This dataset is a merged collection of multiple text-to-SQL datasets, designed to provide a comprehensive resource for training and evaluating text-to-SQL models. It combines data from several popular benchmarks, including Spider, CoSQL, SparC, and others, to create a diverse and robust dataset for natural language to SQL query generation tasks. Dataset Details Dataset Description Curated by: Mudasir Ahmad Mir Language(s) (NLP): English License: Apache 2.0 This dataset is ideal for researchers… See the full description on the dataset page: https://huggingface.co/datasets/Mudasir692/text-to-sql.

  6. D

    SQL Generation AI Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dataintelo (2025). SQL Generation AI Market Research Report 2033 [Dataset]. https://dataintelo.com/report/sql-generation-ai-market
    Explore at:
    csv, pdf, pptxAvailable download formats
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    SQL Generation AI Market Outlook



    According to our latest research, the global SQL Generation AI market size reached USD 1.42 billion in 2024, reflecting a robust expansion driven by the rapid adoption of artificial intelligence technologies in database management and analytics. The market is set to grow at a compelling CAGR of 27.6% from 2025 to 2033, with the total market size forecasted to reach USD 13.18 billion by 2033. This remarkable growth trajectory is primarily fueled by advancements in natural language processing, the increasing complexity of enterprise data environments, and the demand for automation in SQL query generation to enhance productivity and reduce operational costs.




    The primary growth factors propelling the SQL Generation AI market revolve around the escalating need for data-driven decision-making and the democratization of data access across organizations. As enterprises generate and store vast amounts of data, the ability to quickly and accurately extract actionable insights becomes critical. SQL Generation AI solutions, leveraging advanced machine learning and natural language processing algorithms, enable non-technical users to generate complex SQL queries using simple natural language instructions. This not only reduces the dependency on specialized database administrators but also accelerates the pace of business intelligence and analytics initiatives. The proliferation of self-service analytics and the integration of AI-powered query generation into popular business intelligence platforms further amplify market growth, making it easier for organizations to unlock the value of their data assets.




    Another significant driver is the ongoing digital transformation across various industries, which has led to the modernization of legacy IT infrastructures and the adoption of cloud-based data management solutions. Organizations are increasingly migrating their databases to the cloud to benefit from scalability, flexibility, and cost-efficiency. SQL Generation AI tools are being integrated with cloud data warehouses and analytics platforms, allowing for seamless query generation and real-time data analysis. This shift not only optimizes data workflows but also supports hybrid and multi-cloud strategies, enabling enterprises to manage and analyze data across diverse environments. The rising volume and diversity of data, coupled with the need for real-time insights, are compelling organizations to invest in AI-powered SQL generation to maintain a competitive edge.




    Additionally, the COVID-19 pandemic has accelerated the adoption of digital technologies, including AI-driven SQL generation, as organizations seek to automate routine tasks and enhance operational resilience. The growing emphasis on remote work and distributed teams has highlighted the importance of intuitive data access and collaboration tools. SQL Generation AI solutions facilitate seamless collaboration between business users and data teams, bridging the gap between technical and non-technical stakeholders. This has led to increased demand across sectors such as BFSI, healthcare, retail, and manufacturing, where timely data insights are crucial for strategic decision-making. The market is also witnessing heightened interest from small and medium enterprises, which are leveraging AI-powered SQL generation to level the playing field with larger competitors.




    Regionally, North America continues to dominate the SQL Generation AI market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The presence of major technology vendors, early adoption of AI and cloud technologies, and a strong focus on data-driven innovation contribute to North America's leadership position. Europe is witnessing rapid growth, driven by stringent data regulations and increasing investments in digital transformation initiatives. Meanwhile, Asia Pacific is emerging as a high-growth region, fueled by expanding IT infrastructure, a burgeoning startup ecosystem, and rising demand for advanced analytics solutions in countries such as China, India, and Japan. Latin America and the Middle East & Africa are also showing promising growth potential as organizations in these regions accelerate their digital journeys.



    Component Analysis



    The SQL Generation AI market by component is broadly segmented into Software and Services. The software segment commands the majority market share, as organizations increasingly dep

  7. ⚙️ SQL Tutorial Exercise Data

    • kaggle.com
    zip
    Updated Oct 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    mexwell (2023). ⚙️ SQL Tutorial Exercise Data [Dataset]. https://www.kaggle.com/datasets/mexwell/sql-tutorial-exercise-data
    Explore at:
    zip(3701453 bytes)Available download formats
    Dataset updated
    Oct 2, 2023
    Authors
    mexwell
    Description

    This dataset was created to be the base of the data.world SQL tutorial exercises. Data was genererated using Synthea, a synthetic patient generator that models the medical history of synthetic patients. Their mission is to output high-quality synthetic, realistic but not real, patient data and associated health records covering every aspect of healthcare. The resulting data is free from cost, privacy, and security restrictions, enabling research with Health IT data that is otherwise legally or practically unavailable. De-identified real data still presents a challenge in the medical field because there are peopel who excel at re-identification of these data. For that reason the average medical center, etc. will not share their patient data. Most governmental data is at the hospital level. NHANES data is an exception.

    You can read Synthea's first academic paper here.

    Original Data

    Acknowlegement

    Foto von Rubaitul Azad auf Unsplash

  8. G

    Distributed SQL Database Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Growth Market Reports (2025). Distributed SQL Database Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/distributed-sql-database-market
    Explore at:
    csv, pptx, pdfAvailable download formats
    Dataset updated
    Aug 22, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Distributed SQL Database Market Outlook



    According to our latest research, the global distributed SQL database market size stands at USD 2.35 billion in 2024, reflecting robust adoption across various industries. The market is projected to expand at a CAGR of 23.1% from 2025 to 2033, reaching a forecasted value of USD 17.53 billion by 2033. This remarkable growth trajectory is primarily driven by the increasing need for high availability, scalability, and fault tolerance in data management systems, as organizations worldwide accelerate their digital transformation initiatives and seek to manage ever-growing volumes of transactional and analytical data.



    Several key growth factors underpin the rapid expansion of the distributed SQL database market. First, the proliferation of cloud-native applications and microservices architectures has created unprecedented demand for databases that can scale horizontally and ensure consistency across globally distributed environments. Enterprises are increasingly migrating their mission-critical workloads to distributed SQL databases to overcome the limitations of traditional relational databases, particularly in terms of scalability, latency, and resilience. Additionally, the rise of Internet of Things (IoT) devices, mobile applications, and real-time analytics has led to an exponential increase in data generation, further accentuating the need for distributed SQL solutions that can handle massive, geographically dispersed datasets with low latency and high reliability.



    Another significant driver is the growing emphasis on business continuity and disaster recovery. Distributed SQL databases inherently offer advanced replication, automated failover, and self-healing capabilities, ensuring minimal downtime and data loss even in the event of hardware failures or network outages. This is particularly critical for industries such as banking, financial services, and insurance (BFSI), healthcare, and e-commerce, where uninterrupted access to data is essential for operational efficiency, regulatory compliance, and customer satisfaction. The integration of artificial intelligence and machine learning with distributed SQL platforms is further enhancing their capabilities, enabling predictive analytics, anomaly detection, and automated optimization of database operations.



    From a regional perspective, North America currently dominates the distributed SQL database market, accounting for over 38% of the global revenue in 2024, followed closely by Europe and the Asia Pacific region. The strong presence of technology giants, a mature cloud ecosystem, and early adoption of advanced database technologies contribute to North America's leadership position. Meanwhile, Asia Pacific is emerging as the fastest-growing region, fueled by rapid digitalization, expanding internet penetration, and increasing investments in cloud infrastructure by governments and enterprises. Latin America and the Middle East & Africa are also witnessing steady growth, albeit from a smaller base, as organizations in these regions recognize the value of distributed SQL databases in supporting their digital transformation journeys.





    Component Analysis



    The distributed SQL database market is broadly segmented by component into software and services. The software segment comprises database engines, management tools, and related middleware that facilitate the deployment, operation, and optimization of distributed SQL databases. This segment currently accounts for the largest share of the market, owing to the critical role that advanced software solutions play in enabling horizontal scaling, multi-region replication, and strong consistency guarantees. Vendors are continuously innovating to enhance the performance, security, and manageability of their offerings, incorporating features such as automated sharding, query optimization, and real-time monitoring. The adoption of open-source distributed SQL databases is also gaining momentum, as organizations seek to avoid vendor lock-in and reduce total cost of ownership.



    The se

  9. Bike Warehouse SQL Project

    • kaggle.com
    Updated Jan 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Safae Ahb (2025). Bike Warehouse SQL Project [Dataset]. https://www.kaggle.com/datasets/safaeahb/bike-warehouse-sql-project
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Safae Ahb
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    SAP Bikes Sales : SQL Project

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22744129%2Fdd8e395e5d70bde9279f0f653b4bc2bf%2FGemini_Generated_Image_cvz71ncvz71ncvz7.jpg?generation=1736783649344014&alt=media" alt=""> This project involves analyzing and transforming data from a bike warehouse database using SQL. The goal is to clean, transform, and query the data to generate insights about products, employees, customers, sales, and trends.

    Overview

    The SAP Bikes Sales database contains various tables that represent business data for a bike warehouse, such as information on products, sales, employees, business partners, and more. This project focuses on cleaning and transforming data, optimizing database schema, and generating SQL queries to gain business insights.

    Key SQL Operations:

    1.**Data Cleaning & Transformation**: - Remove duplicate records from key tables. - Drop unnecessary columns and handle null values. - Populate new columns based on existing data. - Merge related tables to create new insights. 2.**Business Insights Queries**: - Top-selling Products: Identify products with the highest sales quantities and total revenue. - Sales Performance by Product Category: Analyze revenue and order counts by product category. - Employee Sales Performance: Track employees' contribution to sales volumes and revenue. - Customer Segmentation: Examine the number of orders placed by business partners and their total sales value. - Sales Trends: Analyze sales trends over time and calculate average order values.

    Tables Involved

    • Addresses: Contains information about addresses: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22744129%2F1a5b39b4f402dfce31ea25d6d53c2f38%2FAdresses%20Table.PNG?generation=1736780543250265&alt=media" alt="">
    • BusinessPartners: Contains details about business partners: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22744129%2F416a9be40526650a4167dfdc565dfbe6%2FBusinessPartners%20Table.PNG?generation=1736780656503685&alt=media" alt="">
    • Employees: Contains employee information: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22744129%2F5b99849bde2bc052cc1d6cc7d52fb67d%2FEmployees%20Table.PNG?generation=1736780677194831&alt=media" alt="">
    • ProductCategories & ProductCategoryText: Describe product categories and their descriptions: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22744129%2F928f9aeb937c2fdc8d8860cc8d23f9d7%2FProductCategories%20Table.PNG?generation=1736780784495223&alt=media" alt=""> https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22744129%2Fe148078e53777ca1180c5adf6cec7dda%2FProductCategory%20Text%20Table.PNG?generation=1736780831995071&alt=media" alt="">
    • Products & ProductTexts: Contain product details and product descriptions: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22744129%2Fdd4eb334332ec5d9248ccb8b737dd2df%2FProducts%20Table.PNG?generation=1736780894684724&alt=media" alt=""> https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22744129%2Faceb93b69951b1fde1f46bac146a9aa0%2FProductTexts%20Table.PNG?generation=1736782044055973&alt=media" alt="">
    • SalesOrderItems: Contains details of individual items within a sales order: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22744129%2Fe0ba42ddc00634ce1728e013dbeb231c%2FSalesOrderItemsTable.PNG?generation=1736781074515668&alt=media" alt="">
    • SalesOrders: Contains information about sales orders: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F22744129%2F0a67d6ba6ded143676db9f0b4f4dfcb0%2FSalesOrders%20Table.PNG?generation=1736781089531236&alt=media" alt="">

    Key SQL Queries

    1. Data Cleaning and Transformation:

    -**Addresses Table**: -Checking for duplicates ADDRESSID. -**BusinessPartners Table**: -Handled duplicates, missing or incorrect data. -Dropped the unnecessary FAXNUMBER column because it was empty. -**Employee Table**: -Dropped unnecessary columns. -Populated NAME_INITIALS based on employee's first, middle, and last name initials. -Fixed column type issues. -**Product Categories and Product Texts**: -Merged ProductCategories and ProductCategoryText tables into a new CombinedProductCategories table for easy analysis. -**Products Table**: -Dropped irrelevant columns such as WIDTH, DEPTH, HEIGHT, etc. -**Sales Order Items Table**: -Fixed null values in GROSSAMOUNT and created a TOTALGROSSAMOUNT column to track sales volume.

    ###2. Database Diagram and Relationships In addition to the data cleaning and analysis, a database diagram has been create...

  10. h

    synthetic_text_to_sql

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gretel.ai, synthetic_text_to_sql [Dataset]. https://huggingface.co/datasets/gretelai/synthetic_text_to_sql
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset provided by
    Gretel.ai
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Image generated by DALL-E. See prompt for more details

      synthetic_text_to_sql
    

    gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:

    105,851 records partitioned into 100,000 train and 5,851 test records ~23M total tokens, including ~12M SQL tokens Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.

  11. G

    SQL-Verified Generation Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Growth Market Reports (2025). SQL-Verified Generation Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/sql-verified-generation-market
    Explore at:
    csv, pptx, pdfAvailable download formats
    Dataset updated
    Oct 7, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    SQL-Verified Generation Market Outlook



    According to our latest research, the global SQL-Verified Generation market size reached USD 2.18 billion in 2024, driven by the increasing demand for data integrity, compliance, and advanced analytics across industries. The market is expected to grow at a robust CAGR of 13.5% from 2025 to 2033, reaching a projected value of USD 6.47 billion by 2033. This growth is primarily attributed to the surge in data-driven decision-making, the proliferation of digital transformation initiatives, and the rising regulatory requirements for data validation and security worldwide. As organizations continue to prioritize accurate, reliable, and compliant data operations, the adoption of SQL-Verified Generation solutions is set to accelerate steadily over the forecast period.



    One of the primary growth factors propelling the SQL-Verified Generation market is the exponential increase in data volumes generated by enterprises across various sectors. As organizations collect and process vast amounts of structured and unstructured data, the need for robust SQL-based verification tools becomes critical to ensure data accuracy, consistency, and reliability. These solutions enable businesses to automate data validation processes, minimize human errors, and enhance the overall quality of business intelligence outputs. Furthermore, the integration of artificial intelligence and machine learning with SQL-Verified Generation platforms is enabling advanced analytics and real-time data validation, further amplifying market growth. The ongoing digital transformation across industries such as BFSI, healthcare, and retail is pushing organizations to invest in scalable and secure data management solutions, thereby fueling the demand for SQL-Verified Generation systems.



    Another significant driver for market expansion is the stringent regulatory landscape governing data privacy, security, and compliance. With regulations such as GDPR, HIPAA, and CCPA becoming more rigorous, organizations are under increasing pressure to ensure that their data processing and storage practices are fully compliant. SQL-Verified Generation solutions play a vital role in automating compliance checks, auditing data flows, and providing transparent reporting mechanisms. These capabilities not only help organizations avoid hefty fines and reputational damage but also build trust with stakeholders and customers. The growing emphasis on data governance and risk management is prompting enterprises to adopt advanced SQL-Verified Generation tools that offer comprehensive compliance and auditing functionalities, thereby contributing to sustained market growth.



    Technological advancements and the shift towards cloud-based infrastructures are also catalyzing the adoption of SQL-Verified Generation solutions. Cloud deployment offers unparalleled scalability, flexibility, and cost-effectiveness, making it an attractive option for organizations of all sizes. The increasing integration of SQL-Verified Generation tools with cloud-based platforms and enterprise resource planning (ERP) systems is streamlining data integration, validation, and reporting processes. This seamless integration is not only enhancing operational efficiency but also enabling organizations to leverage real-time insights for strategic decision-making. As cloud adoption continues to rise, especially among small and medium enterprises, the market for SQL-Verified Generation solutions is poised for significant expansion in the coming years.



    From a regional perspective, North America and Europe are currently leading the SQL-Verified Generation market, driven by the presence of major technology providers, high digital adoption rates, and strict regulatory frameworks. However, the Asia Pacific region is emerging as a lucrative market, fueled by rapid digitalization, increasing investments in IT infrastructure, and a growing focus on data-driven business strategies. Countries such as China, India, and Japan are witnessing substantial growth in sectors like BFSI, healthcare, and e-commerce, which is boosting the demand for advanced data validation and compliance solutions. As organizations in emerging economies continue to embrace digital transformation, the SQL-Verified Generation market is expected to witness robust growth across all major regions.



  12. Z

    Stack Overflow Chat Dutch

    • data.niaid.nih.gov
    Updated Jun 20, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Vanroy, Bram (2023). Stack Overflow Chat Dutch [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8052379
    Explore at:
    Dataset updated
    Jun 20, 2023
    Authors
    Vanroy, Bram
    License

    http://www.apache.org/licenses/LICENSE-2.0http://www.apache.org/licenses/LICENSE-2.0

    Description

    This dataset contains 56,964 conversations between een AI assistant and a (fake) "Human" (generated) in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.

    Data Instances

    {'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},

    Data Fields

    id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]

    input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent. and has at least one occurrence of both [|AI|] and [|Human|].

    topic: the topic description

    Dataset Creation

    Both the translations and the topics were translated with OpenAI's API for gpt-3.5-turbo. max_tokens=1024, temperature=0 as parameters.

    The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):

    CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.

    Here are the requirements that you should adhere to: 1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other; 2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places; 3. ensure accurate translation and keep the correctness of the conversation; 4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias; 5. translate the human's text using informal, but standard, language; 6. make sure to avoid biases (such as gender bias, grammatical bias, social bias); 7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang}; 8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested; 9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.

    Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.

    """

    The prompt to translate the topic is:

    TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct,"
    " summarizing manner. Translate accurately and formally. Do not provide any explanation"
    " about the translation and do not include the original title.

    "

    The system message was:

    You are a helpful assistant that translates English to Dutch to the requirements that are given to you.

    Note that 82 items (0.1%) were not successfully translated. The translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs for the missing items are [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992].

    The translation quality has not been verified. Use at your own risk!

    Licensing Information

    Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

    This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.

  13. D

    Database Testing Tool Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 9, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Archive Market Research (2025). Database Testing Tool Report [Dataset]. https://www.archivemarketresearch.com/reports/database-testing-tool-26309
    Explore at:
    pdf, ppt, docAvailable download formats
    Dataset updated
    Feb 9, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global database testing tool market is anticipated to experience substantial growth in the coming years, driven by factors such as the increasing adoption of cloud-based technologies, the rising demand for data quality and accuracy, and the growing complexity of database systems. The market is expected to reach a value of USD 1,542.4 million by 2033, expanding at a CAGR of 7.5% during the forecast period of 2023-2033. Key players in the market include Apache JMeter, DbFit, SQLMap, Mockup Data, SQL Test, NoSQLUnit, Orion, ApexSQL, QuerySurge, DBUnit, DataFactory, DTM Data Generator, Oracle, SeLite, SLOB, and others. The North American region is anticipated to hold a significant share of the database testing tool market, followed by Europe and Asia Pacific. The increasing adoption of cloud-based database testing services, the presence of key market players, and the growing demand for data testing and validation are driving the market growth in North America. Asia Pacific, on the other hand, is expected to experience the highest growth rate due to the rapidly increasing IT spending, the emergence of new technologies, and the growing number of businesses investing in data quality management solutions.

  14. d

    dune-sql-generate-large-series-sample

    • dune.com
    Updated Oct 23, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    springzhang (2025). dune-sql-generate-large-series-sample [Dataset]. https://dune.com/discover/content/relevant?q=author:springzhang&resource-type=queries
    Explore at:
    Dataset updated
    Oct 23, 2025
    Authors
    springzhang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Blockchain data query: dune-sql-generate-large-series-sample

  15. (Sunset)📒 Meta Kaggle ported to MS SQL SERVER

    • kaggle.com
    zip
    Updated Mar 20, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    BwandoWando (2024). (Sunset)📒 Meta Kaggle ported to MS SQL SERVER [Dataset]. https://www.kaggle.com/datasets/bwandowando/meta-kaggle-ported-to-sql-server-2022-database
    Explore at:
    zip(8635902534 bytes)Available download formats
    Dataset updated
    Mar 20, 2024
    Authors
    BwandoWando
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I've always wanted to explore Kaggle's Meta Kaggle dataset but I am more comfortable on using TSQL when it comes to writing (very) complex queries. Also, I tend to write queries faster when using SQL MANAGEMENT STUDIO, like 100x faster. So, I ported Kaggle's Meta Kaggle dataset into MS SQL SERVER 2022 database format, created a backup file, then uploaded it here.

    • MSSQL VERSION: SQL Server 2022
    • Collation: SQL_Latin1_General_CP1_CI_AS
    • Recovery model: simple

    Requirements

    • Download and install the SQL SERVER 2022 Developer edition here
    • Download the backup file
    • Restore the backup file into your local. If you havent done this before, it's easy and straightforward. Here is a guide.

    (QUOTED FROM THE ORIGINAL DATASET)

    Meta Kaggle

    Explore Kaggle's public data on competitions, datasets, kernels (code/ notebooks) and more Meta Kaggle may not be the Rosetta Stone of data science, but they think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1842206%2F2ad97bce7839d6e57674e7a82981ed23%2F2Egeb8R.png?generation=1688912953875842&alt=media" alt="">

    Notes

  16. G

    SQL Query Engine Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Growth Market Reports (2025). SQL Query Engine Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/sql-query-engine-market
    Explore at:
    csv, pptx, pdfAvailable download formats
    Dataset updated
    Aug 22, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    SQL Query Engine Market Outlook



    According to our latest research, the global SQL Query Engine market size in 2024 stands at USD 3.84 billion, reflecting robust growth driven by the increasing demand for efficient data management and analytics solutions across industries. The market is projected to expand at a CAGR of 12.1% from 2025 to 2033, reaching an estimated value of USD 10.77 billion by the end of the forecast period. This remarkable growth is underpinned by the escalating volume of structured and unstructured data, the proliferation of cloud-based applications, and the widespread adoption of advanced analytics and business intelligence tools.



    One of the primary growth factors driving the SQL Query Engine market is the exponential increase in data generation from digital transformation initiatives, IoT devices, and enterprise applications. Organizations are increasingly relying on SQL query engines to extract actionable insights from vast datasets, enabling informed decision-making and operational efficiency. The integration of SQL engines with big data platforms and cloud environments further amplifies their utility, as businesses seek scalable and high-performance solutions that can seamlessly handle complex queries across distributed data sources. This trend is particularly pronounced in industries such as BFSI, healthcare, and retail, where real-time data analysis is critical for competitive advantage and regulatory compliance.



    Another significant driver is the rapid evolution of cloud computing and the migration of enterprise workloads to cloud platforms. Cloud-based SQL query engines offer flexibility, scalability, and cost-effectiveness, making them highly attractive to organizations looking to modernize their IT infrastructure. The ability to run SQL queries on cloud-native data warehouses and integrate with various analytics tools has democratized access to advanced data capabilities, even for small and medium enterprises. Furthermore, innovations in query optimization, parallel processing, and support for hybrid and multi-cloud deployments are fostering greater adoption of SQL query engines across diverse business environments.



    The market is also benefiting from the growing emphasis on business intelligence and data-driven decision-making. Enterprises are leveraging SQL query engines to power dashboards, generate real-time reports, and facilitate self-service analytics for non-technical users. Enhanced support for structured query language, improved user interfaces, and integration with visualization tools are making it easier for business users to interact with data, driving broader usage across organizations. Additionally, the rise of data integration and analytics as core business functions is pushing vendors to continuously innovate, offering advanced features such as in-memory processing, machine learning integration, and support for semi-structured data formats.



    Regionally, North America continues to dominate the SQL Query Engine market, accounting for the largest revenue share in 2024. This is attributed to the strong presence of technology giants, early adoption of cloud technologies, and a thriving ecosystem of data-driven enterprises. However, Asia Pacific is expected to exhibit the fastest growth during the forecast period, fueled by rapid digitalization, increasing investments in cloud infrastructure, and the emergence of new business models in countries such as China, India, and Japan. Europe, Latin America, and the Middle East & Africa are also witnessing steady growth, supported by regulatory mandates for data governance and the rising importance of analytics in public and private sectors.





    Component Analysis



    The SQL Query Engine market is segmented by component into Software and Services. The software segment commands a substantial share of the market, as enterprises increasingly invest in advanced query engines to enhance their data processing and analytics capabilities. Modern SQL query engine software offers robust features such as distributed query pro

  17. Bike Store Relational Database | SQL

    • kaggle.com
    zip
    Updated Aug 21, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dillon Myrick (2023). Bike Store Relational Database | SQL [Dataset]. https://www.kaggle.com/datasets/dillonmyrick/bike-store-sample-database
    Explore at:
    zip(94412 bytes)Available download formats
    Dataset updated
    Aug 21, 2023
    Authors
    Dillon Myrick
    Description

    This is the sample database from sqlservertutorial.net. This is a great dataset for learning SQL and practicing querying relational databases.

    Database Diagram:

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4146319%2Fc5838eb006bab3938ad94de02f58c6c1%2FSQL-Server-Sample-Database.png?generation=1692609884383007&alt=media" alt="">

    Terms of Use

    The sample database is copyrighted and cannot be used for commercial purposes. For example, it cannot be used for the following but is not limited to the purposes: - Selling - Including in paid courses

  18. G

    Guarded SQL Generation Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Growth Market Reports (2025). Guarded SQL Generation Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/guarded-sql-generation-market
    Explore at:
    csv, pptx, pdfAvailable download formats
    Dataset updated
    Oct 7, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Guarded SQL Generation Market Outlook



    According to our latest research, the Guarded SQL Generation market size reached USD 1.26 billion in 2024, reflecting robust demand across key verticals. The market is expected to grow at a CAGR of 17.4% from 2025 to 2033, culminating in a forecasted value of USD 5.01 billion by 2033. This rapid expansion is propelled by growing data security concerns, the proliferation of digital transformation initiatives, and the increasing adoption of automated solutions to mitigate SQL injection and related cyber threats.




    The primary growth driver for the Guarded SQL Generation market is the exponential rise in cyber-attacks targeting databases, particularly SQL injection attacks. Organizations across sectors are facing heightened risks associated with data breaches, pushing them to invest in advanced SQL generation tools that automatically produce secure, parameterized queries. These solutions significantly reduce the attack surface by ensuring that SQL code is generated in a way that inherently guards against vulnerabilities. The integration of artificial intelligence and machine learning into these tools further enhances their ability to detect and neutralize evolving threats, making them indispensable in today’s security-conscious environment. The growing regulatory landscape, with stringent data protection standards such as GDPR and CCPA, is also compelling enterprises to adopt robust SQL generation solutions, as compliance failures can result in substantial penalties and reputational damage.




    A secondary, but equally important, factor fueling the growth of the Guarded SQL Generation market is the ongoing digital transformation across industries. As organizations migrate workloads to the cloud and embrace microservices architectures, the complexity and volume of database interactions have surged. This has made traditional, manual approaches to SQL code generation both impractical and risky. Automated, guarded SQL generation tools provide a scalable and efficient solution, enabling businesses to maintain speed and agility without compromising security. The rapid adoption of DevOps and continuous integration/continuous deployment (CI/CD) pipelines has further accelerated the need for such solutions, as development cycles shorten and the margin for error narrows. Consequently, vendors are focusing on delivering seamless integration capabilities, user-friendly interfaces, and comprehensive support for a wide range of database environments.




    Moreover, the rising adoption of cloud-based applications and the proliferation of SaaS platforms have created additional growth opportunities for the Guarded SQL Generation market. Enterprises are increasingly seeking cloud-native solutions that can be rapidly deployed and scaled to meet fluctuating demands. Cloud deployment not only offers cost advantages but also enhances collaboration and accessibility, making it an attractive choice for organizations of all sizes. Vendors are responding by offering flexible, subscription-based pricing models and robust APIs that facilitate integration with existing cloud ecosystems. This shift is particularly pronounced among small and medium enterprises (SMEs), which often lack the resources to maintain dedicated security teams and are thus highly motivated to adopt automated, managed solutions.




    From a regional perspective, North America continues to dominate the Guarded SQL Generation market, driven by a mature IT infrastructure, high awareness of cybersecurity threats, and significant investments in cutting-edge technologies. However, Asia Pacific is emerging as the fastest-growing region, supported by rapid digitalization, a burgeoning startup ecosystem, and increasing regulatory scrutiny. Europe is also witnessing steady growth, buoyed by strict data privacy laws and a proactive approach to cybersecurity. Latin America and the Middle East & Africa are gradually catching up, with governments and enterprises recognizing the importance of robust database security solutions in safeguarding critical information assets.





    <h2 id='component-analysis' &g

  19. Data from: Automating pharmacovigilance evidence generation: Using large...

    • data.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Feb 3, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jeffery Painter; Venkateswara Chalamalasetti; Raymond Kassekert; Andrew Bate (2025). Automating pharmacovigilance evidence generation: Using large language models to produce context-aware SQL [Dataset]. http://doi.org/10.5061/dryad.2280gb63n
    Explore at:
    zipAvailable download formats
    Dataset updated
    Feb 3, 2025
    Dataset provided by
    GlaxoSmithKline (United Kingdom)
    GlaxoSmithKline (United States)
    Authors
    Jeffery Painter; Venkateswara Chalamalasetti; Raymond Kassekert; Andrew Bate
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To enhance the accuracy of information retrieval from pharmacovigilance (PV) databases by employing Large Language Models (LLMs) to convert natural language queries (NLQs) into Structured Query Language (SQL) queries, leveraging a business context document. Materials and Methods: We utilized OpenAI’s GPT-4 model within a retrieval-augmented generation (RAG) framework, enriched with a business context document, to transform NLQs into executable SQL queries. Each NLQ was presented to the LLM randomly and independently to prevent memorization. The study was conducted in three phases, varying query complexity, and assessing the LLM's performance both with and without the business context document. Results: Our approach significantly improved NLQ-to-SQL accuracy, increasing from 8.3% with the database schema alone to 78.3% with the business context document. This enhancement was consistent across low, medium, and high complexity queries, indicating the critical role of contextual knowledge in query generation. Discussion: The integration of a business context document markedly improved the LLM's ability to generate accurate SQL queries (i.e. both executable and returning semantically appropriate results). Performance achieved a maximum of 85% when high complexity queries are excluded, suggesting promise for routine deployment. Conclusion: This study presents a novel approach to employing LLMs for safety data retrieval and analysis, demonstrating significant advancements in query generation accuracy. The methodology offers a framework applicable to various data-intensive domains, enhancing the accessibility of information retrieval for non-technical users. Methods Test set of NLQ's used in the paper Automating Pharmacovigilance Evidence Generation: Using Large Language Models to Produce Context-Aware SQL. Also included are the Python scripts for the LLM processing, the R code for statistical analysis of results, and a copy of the business context document and essential tables.

  20. Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

    • zenodo.org
    bin, json, txt
    Updated Aug 16, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson; Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
    Explore at:
    txt, json, binAvailable download formats
    Dataset updated
    Aug 16, 2021
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson; Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

    It contains the following files:

    - spider-realistic.json
    # The spider-realistic evaluation set
    # Examples: 508
    # Databases: 19
    - dev.json
    # The original dev split of Spider
    # Examples: 1034
    # Databases: 20
    - tables.json
    # The original DB schemas from Spider
    # Databases: 166
    - README.txt
    - license

    The Spider-Realistic dataset is created based on the dev split of the Spider dataset realsed by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mention of the column names removed. The sql queries and databases are kept unchanged.
    For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
    For the database files please refer to the official Spider release https://yale-lily.github.io/spider.

    This dataset is distributed under the CC BY-SA 4.0 license.

    If you use the dataset, please cite the following papers including the original Spider datasets, Finegan-Dollak et al., 2018 and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

    @article{deng2020structure,
    title={Structure-Grounded Pretraining for Text-to-SQL},
    author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
    journal={arXiv preprint arXiv:2010.12773},
    year={2020}
    }

    @inproceedings{Yu&al.18c,
    year = 2018,
    title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
    booktitle = {EMNLP},
    author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
    }

    @InProceedings{P18-1033,
    author = "Finegan-Dollak, Catherine
    and Kummerfeld, Jonathan K.
    and Zhang, Li
    and Ramanathan, Karthik
    and Sadasivam, Sesh
    and Zhang, Rui
    and Radev, Dragomir",
    title = "Improving Text-to-SQL Evaluation Methodology",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "351--360",
    location = "Melbourne, Australia",
    url = "http://aclweb.org/anthology/P18-1033"
    }

    @InProceedings{data-sql-imdb-yelp,
    dataset = {IMDB and Yelp},
    author = {Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig},
    title = {SQLizer: Query Synthesis from Natural Language},
    booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
    month = {October},
    year = {2017},
    pages = {63:1--63:26},
    url = {http://doi.org/10.1145/3133887},
    }

    @article{data-academic,
    dataset = {Academic},
    author = {Fei Li and H. V. Jagadish},
    title = {Constructing an Interactive Natural Language Interface for Relational Databases},
    journal = {Proceedings of the VLDB Endowment},
    volume = {8},
    number = {1},
    month = {September},
    year = {2014},
    pages = {73--84},
    url = {http://dx.doi.org/10.14778/2735461.2735468},
    }

    @InProceedings{data-atis-geography-scholar,
    dataset = {Scholar, and Updated ATIS and Geography},
    author = {Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer},
    title = {Learning a Neural Semantic Parser from User Feedback},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2017},
    pages = {963--973},
    location = {Vancouver, Canada},
    url = {http://www.aclweb.org/anthology/P17-1089},
    }

    @inproceedings{data-geography-original
    dataset = {Geography, original},
    author = {John M. Zelle and Raymond J. Mooney},
    title = {Learning to Parse Database Queries Using Inductive Logic Programming},
    booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
    year = {1996},
    pages = {1050--1055},
    location = {Portland, Oregon},
    url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
    }

    @inproceedings{data-restaurants-logic,
    author = {Lappoon R. Tang and Raymond J. Mooney},
    title = {Automated Construction of Database Interfaces: Intergrating Statistical and Relational Learning for Semantic Parsing},
    booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
    year = {2000},
    pages = {133--141},
    location = {Hong Kong, China},
    url = {http://www.aclweb.org/anthology/W00-1317},
    }

    @inproceedings{data-restaurants-original,
    author = {Ana-Maria Popescu, Oren Etzioni, and Henry Kautz},
    title = {Towards a Theory of Natural Language Interfaces to Databases},
    booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
    year = {2003},
    location = {Miami, Florida, USA},
    pages = {149--157},
    url = {http://doi.acm.org/10.1145/604045.604070},
    }

    @inproceedings{data-restaurants,
    author = {Alessandra Giordani and Alessandro Moschitti},
    title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
    booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
    year = {2012},
    location = {Montpellier, France},
    pages = {59--76},
    url = {https://doi.org/10.1007/978-3-642-45260-4_5},
    }

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Jeroen Castelein; Maurício Aniche; Maurício Aniche; Mozhan Soltani; Annibale Panichella; Arie van Deursen; Jeroen Castelein; Mozhan Soltani; Annibale Panichella; Arie van Deursen (2020). Search-Based Test Data Generation for SQL Queries: Appendix [Dataset]. http://doi.org/10.5281/zenodo.1166023
Organization logo

Search-Based Test Data Generation for SQL Queries: Appendix

Explore at:
3 scholarly articles cite this dataset (View in Google Scholar)
zipAvailable download formats
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jeroen Castelein; Maurício Aniche; Maurício Aniche; Mozhan Soltani; Annibale Panichella; Arie van Deursen; Jeroen Castelein; Mozhan Soltani; Annibale Panichella; Arie van Deursen
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The appendix of our ICSE 2018 paper "Search-Based Test Data Generation for SQL Queries: Appendix".

The appendix contains:

  • The queries from the three open source systems we used in the evaluation of our tool (the industry software system is not part of this appendix, due to privacy reasons)
  • The results of our evaluation.
  • The source code of the tool. Most recent version can be found at https://github.com/SERG-Delft/evosql.
  • The results of the tuning procedure we conducted before running the final evaluation.
Search
Clear search
Close search
Google apps
Main menu