100+ datasets found
  1. Data from: SQL Injection Attack Netflow

    • data.niaid.nih.gov
    • portalcienciaytecnologia.jcyl.es
    • +3 more
    Updated Sep 28, 2022
    + more versions
    Cite
    Ignacio Crespo; Adrián Campazas (2022). SQL Injection Attack Netflow [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6907251
    Explore at:
    Dataset updated
    Sep 28, 2022
    Authors
    Ignacio Crespo; Adrián Campazas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This dataset contains SQL injection attacks (SQLIA) as the malicious NetFlow data. The attacks carried out are Union-query SQL injection and Blind SQL injection, performed with the SQLMAP tool.

    The NetFlow traffic was generated using DOROTHEA (DOcker-based fRamework fOr gaTHering nEtflow trAffic). NetFlow is a network protocol developed by Cisco for collecting and monitoring network traffic flow data. A flow is defined as a unidirectional sequence of packets sharing common properties that pass through a network device.
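    For orientation, here is a minimal sketch of the kind of fields a NetFlow v5 flow record typically carries. The field names follow common NetFlow v5 conventions and are not the dataset's exact column names.

```python
# Rough sketch of a NetFlow v5 flow record; names follow common NetFlow v5
# conventions and may differ from the dataset's actual columns.
example_flow = {
    "srcaddr": "192.168.1.10",   # source IP address
    "dstaddr": "192.168.1.200",  # destination IP address
    "srcport": 51234,            # source port
    "dstport": 443,              # destination port
    "prot": 6,                   # IP protocol number (6 = TCP)
    "dPkts": 12,                 # packets in the flow
    "dOctets": 4096,             # bytes in the flow
    "first": 1664355600000,      # SysUptime at the start of the flow (ms)
    "last": 1664355615000,       # SysUptime when the last packet was seen (ms)
    "tcp_flags": 0x1B,           # cumulative OR of TCP flags
    "tos": 0,                    # IP type of service
}
```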

    Datasets

    The first dataset (D1) was collected to train the detection models; a second dataset (D2) was collected using different attacks than those used in training, to test the models and ensure their generalization.

    The datasets contain both benign and malicious traffic. All collected datasets are balanced.

    The version of NetFlow used to build the datasets is 5.

    | Dataset | Aim      | Samples | Benign-malicious traffic ratio |
    |---------|----------|---------|--------------------------------|
    | D1      | Training | 400,003 | 50%                            |
    | D2      | Test     | 57,239  | 50%                            |

    Infrastructure and implementation

    Two sets of flow data were collected with DOROTHEA. DOROTHEA is a Docker-based framework for NetFlow data collection that allows you to build interconnected virtual networks to generate and collect flow data using the NetFlow protocol. In DOROTHEA, network traffic packets are sent to a NetFlow generator with the ipt_netflow sensor installed. The sensor is a Linux kernel module that uses Iptables to process packets and convert them into NetFlow flows.

    DOROTHEA is configured to use NetFlow v5 and to export a flow after it has been inactive for 15 seconds or active for 1,800 seconds (30 minutes).

    Benign traffic generation nodes simulate network traffic generated by real users, performing tasks such as searching in web browsers, sending emails, or establishing Secure Shell (SSH) connections. These tasks run as Python scripts; users may customize them or incorporate their own. The network traffic is managed by a gateway that performs two main tasks: it routes packets to the Internet, and it sends them to a NetFlow data generation node (packets received from the Internet are handled in the same way).

    The malicious traffic (the SQLI attacks) was generated using SQLMAP, a penetration-testing tool that automates the process of detecting and exploiting SQL injection vulnerabilities.

    The attacks were executed from 16 nodes, each launching SQLMAP with the parameters listed in the following table.

    | Parameters | Description |
    |------------|-------------|
    | '--banner', '--current-user', '--current-db', '--hostname', '--is-dba', '--users', '--passwords', '--privileges', '--roles', '--dbs', '--tables', '--columns', '--schema', '--count', '--dump', '--comments', '--schema' | Enumerate users, password hashes, privileges, roles, databases, tables and columns |
    | --level=5 | Increase the probability of a false positive identification |
    | --risk=3 | Increase the probability of extracting data |
    | --random-agent | Select the User-Agent randomly |
    | --batch | Never ask for user input, use the default behavior |
    | --answers="follow=Y" | Predefined answers to yes |

    Each node executed SQLIAs against 200 victim nodes. The victim nodes deployed a web form vulnerable to Union-type injection attacks, connected to either a MySQL or a SQL Server database engine (50% of the victim nodes deployed MySQL and the other 50% SQL Server).
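    A minimal sketch of how an attack node might invoke SQLMAP with the parameters from the table above; the target URL and the output handling are hypothetical illustrations, not taken from the dataset documentation.

```python
import subprocess

# Hypothetical victim URL inside the documented victim address space;
# the real victim endpoints are not published with the dataset.
target = "http://126.52.30.10/vulnerable_form.php?id=1"

cmd = [
    "sqlmap", "-u", target,
    "--banner", "--current-user", "--current-db", "--hostname", "--is-dba",
    "--users", "--passwords", "--privileges", "--roles",
    "--dbs", "--tables", "--columns", "--schema", "--count", "--dump", "--comments",
    "--level=5", "--risk=3",
    "--random-agent", "--batch", "--answers=follow=Y",
]

# Run SQLMAP against one victim node and capture its output.
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
```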

    The web service was accessible on ports 443 and 80, the ports typically used to deploy web services. The IP address space was 182.168.1.1/24 for the benign and malicious traffic-generating nodes; for the victim nodes, the address space was 126.52.30.0/24. The malicious traffic in the two datasets was collected under different conditions. For D1, SQLIAs were performed using Union attacks on the MySQL and SQL Server databases.

    For D2, however, Blind SQL injection attacks were performed against the web form connected to a PostgreSQL database. The IP address spaces of the networks also differed from those of D1: in D2, the address space was 152.148.48.1/24 for the benign and malicious traffic-generating nodes and 140.30.20.1/24 for the victim nodes.

    For the MySQL server, MariaDB version 10.4.12 was used; Microsoft SQL Server 2017 Express and PostgreSQL version 13 were used for the other engines.

  2. synthetic_text_to_sql

    • huggingface.co
    Cite
    Gretel.ai, synthetic_text_to_sql [Dataset]. https://huggingface.co/datasets/gretelai/synthetic_text_to_sql
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Gretel.ai
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description


      synthetic_text_to_sql
    

    gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:

    • 105,851 records partitioned into 100,000 train and 5,851 test records
    • ~23M total tokens, including ~12M SQL tokens
    • Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
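    A minimal sketch of loading the dataset with the Hugging Face datasets library; this is a typical usage pattern, not something prescribed by the dataset card.

```python
from datasets import load_dataset

# Load the synthetic text-to-SQL dataset from the Hugging Face Hub.
ds = load_dataset("gretelai/synthetic_text_to_sql")

print(ds)              # shows the train/test splits and their sizes
print(ds["train"][0])  # inspect one record (field names vary by dataset)
```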

  3. dune-sql-generate-large-series-sample

    • dune.com
    Updated Oct 23, 2025
    Cite
    springzhang (2025). dune-sql-generate-large-series-sample [Dataset]. https://dune.com/discover/content/relevant?q=author:springzhang&resource-type=queries
    Explore at:
    Dataset updated
    Oct 23, 2025
    Authors
    springzhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Blockchain data query: dune-sql-generate-large-series-sample

  4. Stack Overflow Chat Dutch

    • data.niaid.nih.gov
    Updated Jun 20, 2023
    + more versions
    Cite
    Vanroy, Bram (2023). Stack Overflow Chat Dutch [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8052379
    Explore at:
    Dataset updated
    Jun 20, 2023
    Authors
    Vanroy, Bram
    License

    Apache License, v2.0: http://www.apache.org/licenses/LICENSE-2.0

    Description

    This dataset contains 56,964 conversations between an AI assistant and a (fake, machine-generated) "Human" in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.

    Data Instances

    {'id': 9, 'input': 'Het gesprek tussen de mens en de AI-assistent. ' '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen ' 'van kolommen. ' '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp ' 'bij nodig? ' '[|Human|] Ik wil de output van een UNION-query ordenen op een ' 'kolom die niet aanwezig is in de SELECT-instructie. Is dat ' 'mogelijk? ' '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale ' 'positie van de kolom in de ORDER BY-clausule gebruiken in plaats ' 'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde ' 'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken. ' '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de ' 'ORDER BY-clausule? ' '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De ' 'alias moet worden gedefinieerd in de SELECT-instructie. Als je ' 'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen ' 'met ORDER BY col3. ' '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias ' 'gebruiken voor kolommen uit beide tabellen in de UNION? ' '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen ' 'uit beide tabellen in de UNION. Elke tabel in de UNION moet een ' 'unieke alias hebben voor de kolommen. ' '[|Human|] ', 'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},

    Data Fields

    id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]

    input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent. and has at least one occurrence of both [|AI|] and [|Human|].

    topic: the topic description
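    A minimal sketch, assuming the field layout described above, of splitting an input string into speaker turns; the helper function and regular expression are illustrations, not part of the dataset's tooling.

```python
import re

def split_turns(conversation: str):
    """Split an `input` string into (speaker, text) pairs using the
    [|Human|] / [|AI|] markers described above."""
    # Capture the speaker tag, then everything up to the next tag (or the end).
    pattern = re.compile(r"\[\|(Human|AI)\|\](.*?)(?=\[\|(?:Human|AI)\|\]|$)", re.S)
    return [(speaker, text.strip()) for speaker, text in pattern.findall(conversation)]

# Shortened example in the dataset's format:
example = ("Het gesprek tussen de mens en de AI-assistent. "
           "[|Human|] Ik heb wat hulp nodig met Spark SQL union. "
           "[|AI|] Natuurlijk, ik help je graag.")
for speaker, text in split_turns(example):
    print(speaker, "->", text)
```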

    Dataset Creation

    Both the conversations and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.

    The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):

    CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.

    Here are the requirements that you should adhere to:
    1. maintain the format: the conversation consists of the AI (marked as [|AI|]) and the human ([|Human|]) talking in turns and responding to each other;
    2. do not translate the speaker identifiers [|AI|] and [|Human|] but always copy them into the translation in appropriate places;
    3. ensure accurate translation and keep the correctness of the conversation;
    4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
    5. translate the human's text using informal, but standard, language;
    6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
    7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
    8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
    9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.

    Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.

    """

    The prompt to translate the topic is:

    TOPIC_TRANSLATION_PROMPT = (
        "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct,"
        " summarizing manner. Translate accurately and formally. Do not provide any explanation"
        " about the translation and do not include the original title."
    )

    The system message was:

    You are a helpful assistant that translates English to Dutch to the requirements that are given to you.
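    A minimal sketch of how such a translation call might look with the current OpenAI Python client; the dataset author's original client code is not included in this description, so the helper below (and the shortened prompt stand-in) is an illustration only.

```python
from openai import OpenAI

# Shortened stand-in for the CONVERSATION_TRANSLATION_PROMPT shown above.
CONVERSATION_TRANSLATION_PROMPT = (
    "You are asked to translate a conversation between an AI assistant and a human "
    "from {src_lang} into {tgt_lang}. [... requirements as listed above ...] "
    "Now translate the following conversation with the requirements set out above. "
    "Do not provide an explanation and do not add anything else.\n\n"
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate_conversation(conversation: str, src_lang="English", tgt_lang="Dutch") -> str:
    """Translate one conversation with the prompt and parameters described above."""
    prompt = CONVERSATION_TRANSLATION_PROMPT.format(src_lang=src_lang, tgt_lang=tgt_lang)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        max_tokens=1024,
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You are a helpful assistant that translates English to Dutch "
                        "to the requirements that are given to you."},
            {"role": "user", "content": prompt + conversation},
        ],
    )
    return response.choices[0].message.content
```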

    Note that 82 items (0.1%) were not successfully translated. The translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs for the missing items are [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992].

    The translation quality has not been verified. Use at your own risk!

    Licensing Information

    Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

    This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.

  5. SQL-GENERATOR-DATASETS

    • huggingface.co
    Updated Dec 2, 2024
    Cite
    FRANKLIN (2024). SQL-GENERATOR-DATASETS [Dataset]. https://huggingface.co/datasets/fknguedia/SQL-GENERATOR-DATASETS
    Explore at:
    Dataset updated
    Dec 2, 2024
    Authors
    FRANKLIN
    License

    https://choosealicense.com/licenses/other/

    Description

    View code: https://colab.research.google.com/drive/1rLk-mdsWsdxwQdYYJS24rAP9KABtbiqu?usp=sharing

      Example:

      {"messages": [
        {"role": "system", "content": "You are a SQL expert assistant. Generate clear, efficient SQL queries based on user requests. Provide only the SQL query without any additional text or explanation."},
        {"role": "user", "content": "What are the top 5 most popular genres of music in the database, based on the number of tracks… See the full description on the dataset page: https://huggingface.co/datasets/fknguedia/SQL-GENERATOR-DATASETS.
  6. SQL Practice File 1

    • kaggle.com
    zip
    Updated May 10, 2024
    Cite
    Sanjana Murthy (2024). SQL Practice File 1 [Dataset]. https://www.kaggle.com/datasets/sanjanamurthy392/sql-file-1-create-database-use-create-etc
    Explore at:
    Available download formats: zip (431 bytes)
    Dataset updated
    May 10, 2024
    Authors
    Sanjana Murthy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This data contains examples of CREATE DATABASE, USE, CREATE TABLE (INT, VARCHAR, DATE), DESCRIBE, ALTER TABLE (ADD, MODIFY, CHAR, VARCHAR, AFTER, RENAME COLUMN, TO, DROP COLUMN, DROP), SHOW TABLES, RENAME TABLE (TO), and DROP TABLE.

  7. sql-create-context

    • huggingface.co
    • opendatalab.com
    • +1 more
    Updated Apr 21, 2023
    + more versions
    Cite
    brianm (2023). sql-create-context [Dataset]. https://huggingface.co/datasets/b-mc2/sql-create-context
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 21, 2023
    Authors
    brianm
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset builds from WikiSQL and Spider. There are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and SQL queries answering the question using the CREATE statement as context. This dataset was built with text-to-SQL LLMs in mind, intending to prevent hallucination of column and table names often seen when models are trained on text-to-SQL datasets. The CREATE TABLE statement can often be copied and pasted from different DBMSs and provides table names, column… See the full description on the dataset page: https://huggingface.co/datasets/b-mc2/sql-create-context.
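    A minimal sketch of turning one record into a text-to-SQL prompt; the field names question/context/answer are assumptions about this dataset's schema, so check the dataset card before relying on them.

```python
from datasets import load_dataset

ds = load_dataset("b-mc2/sql-create-context", split="train")

record = ds[0]  # assumed fields: "question", "context" (CREATE TABLE), "answer" (SQL)

# Give the model the schema as context, which is the hallucination-reducing
# idea described above.
prompt = (
    f"-- Schema:\n{record['context']}\n"
    f"-- Question: {record['question']}\n"
    "-- Write the SQL query:\n"
)
print(prompt)
print("Expected SQL:", record["answer"])
```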

  8. SQL Generation AI Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). SQL Generation AI Market Research Report 2033 [Dataset]. https://dataintelo.com/report/sql-generation-ai-market
    Explore at:
    Available download formats: csv, pdf, pptx
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    SQL Generation AI Market Outlook



    According to our latest research, the global SQL Generation AI market size reached USD 1.42 billion in 2024, reflecting a robust expansion driven by the rapid adoption of artificial intelligence technologies in database management and analytics. The market is set to grow at a compelling CAGR of 27.6% from 2025 to 2033, with the total market size forecasted to reach USD 13.18 billion by 2033. This remarkable growth trajectory is primarily fueled by advancements in natural language processing, the increasing complexity of enterprise data environments, and the demand for automation in SQL query generation to enhance productivity and reduce operational costs.




    The primary growth factors propelling the SQL Generation AI market revolve around the escalating need for data-driven decision-making and the democratization of data access across organizations. As enterprises generate and store vast amounts of data, the ability to quickly and accurately extract actionable insights becomes critical. SQL Generation AI solutions, leveraging advanced machine learning and natural language processing algorithms, enable non-technical users to generate complex SQL queries using simple natural language instructions. This not only reduces the dependency on specialized database administrators but also accelerates the pace of business intelligence and analytics initiatives. The proliferation of self-service analytics and the integration of AI-powered query generation into popular business intelligence platforms further amplify market growth, making it easier for organizations to unlock the value of their data assets.




    Another significant driver is the ongoing digital transformation across various industries, which has led to the modernization of legacy IT infrastructures and the adoption of cloud-based data management solutions. Organizations are increasingly migrating their databases to the cloud to benefit from scalability, flexibility, and cost-efficiency. SQL Generation AI tools are being integrated with cloud data warehouses and analytics platforms, allowing for seamless query generation and real-time data analysis. This shift not only optimizes data workflows but also supports hybrid and multi-cloud strategies, enabling enterprises to manage and analyze data across diverse environments. The rising volume and diversity of data, coupled with the need for real-time insights, are compelling organizations to invest in AI-powered SQL generation to maintain a competitive edge.




    Additionally, the COVID-19 pandemic has accelerated the adoption of digital technologies, including AI-driven SQL generation, as organizations seek to automate routine tasks and enhance operational resilience. The growing emphasis on remote work and distributed teams has highlighted the importance of intuitive data access and collaboration tools. SQL Generation AI solutions facilitate seamless collaboration between business users and data teams, bridging the gap between technical and non-technical stakeholders. This has led to increased demand across sectors such as BFSI, healthcare, retail, and manufacturing, where timely data insights are crucial for strategic decision-making. The market is also witnessing heightened interest from small and medium enterprises, which are leveraging AI-powered SQL generation to level the playing field with larger competitors.




    Regionally, North America continues to dominate the SQL Generation AI market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The presence of major technology vendors, early adoption of AI and cloud technologies, and a strong focus on data-driven innovation contribute to North America's leadership position. Europe is witnessing rapid growth, driven by stringent data regulations and increasing investments in digital transformation initiatives. Meanwhile, Asia Pacific is emerging as a high-growth region, fueled by expanding IT infrastructure, a burgeoning startup ecosystem, and rising demand for advanced analytics solutions in countries such as China, India, and Japan. Latin America and the Middle East & Africa are also showing promising growth potential as organizations in these regions accelerate their digital journeys.



    Component Analysis



    The SQL Generation AI market by component is broadly segmented into Software and Services. The software segment commands the majority market share, as organizations increasingly dep

  9. Bike Warehouse SQL Project

    • kaggle.com
    Updated Jan 13, 2025
    Cite
    Safae Ahb (2025). Bike Warehouse SQL Project [Dataset]. https://www.kaggle.com/datasets/safaeahb/bike-warehouse-sql-project
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Safae Ahb
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    SAP Bikes Sales : SQL Project

    This project involves analyzing and transforming data from a bike warehouse database using SQL. The goal is to clean, transform, and query the data to generate insights about products, employees, customers, sales, and trends.

    Overview

    The SAP Bikes Sales database contains various tables that represent business data for a bike warehouse, such as information on products, sales, employees, business partners, and more. This project focuses on cleaning and transforming data, optimizing database schema, and generating SQL queries to gain business insights.

    Key SQL Operations:

    1. Data Cleaning & Transformation:
       • Remove duplicate records from key tables.
       • Drop unnecessary columns and handle null values.
       • Populate new columns based on existing data.
       • Merge related tables to create new insights.
    2. Business Insights Queries:
       • Top-selling Products: Identify products with the highest sales quantities and total revenue.
       • Sales Performance by Product Category: Analyze revenue and order counts by product category.
       • Employee Sales Performance: Track employees' contribution to sales volumes and revenue.
       • Customer Segmentation: Examine the number of orders placed by business partners and their total sales value.
       • Sales Trends: Analyze sales trends over time and calculate average order values.

    Tables Involved

    • Addresses: Contains information about addresses.
    • BusinessPartners: Contains details about business partners.
    • Employees: Contains employee information.
    • ProductCategories & ProductCategoryText: Describe product categories and their descriptions.
    • Products & ProductTexts: Contain product details and product descriptions.
    • SalesOrderItems: Contains details of individual items within a sales order.
    • SalesOrders: Contains information about sales orders.
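    Below is a minimal sketch of what the "Top-selling Products" insight query could look like if the cleaned tables were loaded into a local SQLite file. Apart from TOTALGROSSAMOUNT and the table names listed above, the identifiers (PRODUCTID, QUANTITY, the bike_warehouse.db filename) are assumptions, not the project's exact schema.

```python
import sqlite3

# Assumes the cleaned warehouse tables were loaded into a local SQLite file;
# column names other than TOTALGROSSAMOUNT are assumptions.
conn = sqlite3.connect("bike_warehouse.db")

top_selling = conn.execute("""
    SELECT p.PRODUCTID,
           SUM(i.QUANTITY)         AS total_quantity,
           SUM(i.TOTALGROSSAMOUNT) AS total_revenue
    FROM SalesOrderItems AS i
    JOIN Products        AS p ON p.PRODUCTID = i.PRODUCTID
    GROUP BY p.PRODUCTID
    ORDER BY total_revenue DESC
    LIMIT 5;
""").fetchall()

for product_id, quantity, revenue in top_selling:
    print(product_id, quantity, revenue)
```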

    Key SQL Queries

    1. Data Cleaning and Transformation:

    • Addresses Table: checked for duplicate ADDRESSID values.
    • BusinessPartners Table: handled duplicates and missing or incorrect data; dropped the unnecessary FAXNUMBER column because it was empty.
    • Employees Table: dropped unnecessary columns; populated NAME_INITIALS based on the employee's first, middle, and last name initials; fixed column type issues.
    • Product Categories and Product Texts: merged the ProductCategories and ProductCategoryText tables into a new CombinedProductCategories table for easier analysis.
    • Products Table: dropped irrelevant columns such as WIDTH, DEPTH, HEIGHT, etc.
    • Sales Order Items Table: fixed null values in GROSSAMOUNT and created a TOTALGROSSAMOUNT column to track sales volume.

    2. Database Diagram and Relationships:

    In addition to the data cleaning and analysis, a database diagram has been create...

  10. WikiSQL (Questions and SQL Queries)

    • kaggle.com
    zip
    Updated Nov 25, 2022
    Cite
    The Devastator (2022). WikiSQL (Questions and SQL Queries) [Dataset]. https://www.kaggle.com/datasets/thedevastator/dataset-for-developing-natural-language-interfac
    Explore at:
    Available download formats: zip (21,491,264 bytes)
    Dataset updated
    Nov 25, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    WikiSQL (Questions and SQL Queries)

    80654 hand-annotated questions and SQL queries on 24241 Wikipedia tables

    By Huggingface Hub [source]

    About this dataset

    A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.



    How to use the dataset

    This dataset can be used to develop natural language interfaces for relational databases. The data fields are the same across all splits, and each file contains the phase, question, table, and SQL query for each example.

    Research Ideas

    • This dataset can be used to develop natural language interfaces for relational databases.
    • This dataset can be used to develop a knowledge base of common SQL queries.
    • This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    Files: validation.csv, train.csv, and test.csv (all three share the same columns):

    | Column name | Description |
    |:------------|:------------|
    | phase | The phase of the data collection. (String) |
    | question | The question asked by the user. (String) |
    | table | The table containing the data for the question. (String) |
    | sql | The SQL query corresponding to the question. (String) |
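    A minimal sketch of loading one of these files with pandas; the path assumes the Kaggle archive has been downloaded and extracted into the working directory.

```python
import pandas as pd

# Assumes the Kaggle zip has been extracted locally.
df = pd.read_csv("validation.csv")

print(df.columns.tolist())             # expected: phase, question, table, sql
print(df[["question", "sql"]].head())  # peek at a few question/SQL pairs
```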

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. If you use this dataset in your research, please credit Huggingface Hub.

  11. sql-create-context-v2

    • huggingface.co
    Updated Feb 25, 2024
    Cite
    Rama Chetan Atmudi (2024). sql-create-context-v2 [Dataset]. https://huggingface.co/datasets/ramachetan22/sql-create-context-v2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 25, 2024
    Authors
    Rama Chetan Atmudi
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    sql-create-context-v2 Dataset

      Overview
    

    The sql-create-context-v2 dataset enhances the original dataset built from WikiSQL and Spider, focusing on text-to-SQL tasks with a special emphasis on reducing hallucination of column and table names. This version introduces a JSONL format for more efficient data processing and iteration, alongside a structured approach to representing SQL queries in the dataset entries.

      Key Enhancements
    

    Dataset Format: Transitioned to… See the full description on the dataset page: https://huggingface.co/datasets/ramachetan22/sql-create-context-v2.

  12. Source Code Archiving to the Rescue of Reproducible Deployment — Replication...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated May 23, 2024
    Cite
    Courtès, Ludovic; Sample, Timothy; Simon, Tournier; Zacchiroli, Stefano (2024). Source Code Archiving to the Rescue of Reproducible Deployment — Replication Package [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11243113
    Explore at:
    Dataset updated
    May 23, 2024
    Dataset provided by
    Institut Polytechnique de Paris
    Université Paris Cité
    Centre de Recherche Inria Bordeaux - Sud-Ouest
    Authors
    Courtès, Ludovic; Sample, Timothy; Simon, Tournier; Zacchiroli, Stefano
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication package for the paper:

    Ludovic Courtès, Timothy Sample, Simon Tournier, Stefano Zacchiroli. Source Code Archiving to the Rescue of Reproducible Deployment. ACM REP'24, June 18-20, 2024, Rennes, France. https://doi.org/10.1145/3641525.3663622

    Generating the paper

    The paper can be generated using the following command:

    guix time-machine -C channels.scm \
      -- shell -C -m manifest.scm \
      -- make

    This uses GNU Guix to run make in the exact same computational environment used when preparing the paper. The computational environment is described by two files. The channels.scm file specifies the exact version of the Guix package collection to use. The manifest.scm file selects a subset of those packages to include in the environment.

    It may be possible to generate the paper without Guix. To do so, you will need the following software (on top of a Unix-like environment):

    GNU Make

    SQLite 3

    GNU AWK

    Rubber

    Graphviz

    TeXLive

    Structure

    data/ contains the data examined in the paper

    scripts/ contains dedicated code for the paper

    logs/ contains logs generated during certain computations

    Preservation of Guix

    Some of the claims in the paper come from analyzing the Preservation of Guix (PoG) database as published on January 26, 2024. This database is the result of years of monitoring the extent to which the source code referenced by Guix packages is archived. This monitoring has been carried out by Timothy Sample who occasionally publishes reports on his personal website: https://ngyro.com/pog-reports/latest/. The database included in this package (data/pog.sql) was downloaded from https://ngyro.com/pog-reports/2024-01-26/pog.db and then exported to SQL format. In addition to the SQL file, the database schema is also included in this package as data/schema.sql.

    The database itself is largely the result of scripts, but also of manual adjustments (where necessary or convenient). The scripts are available at https://git.ngyro.com/preservation-of-guix/, which is preserved in the Software Heritage archive as well: https://archive.softwareheritage.org/swh:1:snp:efba3456a4aff0bc25b271e128aa8340ae2bc816;origin=https://git.ngyro.com/preservation-of-guix. These scripts rely on the availability of source code in certain locations on the Internet, and therefore will not yield exactly the same result when run again.

    Analysis

    Here is an overview of how we use the PoG database in the paper. The exact way it is queried to produce graphs and tables for the paper is laid out in the Makefile.

    The pog-types.sql query gives the counts of each source type (e.g. “git” or “tar-gz”) for each commit covered by the database.

    The pog-status.sql query gives the archival status of the sources by commit. For each commit, it produces a count of how many sources are stored in the Software Heritage archive, missing from it, or unknown if stored or missing. The pog-status-total.sql query does the same thing but over all sources without sorting them into individual commits.

    The disarchive-ratio.sql query estimates the success rate of Disarchive disassembly.

    Finally, the swhid-ratio.sql query gives the proportion of sources for which the PoG database has an SWHID.
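    Outside of the provided Makefile, here is a minimal sketch of loading the SQL dump into SQLite and running one of the queries mentioned above; the locations of the .sql query files, and the assumption that each holds a single SELECT statement, are guesses, so check the Makefile for the exact invocation used by the paper.

```python
import sqlite3
from pathlib import Path

# Recreate the PoG database from the SQL dump shipped in the replication package.
conn = sqlite3.connect(":memory:")
conn.executescript(Path("data/pog.sql").read_text())

# Run one of the analysis queries; the path and the single-SELECT assumption
# are guesses (see the Makefile for the exact usage).
query = Path("pog-types.sql").read_text()
for row in conn.execute(query):
    print(row)
```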

    Estimating missing sources

    The Preservation of Guix database only covers sources from a sample of commits to the Guix repository. This greatly simplifies the process of collecting the sources at the risk of missing a few. We estimate how many are missed by searching Guix’s Git history for Nix-style base-32 hashes. The result of this search is compared to the hashes in the PoG database.

    A naïve search of Git history results in an overestimate due to Guix's branch development model: we find hashes that were never exposed to users of ‘guix pull’. To work around this, we also approximate the history of commits available to ‘guix pull’. We do this by scraping push events from the guix-commits mailing list archives (data/guix-commits.mbox). Unfortunately, those archives are not quite complete. Missing history is reconstructed in the data/missing-links.txt file.

    This estimate requires a copy of the Guix Git repository (not included in this package). The repository can be obtained from GNU at https://git.savannah.gnu.org/git/guix.git or from the Software Heritage archive: https://archive.softwareheritage.org/swh:1:snp:9d7b8dcf5625c17e42d51357848baa226b70e4bb;origin=https://git.savannah.gnu.org/git/guix.git. Once obtained, its location must be specified in the Makefile.

    To generate the estimate, use:

    guix time-machine -C channels.scm \
      -- shell -C -m manifest.scm \
      -- make data/missing-sources.txt

    If not using Guix, you will need additional software beyond what is used to generate the paper:

    GNU Guile

    GNU Bash

    GNU Mailutils

    GNU Parallel

    Measuring link rot

    In order to measure link rot, we ran Guix Scheme scripts, i.e., scripts that use Guix as a Scheme library. The scripts depend on the state of the world at the specific moment they ran, so it is not possible to reproduce exactly the same outputs; however, their tendency over time should be very similar. To run them, you need an installation of Guix. For instance,

    guix repl -q scripts/table-per-origin.scm

    When running these scripts for the paper, we tracked their output and saved it inside the logs directory.

  13. Global SQL In-Memory Database Market Size By Type (SQL, Relational data...

    • verifiedmarketresearch.com
    Updated Jun 17, 2023
    Cite
    VERIFIED MARKET RESEARCH (2023). Global SQL In-Memory Database Market Size By Type (SQL, Relational data type, NEWSQL), By Application (Reporting, Transaction, Analytics), By Geographic Scope And Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/sql-in-memory-database-market/
    Explore at:
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    Verified Market Research (https://www.verifiedmarketresearch.com/)
    Authors
    VERIFIED MARKET RESEARCH
    License

    https://www.verifiedmarketresearch.com/privacy-policy/

    Time period covered
    2026 - 2032
    Area covered
    Global
    Description

    SQL In-Memory Database Market size was valued at USD 9.26 Billion in 2024 and is projected to reach USD 35.7 Billion by 2032, growing at a CAGR of 20.27% from 2026 to 2032.

    SQL In-Memory Database Market Drivers

    Demand for Real-Time Analytics and Processing: Businesses increasingly require real-time insights from their data to make faster and more informed decisions. SQL In-Memory databases excel at processing data much faster than traditional disk-based databases, enabling real-time analytics and operational dashboards.

    Growth of Big Data and IoT Applications: The rise of Big Data and the Internet of Things (IoT) generates massive amounts of data that needs to be processed quickly. SQL In-Memory databases can handle these high-velocity data streams efficiently due to their in-memory architecture.

    Improved Performance for Transaction Processing Systems (TPS): In-memory databases offer significantly faster query processing times compared to traditional databases. This translates to improved performance for transaction-intensive applications like online banking, e-commerce platforms, and stock trading systems.

    Reduced Hardware Costs (in some cases): While implementing an in-memory database might require an initial investment in additional RAM, it can potentially reduce reliance on expensive high-performance storage solutions in specific scenarios.

    Focus on User Experience and Application Responsiveness: In today's digital landscape, fast and responsive applications are crucial. SQL In-Memory databases contribute to a smoother user experience by enabling quicker data retrieval and transaction processing.

    However, it's important to consider some factors that might influence market dynamics:

    Limited Data Capacity: In-memory databases are typically limited by the amount of available RAM, making them less suitable for storing massive datasets compared to traditional disk-based solutions.

    Higher Implementation Costs: Setting up and maintaining an in-memory database can be more expensive due to the additional RAM requirements compared to traditional databases.

    Hybrid Solutions: Many organizations opt for hybrid database solutions that combine in-memory and disk-based storage, leveraging the strengths of both for different data sets and applications.

  14. SQL-Verified Generation Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 7, 2025
    Cite
    Growth Market Reports (2025). SQL-Verified Generation Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/sql-verified-generation-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Oct 7, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    SQL-Verified Generation Market Outlook



    According to our latest research, the global SQL-Verified Generation market size reached USD 2.18 billion in 2024, driven by the increasing demand for data integrity, compliance, and advanced analytics across industries. The market is expected to grow at a robust CAGR of 13.5% from 2025 to 2033, reaching a projected value of USD 6.47 billion by 2033. This growth is primarily attributed to the surge in data-driven decision-making, the proliferation of digital transformation initiatives, and the rising regulatory requirements for data validation and security worldwide. As organizations continue to prioritize accurate, reliable, and compliant data operations, the adoption of SQL-Verified Generation solutions is set to accelerate steadily over the forecast period.



    One of the primary growth factors propelling the SQL-Verified Generation market is the exponential increase in data volumes generated by enterprises across various sectors. As organizations collect and process vast amounts of structured and unstructured data, the need for robust SQL-based verification tools becomes critical to ensure data accuracy, consistency, and reliability. These solutions enable businesses to automate data validation processes, minimize human errors, and enhance the overall quality of business intelligence outputs. Furthermore, the integration of artificial intelligence and machine learning with SQL-Verified Generation platforms is enabling advanced analytics and real-time data validation, further amplifying market growth. The ongoing digital transformation across industries such as BFSI, healthcare, and retail is pushing organizations to invest in scalable and secure data management solutions, thereby fueling the demand for SQL-Verified Generation systems.



    Another significant driver for market expansion is the stringent regulatory landscape governing data privacy, security, and compliance. With regulations such as GDPR, HIPAA, and CCPA becoming more rigorous, organizations are under increasing pressure to ensure that their data processing and storage practices are fully compliant. SQL-Verified Generation solutions play a vital role in automating compliance checks, auditing data flows, and providing transparent reporting mechanisms. These capabilities not only help organizations avoid hefty fines and reputational damage but also build trust with stakeholders and customers. The growing emphasis on data governance and risk management is prompting enterprises to adopt advanced SQL-Verified Generation tools that offer comprehensive compliance and auditing functionalities, thereby contributing to sustained market growth.



    Technological advancements and the shift towards cloud-based infrastructures are also catalyzing the adoption of SQL-Verified Generation solutions. Cloud deployment offers unparalleled scalability, flexibility, and cost-effectiveness, making it an attractive option for organizations of all sizes. The increasing integration of SQL-Verified Generation tools with cloud-based platforms and enterprise resource planning (ERP) systems is streamlining data integration, validation, and reporting processes. This seamless integration is not only enhancing operational efficiency but also enabling organizations to leverage real-time insights for strategic decision-making. As cloud adoption continues to rise, especially among small and medium enterprises, the market for SQL-Verified Generation solutions is poised for significant expansion in the coming years.



    From a regional perspective, North America and Europe are currently leading the SQL-Verified Generation market, driven by the presence of major technology providers, high digital adoption rates, and strict regulatory frameworks. However, the Asia Pacific region is emerging as a lucrative market, fueled by rapid digitalization, increasing investments in IT infrastructure, and a growing focus on data-driven business strategies. Countries such as China, India, and Japan are witnessing substantial growth in sectors like BFSI, healthcare, and e-commerce, which is boosting the demand for advanced data validation and compliance solutions. As organizations in emerging economies continue to embrace digital transformation, the SQL-Verified Generation market is expected to witness robust growth across all major regions.



  15. In-Vehicle SQL Query Engine Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). In-Vehicle SQL Query Engine Market Research Report 2033 [Dataset]. https://dataintelo.com/report/in-vehicle-sql-query-engine-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    In-Vehicle SQL Query Engine Market Outlook



    According to our latest research, the global in-vehicle SQL query engine market size reached USD 1.29 billion in 2024, with a robust year-on-year growth trajectory. The market is projected to expand at a CAGR of 14.7% from 2025 to 2033, reaching an estimated USD 4.14 billion by 2033. This impressive growth is primarily driven by the automotive sector’s increasing adoption of real-time data management and analytics solutions, which are critical for modern connected vehicles and advanced driver-assistance systems. As per our latest research, the demand for in-vehicle SQL query engines is being propelled by the integration of advanced infotainment, safety, and telematics features that require seamless, high-speed data querying capabilities within vehicle architectures.




    One of the most significant growth factors for the in-vehicle SQL query engine market is the rapid digital transformation within the automotive industry. The proliferation of connected vehicles has led to an exponential increase in the volume and complexity of data generated by onboard sensors, infotainment units, and telematics systems. Automakers and OEMs are increasingly leveraging SQL query engines to process and analyze this data in real time, enabling enhanced vehicle performance, predictive maintenance, and personalized user experiences. The shift towards software-defined vehicles and the rising implementation of edge computing within vehicles further underscore the necessity for robust, scalable, and efficient SQL query engines that can handle diverse data streams and deliver actionable insights with minimal latency.




    Another key driver is the growing emphasis on safety, security, and regulatory compliance across global automotive markets. Advanced driver-assistance systems (ADAS), telematics, and fleet management platforms rely heavily on accurate and timely data retrieval to ensure optimal operation and compliance with evolving safety standards. SQL query engines provide the backbone for these mission-critical applications by facilitating complex queries and aggregating data from multiple subsystems in real time. Furthermore, as vehicles become increasingly autonomous and connected, the need for secure, reliable, and high-performance data management solutions is intensifying, pushing automotive stakeholders to invest in advanced in-vehicle SQL query engine technologies.




    The expanding ecosystem of electric vehicles (EVs) and the rise of mobility-as-a-service (MaaS) platforms are also catalyzing the growth of the in-vehicle SQL query engine market. EVs, with their sophisticated battery management systems and connected infrastructure, generate vast amounts of operational and diagnostic data that require efficient querying and analysis. Similarly, fleet operators and mobility service providers are adopting SQL query engines to optimize route planning, monitor vehicle health, and enhance customer satisfaction. These evolving market dynamics are expected to drive sustained demand for in-vehicle SQL query engines across both passenger and commercial vehicle segments, fostering innovation and competition among technology vendors.




    From a regional perspective, Asia Pacific is emerging as a dominant force in the in-vehicle SQL query engine market, fueled by the rapid adoption of connected vehicle technologies in China, Japan, and South Korea. North America and Europe continue to be significant markets, driven by early technological adoption, stringent safety regulations, and the presence of leading automotive OEMs. The Middle East & Africa and Latin America are also witnessing steady growth, supported by increasing investments in smart mobility and telematics infrastructure. The regional outlook remains highly promising, with each geography contributing uniquely to the global expansion of the in-vehicle SQL query engine market.



    Component Analysis



    The component segment of the in-vehicle SQL query engine market is broadly categorized into software, hardware, and services. The software sub-segment holds the largest market share, driven by the rising demand for advanced query processing, analytics, and data management solutions within vehicle ecosystems. Automotive software platforms are evolving to support complex SQL queries that enable real-time data aggregation, event detection, and predictive analytics. Vendors are focusing on developing lig

  16. sql-create-context-thai

    • huggingface.co
    Updated Jul 1, 2024
    Cite
    Saksorn Ruangtanusak (2024). sql-create-context-thai [Dataset]. https://huggingface.co/datasets/saksornr/sql-create-context-thai
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 1, 2024
    Authors
    Saksorn Ruangtanusak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset builds from sql-create-context.

    @misc{b-mc2_2023_sql-create-context,
      title  = {sql-create-context Dataset},
      author = {b-mc2},
      year   = {2023},
      url    = {https://huggingface.co/datasets/b-mc2/sql-create-context},
      note   = {This dataset was created by modifying data from the following sources: \cite{zhongSeq2SQL2017, yu2018spider}.},
    }

  17. Current Population Survey (CPS)

    • dataverse.harvard.edu
    • search.dataone.org
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r

    the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no.

    despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

    the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R
    download the fixed-width file containing household, family, and person records; import by separating this file into three tables, then merge 'em together at the person-level; download the fixed-width file containing the person-level replicate weights; merge the rectangular person-level file with the replicate weights, then store it in a sql database; create a new variable - one - in the data table

    2012 asec - analysis examples.R
    connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; perform a boatload of analysis examples

    replicate census estimates - 2011.R
    connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; match the sas output shown in the png file below

    2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts.

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit: the census bureau's current population survey page, the bureau of labor statistics' current population survey page, and the current population survey's wikipedia article.

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

    confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D

  18. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Explore at:
    application/x-sqlite3 (available download format)
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

    Label Data type Description

    isogramy int The order of isogramy, e.g. "2" is a second order isogram

    length int The length of the word in letters

    word text The actual word/isogram in ASCII

    source_pos text The Part of Speech tag from the original corpus

    count int Token count (total number of occurrences)

    vol_count int Volume count (number of different sources which contain the word)

    count_per_million int Token count per million words

    vol_count_as_percent int Volume count as percentage of the total number of volumes

    is_palindrome bool Whether the word is a palindrome (1) or not (0)

    is_tautonym bool Whether the word is a tautonym (1) or not (0)

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label Data type Description

    !total_1grams int The total number of words in the corpus

    !total_volumes int The total number of volumes (individual sources) in the corpus

    !total_isograms int The total number of isograms found in the corpus (before compacting)

    !total_palindromes int How many of the isograms found are palindromes

    !total_tautonyms int How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

    python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
    python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram Extraction

    After preparing the data as above, isograms can be extracted from the reformatted and tidied files by running the following command:

    python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the "create-database.sql" script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called "isograms.db".

    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
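    In addition to the included R statistics script, the database can also be queried directly from Python with the standard library's sqlite3 module. The following is only a sketch: the column names follow the layout documented above, but the table name "ngrams" is an assumption, so list the actual tables first and substitute one of them.

    import sqlite3

    con = sqlite3.connect("isograms.db")

    # Inspect which tables the database actually contains.
    tables = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    print(tables)

    # Example: the ten most frequent second-order isograms that are also
    # palindromes (replace "ngrams" with one of the table names printed above).
    query = """
        SELECT word, "length", "count", count_per_million
        FROM ngrams
        WHERE isogramy = 2 AND is_palindrome = 1
        ORDER BY "count" DESC
        LIMIT 10
    """
    for word, length, count, cpm in con.execute(query):
        print(word, length, count, cpm)

    con.close()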

  19. h

    Data from: text-to-sql

    • huggingface.co
    Updated Jan 20, 2025
    Cite
    Mir Mudasir (2025). text-to-sql [Dataset]. https://huggingface.co/datasets/Mudasir692/text-to-sql
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Jan 20, 2025
    Authors
    Mir Mudasir
    Description

    This dataset is a merged collection of multiple text-to-SQL datasets, designed to provide a comprehensive resource for training and evaluating text-to-SQL models. It combines data from several popular benchmarks, including Spider, CoSQL, SparC, and others, to create a diverse and robust dataset for natural language to SQL query generation tasks.

    Dataset Details

    Curated by: Mudasir Ahmad Mir
    Language(s) (NLP): English
    License: Apache 2.0

    This dataset is ideal for researchers… See the full description on the dataset page: https://huggingface.co/datasets/Mudasir692/text-to-sql.
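    For quick experimentation, the dataset can be pulled straight from the Hugging Face Hub with the datasets library. A minimal sketch follows; the repository id comes from the URL above, but the split and column names are not documented here, so inspect them after loading.

    from datasets import load_dataset

    # Download the merged text-to-SQL collection from the Hugging Face Hub.
    ds = load_dataset("Mudasir692/text-to-sql")

    print(ds)  # shows the available splits and their sizes
    first_split = next(iter(ds.values()))
    print(first_split.column_names)
    print(first_split[0])  # one example record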

  20. Z

    PLBD (Protein Ligand Binding Database) table description XML file

    • data.niaid.nih.gov
    Updated Dec 26, 2022
    Cite
    Lingė, Darius; Gedgaudas, Marius; Merkys, Andrius; Petrauskas, Vytautas; Vaitkus, Antanas; Grybauskas, Algirdas; Paketurytė, Vaida; Zubrienė, Asta; Zakšauskas, Audrius; Mickevičiūtė, Aurelija; Smirnovienė, Joana; Baranauskienė, Lina; Čapkauskaitė, Edita; Dudutienė, Virginija; Urniežius, Ernestas; Konovalovas, Aleksandras; Kazlauskas, Egidijus; Gražulis, Saulius; Matulis, Daumantas (2022). PLBD (Protein Ligand Binding Database) table description XML file [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7482007
    Explore at:
    Dataset updated
    Dec 26, 2022
    Dataset provided by
    Institute of Biotechnology, Life Sciences Center, Vilnius University
    Authors
    Lingė, Darius; Gedgaudas, Marius; Merkys, Andrius; Petrauskas, Vytautas; Vaitkus, Antanas; Grybauskas, Algirdas; Paketurytė, Vaida; Zubrienė, Asta; Zakšauskas, Audrius; Mickevičiūtė, Aurelija; Smirnovienė, Joana; Baranauskienė, Lina; Čapkauskaitė, Edita; Dudutienė, Virginija; Urniežius, Ernestas; Konovalovas, Aleksandras; Kazlauskas, Egidijus; Gražulis, Saulius; Matulis, Daumantas
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PLBD (Protein Ligand Binding Database) table description XML file

    General

    The provided ZIP archive contains an XML file "main-database-description.xml" with the description of all tables (VIEWS) that are exposed publicly at the PLBD server (https://plbd.org/). In the XML file, all columns of the visible tables are described, specifying their SQL types, measurement units, semantics, calculation formulae, SQL statements that can be used to generate values in these columns, and publications of the formulae derivations.

    The XML file conforms to the published XSD schema created for descriptions of relational databases for specifications of scientific measurement data. The XSD schema ("relational-database_v2.0.0-rc.18.xsd") and all included sub-schemas are provided in the same archive for convenience. All XSD schemas are validated against the "XMLSchema.xsd" schema from the W3C consortium.

    The ZIP file contains the excerpt from the files hosted at https://plbd.org/ at the moment of submission of the PLBD database to the Scientific Data journal, and is provided to conform to the journal policies. The current data and schemas should be fetched from the published URIs:

    https://plbd.org/
    https://plbd.org/doc/db/schemas
    https://plbd.org/doc/xml/schemas
    

    The software used to generate SQL schemas and RestfulDB metadata, as well as the RestfulDB middleware that allows publishing databases generated from the XML description on the Web, is available in public Subversion repositories:

    svn://www.crystallography.net/solsa-database-scripts
    svn://saulius-grazulis.lt/restfuldb
    

    Usage

    Unpacking the ZIP file will create the "db/" directory with the tree layout given below. In addition to the database description file "main-database-description.xml", all XSD schemas necessary for validation of the XML file are provided. On a GNU/Linux operating system with the GNU Make package installed, the validity of the XML file can be checked by unpacking the ZIP file, entering the unpacked directory, and running 'make distclean; make'. For example, on a Linux Mint distribution, the following commands should work:

    unzip main-database-description.zip
    cd db/release/v0.10.0/tables/
    sh -x dependencies/Linuxmint-20.1/install.sh
    make distclean
    make
    

    If necessary, additional packages can be installed using the 'install.sh' script in the 'dependencies/' subdirectory corresponding to your operating system. At the time of writing, the Debian-10 and Linuxmint-20.1 OSes are supported out of the box; similar OSes might work with the same 'install.sh' scripts. The installation scripts need to run the package installation command with system administrator privileges, but they use only the standard system package manager, so they should not put your system at risk. For validation and syntax checking, the 'rxp' and 'xmllint' programs are used.

    The log files provided in the "outputs/validation" subdirectory contain validation logs obtained on the system where the XML files were last checked and should confirm the validity of the provided XML file against the referenced schemas.
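    If GNU Make is not available, an equivalent check can be approximated directly from Python with the lxml package. This is only a sketch, under the assumption that the XSD schema sits in the "schema/" subdirectory shown in the layout below; it mirrors rather than replaces the Makefile-driven rxp/xmllint validation shipped with the archive.

    from lxml import etree

    # Parse the published XSD schema (adjust the path if the schema file
    # is located elsewhere in the unpacked tree).
    schema = etree.XMLSchema(
        etree.parse("schema/relational-database_v2.0.0-rc.18.xsd"))

    # Validate the database description file against it.
    doc = etree.parse("main-database-description.xml")
    if schema.validate(doc):
        print("main-database-description.xml is valid")
    else:
        for error in schema.error_log:
            print(error.line, error.message)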

    Layout of the archived file tree

    db/
    └── release
      └── v0.10.0
        └── tables
          ├── Makeconfig-validate-xml
          ├── Makefile
          ├── Makelocal-validate-xml
          ├── dependencies
          ├── main-database-description.xml
          ├── outputs
          └── schema
    