24 datasets found
  1. WikiSQL (Questions and SQL Queries)

    • kaggle.com
    zip
    Updated Nov 25, 2022
    Cite
    The Devastator (2022). WikiSQL (Questions and SQL Queries) [Dataset]. https://www.kaggle.com/datasets/thedevastator/dataset-for-developing-natural-language-interfac
    Explore at:
    Available download formats: zip (21491264 bytes)
    Dataset updated
    Nov 25, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    WikiSQL (Questions and SQL Queries)

    80654 hand-annotated questions and SQL queries on 24241 Wikipedia tables

    By Huggingface Hub [source]

    About this dataset

    A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.


    How to use the dataset

    This dataset can be used to develop natural language interfaces for relational databases. The data fields are the same across all splits, and each file records the phase, question, table, and SQL query for every example.

    Research Ideas

    • This dataset can be used to develop natural language interfaces for relational databases.
    • This dataset can be used to develop a knowledge base of common SQL queries.
    • This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries


    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.

    Columns

    Files: train.csv, validation.csv, test.csv (all splits share the same columns)

    | Column name | Description |
    |:------------|:------------|
    | phase | The phase of the data collection. (String) |
    | question | The question asked by the user. (String) |
    | table | The table containing the data for the question. (String) |
    | sql | The SQL query corresponding to the question. (String) |
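    A minimal sketch of reading one of these CSV splits with Python's standard csv module. The file names and the four columns (phase, question, table, sql) come from the description above; the inline sample row is invented for illustration, and the exact CSV layout of the Kaggle release should be verified against the download.

```python
import csv
import io

# Inline sample standing in for e.g. train.csv; in practice:
#   with open("train.csv", newline="") as f: rows = list(csv.DictReader(f))
sample = io.StringIO(
    "phase,question,table,sql\n"
    '1,"Who won in 2008?",table_1,"SELECT winner FROM table_1 WHERE year = 2008"\n'
)

rows = list(csv.DictReader(sample))
for row in rows:
    # Each record pairs a natural-language question with its annotated SQL query.
    print(row["question"], "->", row["sql"])
```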

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  2. Text2SQL Dataset

    • kaggle.com
    zip
    Updated Mar 30, 2025
    Cite
    Tarakanta Acharya (2025). Text2SQL Dataset [Dataset]. https://www.kaggle.com/datasets/tarakantaacharya/text2sql-dataset
    Explore at:
    Available download formats: zip (12165 bytes)
    Dataset updated
    Mar 30, 2025
    Authors
    Tarakanta Acharya
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    🔍 Overview

    This dataset is built for Text-to-SQL (NL → SQL) tasks, helping train models to convert natural language into SQL queries. It is ideal for fine-tuning LLMs, developing AI-powered database assistants, and improving SQL query generation accuracy.

    📂 Dataset Structure

    Each row contains the following fields:
    - 📝 Instruction – A natural language query (e.g., "Find all customers who placed an order in the last 30 days.")
    - 📊 Query – The corresponding SQL statement (e.g., SELECT * FROM orders WHERE order_date >= DATE_SUB(NOW(), INTERVAL 30 DAY);)
    - 🗄️ Database – Metadata such as:
      - Table Names – The relevant tables for the query (e.g., orders, customers)
      - Column Names – The specific fields used in the query (e.g., order_date, customer_id)
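    A hedged sketch of one record shaped like the fields described above (Instruction, Query, Database with table and column names) and how it maps to a fine-tuning pair. The lowercase keys and nested layout are assumptions for illustration; the actual serialization of this Kaggle release may differ.

```python
import json

# Hypothetical record mirroring the documented fields.
record = {
    "instruction": "Find all customers who placed an order in the last 30 days.",
    "query": "SELECT * FROM orders WHERE order_date >= DATE_SUB(NOW(), INTERVAL 30 DAY);",
    "database": {
        "tables": ["orders", "customers"],
        "columns": ["order_date", "customer_id"],
    },
}

# A text-to-SQL fine-tuning example is then just a (prompt, target) pair.
prompt = f"Translate to SQL: {record['instruction']}"
target = record["query"]
print(json.dumps(record, indent=2))
```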

    🚀 Use Cases

    • Fine-tuning Large Language Models (LLMs) for SQL generation
    • Training AI chatbots to assist with SQL query building
    • Developing database assistants for automated SQL generation
    • Enhancing Retrieval-Augmented Generation (RAG) for SQL-based applications
  3. SQL Query Optimization With AI Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). SQL Query Optimization With AI Market Research Report 2033 [Dataset]. https://dataintelo.com/report/sql-query-optimization-with-ai-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    SQL Query Optimization with AI Market Outlook



    According to our latest research, the SQL Query Optimization with AI market size reached USD 1.32 billion in 2024, propelled by the rapid adoption of artificial intelligence in database management and analytics. The market is projected to grow at a robust CAGR of 22.1% from 2025 to 2033, reaching a forecasted value of USD 9.85 billion by 2033. This remarkable growth is primarily driven by the increasing need for real-time data processing, the proliferation of complex data environments, and the demand for enhanced application performance across industries.




    The surge in digital transformation initiatives across various sectors is one of the most significant growth factors for the SQL Query Optimization with AI market. Enterprises are increasingly relying on data-driven decision-making, which necessitates efficient and scalable database systems. AI-powered SQL query optimization tools help organizations streamline query execution, reduce latency, and maximize resource utilization. With the explosion of big data and the adoption of cloud-based infrastructures, businesses are seeking advanced solutions that can automate query tuning, detect anomalies, and dynamically adapt to changing workloads. The integration of machine learning algorithms into SQL optimization processes is enabling predictive analytics, self-healing databases, and automated performance tuning, further fueling market expansion.




    Another key driver is the escalating complexity of enterprise data ecosystems. Organizations today manage vast volumes of structured and unstructured data from multiple sources, including IoT devices, transactional systems, and external APIs. As data environments grow more intricate, manual query optimization becomes increasingly impractical and error-prone. AI-driven SQL optimization platforms address these challenges by continuously monitoring query performance, identifying bottlenecks, and suggesting optimal execution plans. This not only improves database efficiency but also reduces the burden on database administrators, allowing them to focus on higher-value tasks. The growing adoption of hybrid and multi-cloud strategies is also contributing to the demand for intelligent query optimization solutions that ensure consistent performance across diverse environments.




    Furthermore, the rise of regulatory compliance requirements and data privacy concerns is pushing organizations to invest in advanced database management solutions. AI-powered SQL query optimization tools can help ensure data integrity, minimize risks, and maintain compliance with industry standards such as GDPR, HIPAA, and PCI DSS. By automating query auditing, access control, and anomaly detection, these solutions enhance security and transparency in data operations. The increasing emphasis on customer experience, operational agility, and cost optimization is prompting enterprises to adopt AI-enabled query optimization as a strategic differentiator, driving sustained growth in the market.




    From a regional perspective, North America currently dominates the SQL Query Optimization with AI market, accounting for the largest revenue share due to the presence of leading technology vendors, early adoption of AI, and a mature IT infrastructure. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by rapid digitalization, expanding cloud adoption, and the emergence of data-centric business models in countries like China, India, and Japan. Europe is also experiencing steady growth, fueled by stringent data protection regulations and increasing investments in AI-driven database management solutions. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, supported by government initiatives to promote digital transformation and the growing penetration of cloud services.



    Component Analysis



    The Component segment of the SQL Query Optimization with AI market is categorized into Software, Hardware, and Services. Software solutions represent the largest share of the market, as they form the backbone of AI-driven query optimization processes. These include advanced query analyzers, AI-powered database management platforms, and automated performance tuning tools that leverage machine learning algorithms to optimize SQL queries in real time. The proliferation of open-source frameworks and the integration of AI capabilities into existing database manage

  4. SQL Acceleration Engine Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 7, 2025
    Cite
    Growth Market Reports (2025). SQL Acceleration Engine Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/sql-acceleration-engine-market
    Explore at:
    Available download formats: pptx, csv, pdf
    Dataset updated
    Oct 7, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    SQL Acceleration Engine Market Outlook



    According to our latest research, the global SQL Acceleration Engine market size reached USD 2.3 billion in 2024, exhibiting robust momentum driven by the rapid digital transformation across industries. The market is set to expand at a CAGR of 16.2% from 2025 to 2033, propelling the total market value to approximately USD 9.4 billion by 2033. This remarkable growth is primarily fueled by the escalating demand for real-time analytics, the proliferation of big data, and the increasing adoption of cloud-based solutions. As per our latest research, organizations worldwide are prioritizing data-driven decision-making, thereby accelerating investments in advanced SQL acceleration engines to optimize database performance and reduce query latency.




    The primary growth factor underpinning the SQL Acceleration Engine market is the exponential increase in data generation from diverse sources such as IoT devices, social media, enterprise applications, and e-commerce platforms. Enterprises are grappling with the challenge of processing and analyzing massive volumes of structured and unstructured data efficiently. SQL acceleration engines play a pivotal role in enhancing the speed and efficiency of SQL queries, which is critical for delivering timely insights and maintaining a competitive edge. This surge in data-centric operations has compelled organizations to seek advanced solutions capable of handling complex queries and large datasets, thereby driving market expansion.




    Another significant driver is the widespread adoption of cloud computing across various sectors. Cloud-based SQL acceleration engines offer scalability, flexibility, and cost-effectiveness, enabling organizations to seamlessly manage fluctuating workloads and data volumes. The shift towards hybrid and multi-cloud environments further amplifies the need for advanced SQL acceleration solutions that can ensure high performance and low latency regardless of deployment architecture. Additionally, the integration of artificial intelligence and machine learning into SQL acceleration engines is enhancing their capabilities, allowing for automated query optimization and intelligent workload management, which further propels market growth.




    The increasing focus on real-time analytics and business intelligence is also contributing to the market’s robust growth trajectory. Modern enterprises require instant access to actionable insights to make informed decisions, streamline operations, and enhance customer experiences. SQL acceleration engines enable rapid query processing, facilitating real-time data analysis and visualization. This is particularly crucial in sectors such as BFSI, healthcare, and retail, where timely insights can significantly impact business outcomes. Furthermore, the growing emphasis on digital transformation and the adoption of advanced analytics tools are expected to sustain the demand for SQL acceleration engines in the foreseeable future.




    From a regional perspective, North America dominates the SQL Acceleration Engine market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The region’s leadership is attributed to the presence of major technology providers, early adoption of advanced database solutions, and substantial investments in cloud infrastructure. Asia Pacific, on the other hand, is witnessing the fastest growth, driven by the rapid digitization of enterprises, expanding IT sector, and increasing adoption of cloud-based analytics solutions. Meanwhile, Europe continues to demonstrate steady growth, supported by stringent data regulations and a strong focus on data-driven innovation across industries.





    Component Analysis



    The SQL Acceleration Engine market is segmented by component into software, hardware, and services, each playing a distinct role in the overall ecosystem. The software segment holds the largest share, driven by the continuous innovation in SQL query optimization algorithms and the integration of advanced analytics

  5. Specification and optimization of analytical data flows

    • resodate.org
    Updated May 27, 2016
    Cite
    Fabian Hüske (2016). Specification and optimization of analytical data flows [Dataset]. http://doi.org/10.14279/depositonce-5150
    Explore at:
    Dataset updated
    May 27, 2016
    Dataset provided by
    Technische Universität Berlin
    DepositOnce
    Authors
    Fabian Hüske
    Description

    In the past, the majority of data analysis use cases were addressed by aggregating relational data. In recent years, a trend known as "Big Data" has emerged, with several implications for the field of data analysis. Compared to previous applications, much larger data sets are analyzed using more elaborate and diverse analysis methods such as information extraction techniques, data mining algorithms, and machine learning methods. At the same time, analysis applications include data sets with little or even no structure at all. This evolution has implications for the requirements on data processing systems. Due to the growing size of data sets and the increasing computational complexity of advanced analysis methods, data must be processed in a massively parallel fashion. The large number and diversity of data analysis techniques as well as the lack of data structure motivate the use of user-defined functions and data types. Many traditional database systems are not flexible enough to satisfy these requirements. Hence, there is a need for programming abstractions to define and efficiently execute complex parallel data analysis programs that support custom user-defined operations. The success of the SQL query language has shown the advantages of declarative query specification, such as potential for optimization and ease of use. Today, most relational database management systems feature a query optimizer that compiles declarative queries into physical execution plans. Cost-based optimizers choose from billions of plan candidates the plan with the least estimated cost. However, traditional optimization techniques cannot be readily integrated into systems that aim to support novel data analysis use cases. For example, the use of user-defined functions (UDFs) can significantly limit the optimization potential of data analysis programs. Furthermore, a lack of detailed data statistics is common when large amounts of unstructured data are analyzed. This leads to imprecise optimizer cost estimates, which can cause sub-optimal plan choices.

    In this thesis we address three challenges that arise in the context of specifying and optimizing data analysis programs. First, we propose a parallel programming model with declarative properties to specify data analysis tasks as data flow programs. In this model, data processing operators are composed of a system-provided second-order function and a user-defined first-order function. A cost-based optimizer compiles data flow programs specified in this abstraction into parallel data flows. The optimizer borrows techniques from relational optimizers and ports them to the domain of general-purpose parallel programming models. Second, we propose an approach to enhance the optimization of data flow programs that include UDF operators with unknown semantics. We identify operator properties and conditions to reorder neighboring UDF operators without changing the semantics of the program. We show how to automatically extract these properties from UDF operators by leveraging static code analysis techniques. Our approach is able to emulate relational optimizations such as filter and join reordering and holistic aggregation push-down while not being limited to relational operators. Finally, we analyze the impact of changing execution conditions such as varying predicate selectivities and memory budgets on the performance of relational query plans. We identify plan patterns that cause significantly varying execution performance under changing execution conditions. Plans that include such risky patterns are prone to cause problems in the presence of imprecise optimizer estimates. Based on our findings, we introduce an approach to avoid risky plan choices. Moreover, we present a method to assess the risk of a query execution plan using a machine-learned prediction model. Experiments show that the prediction model outperforms risk predictions computed from optimizer estimates.

  6. Structured Query Language Server Transformation Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 3, 2025
    + more versions
    Cite
    Market Report Analytics (2025). Structured Query Language Server Transformation Report [Dataset]. https://www.marketreportanalytics.com/reports/structured-query-language-server-transformation-57123
    Explore at:
    Available download formats: ppt, doc, pdf
    Dataset updated
    Apr 3, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Structured Query Language (SQL) server transformation market is experiencing robust growth, projected to reach $15 million in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 9.4% from 2025 to 2033. This expansion is fueled by several key drivers. The increasing adoption of cloud-based solutions and the rise of big data analytics are pushing organizations to adopt more efficient and scalable SQL server solutions. Furthermore, the growing demand for real-time data processing and improved data integration capabilities within large enterprises and SMEs is significantly driving market growth. The market segmentation reveals strong demand across various application areas, with large enterprises leading the way due to their greater need for robust and scalable data management infrastructure. Data integration scripts remain a prominent segment, highlighting the critical need for seamless data flow across diverse systems. The competitive landscape is marked by established players like Oracle, IBM, and Microsoft, alongside emerging innovative companies specializing in cloud-based SQL server technologies. Geographic analysis suggests North America and Europe currently hold the largest market share, but significant growth potential exists in the Asia-Pacific region, driven by rapid digital transformation and economic growth in countries like India and China. The restraints on market growth are primarily related to the complexities involved in migrating existing legacy systems to new SQL server solutions, along with the need for skilled professionals to manage and optimize these systems. However, the ongoing advancements in automation tools and the increased availability of training programs are mitigating these challenges. The future trajectory of the market indicates continued growth, driven by emerging technologies such as AI-powered query optimization, enhanced security features, and the growing adoption of serverless architectures. 
This will lead to a wider adoption of SQL server transformation across various sectors, including finance, healthcare, and retail, as organizations seek to leverage data to gain competitive advantage and improve operational efficiency. The market is ripe for innovation and consolidation, with opportunities for both established players and new entrants to capitalize on this ongoing transformation.

  7. Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

    • zenodo.org
    bin, json, txt
    Updated Aug 16, 2021
    + more versions
    Cite
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
    Explore at:
    Available download formats: txt, json, bin
    Dataset updated
    Aug 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

    It contains the following files:

    - spider-realistic.json
    # The spider-realistic evaluation set
    # Examples: 508
    # Databases: 19
    - dev.json
    # The original dev split of Spider
    # Examples: 1034
    # Databases: 20
    - tables.json
    # The original DB schemas from Spider
    # Databases: 166
    - README.txt
    - license
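    A hedged sketch of loading the spider-realistic.json evaluation set listed above. The per-example fields used here (db_id, question, query) follow the standard Spider JSON format documented at https://github.com/taoyds/spider, and should be checked against the release; inline data stands in for the actual file.

```python
import json

# Inline stand-in for spider-realistic.json; in practice:
#   with open("spider-realistic.json") as f: examples = json.load(f)
raw = """[
  {"db_id": "concert_singer",
   "question": "How many singers do we have?",
   "query": "SELECT count(*) FROM singer"}
]"""

examples = json.loads(raw)
for ex in examples:
    # Each example aligns a natural-language utterance with a gold SQL query
    # against a named database schema.
    print(ex["db_id"], "|", ex["question"], "->", ex["query"])
```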

    The Spider-Realistic dataset is created from the dev split of the Spider dataset released by Yu, Tao, et al., "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task." It is a subset of the original dataset with explicit mentions of the column names removed; the SQL queries and databases are kept unchanged.
    For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
    For the database files please refer to the official Spider release https://yale-lily.github.io/spider.

    This dataset is distributed under the CC BY-SA 4.0 license.

    If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al., 2018, and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

    @article{deng2020structure,
    title={Structure-Grounded Pretraining for Text-to-SQL},
    author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
    journal={arXiv preprint arXiv:2010.12773},
    year={2020}
    }

    @inproceedings{Yu&al.18c,
    year = 2018,
    title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
    booktitle = {EMNLP},
    author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
    }

    @InProceedings{P18-1033,
    author = "Finegan-Dollak, Catherine
    and Kummerfeld, Jonathan K.
    and Zhang, Li
    and Ramanathan, Karthik
    and Sadasivam, Sesh
    and Zhang, Rui
    and Radev, Dragomir",
    title = "Improving Text-to-SQL Evaluation Methodology",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "351--360",
    location = "Melbourne, Australia",
    url = "http://aclweb.org/anthology/P18-1033"
    }

    @InProceedings{data-sql-imdb-yelp,
    dataset = {IMDB and Yelp},
    author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
    title = {SQLizer: Query Synthesis from Natural Language},
    booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
    month = {October},
    year = {2017},
    pages = {63:1--63:26},
    url = {http://doi.org/10.1145/3133887},
    }

    @article{data-academic,
    dataset = {Academic},
    author = {Fei Li and H. V. Jagadish},
    title = {Constructing an Interactive Natural Language Interface for Relational Databases},
    journal = {Proceedings of the VLDB Endowment},
    volume = {8},
    number = {1},
    month = {September},
    year = {2014},
    pages = {73--84},
    url = {http://dx.doi.org/10.14778/2735461.2735468},
    }

    @InProceedings{data-atis-geography-scholar,
    dataset = {Scholar, and Updated ATIS and Geography},
    author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
    title = {Learning a Neural Semantic Parser from User Feedback},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2017},
    pages = {963--973},
    location = {Vancouver, Canada},
    url = {http://www.aclweb.org/anthology/P17-1089},
    }

    @inproceedings{data-geography-original,
    dataset = {Geography, original},
    author = {John M. Zelle and Raymond J. Mooney},
    title = {Learning to Parse Database Queries Using Inductive Logic Programming},
    booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
    year = {1996},
    pages = {1050--1055},
    location = {Portland, Oregon},
    url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
    }

    @inproceedings{data-restaurants-logic,
    author = {Lappoon R. Tang and Raymond J. Mooney},
    title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
    booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
    year = {2000},
    pages = {133--141},
    location = {Hong Kong, China},
    url = {http://www.aclweb.org/anthology/W00-1317},
    }

    @inproceedings{data-restaurants-original,
    author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
    title = {Towards a Theory of Natural Language Interfaces to Databases},
    booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
    year = {2003},
    location = {Miami, Florida, USA},
    pages = {149--157},
    url = {http://doi.acm.org/10.1145/604045.604070},
    }

    @inproceedings{data-restaurants,
    author = {Alessandra Giordani and Alessandro Moschitti},
    title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
    booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
    year = {2012},
    location = {Montpellier, France},
    pages = {59--76},
    url = {https://doi.org/10.1007/978-3-642-45260-4_5},
    }

  8. SQL Performance Tuning Tools Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    + more versions
    Cite
    Dataintelo (2025). SQL Performance Tuning Tools Market Research Report 2033 [Dataset]. https://dataintelo.com/report/sql-performance-tuning-tools-market
    Explore at:
    Available download formats: csv, pdf, pptx
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    SQL Performance Tuning Tools Market Outlook



    According to our latest research, the global SQL Performance Tuning Tools market size reached USD 1.42 billion in 2024, exhibiting robust expansion driven by the surging need for optimized database management and real-time analytics across enterprises. The market is poised to grow at a CAGR of 9.7% from 2025 to 2033, with the forecasted value expected to hit USD 3.27 billion by 2033. This growth is primarily attributed to the increasing complexity of database environments, the proliferation of data-driven applications, and the urgent demand for high availability and efficiency in mission-critical business operations. As organizations continue to digitize and scale their infrastructure, SQL performance tuning tools are becoming indispensable for ensuring seamless data processing and superior user experiences.




    A significant growth factor for the SQL Performance Tuning Tools market is the exponential increase in data volumes generated by organizations worldwide. Enterprises are embracing digital transformation initiatives, leading to a surge in transactional and analytical workloads that demand high-performing databases. SQL performance tuning tools play a pivotal role in identifying, diagnosing, and resolving performance bottlenecks within SQL queries and database configurations. With the adoption of advanced analytics, artificial intelligence, and machine learning, organizations are generating and processing more data than ever before, necessitating robust tools to ensure optimal database performance. This trend is particularly pronounced in sectors such as BFSI, healthcare, and e-commerce, where data-driven decision-making and real-time insights are critical for competitive advantage.




    Another key driver is the growing complexity of IT environments, particularly with the rise of hybrid and multi-cloud deployments. As enterprises migrate workloads to cloud platforms and integrate on-premises systems with cloud-based solutions, managing and tuning SQL databases becomes increasingly challenging. SQL performance tuning tools enable IT teams to monitor and optimize database performance across diverse and distributed environments, ensuring consistency, reliability, and scalability. These tools offer advanced features such as automated query optimization, real-time monitoring, and predictive analytics, which are essential for maintaining service-level agreements (SLAs) and minimizing downtime. The increasing reliance on cloud infrastructure, coupled with the need for agile and resilient database management, is expected to further propel market growth.




    The expanding ecosystem of database technologies and the proliferation of open-source SQL databases are also fueling demand for performance tuning solutions. Organizations are adopting a wide range of relational and non-relational databases to support diverse workloads, leading to greater heterogeneity in database environments. This diversity introduces new challenges in performance management, as traditional tuning methods may not be effective across different platforms. SQL performance tuning tools are evolving to support a broad spectrum of database engines, providing unified visibility and optimization capabilities. As businesses strive to deliver high-quality digital experiences and minimize operational costs, the adoption of advanced tuning tools is becoming a strategic imperative.




    From a regional perspective, North America continues to dominate the SQL Performance Tuning Tools market, accounting for the largest share in 2024. This leadership is driven by the presence of major technology vendors, a mature IT infrastructure, and early adoption of advanced database management solutions. Europe and Asia Pacific are also witnessing rapid growth, fueled by increasing investments in digital transformation, expanding IT services sectors, and the rise of cloud computing. The Asia Pacific region, in particular, is expected to exhibit the highest CAGR during the forecast period, supported by the proliferation of SMEs, growing e-commerce activities, and government initiatives to promote digital innovation. Meanwhile, Latin America and the Middle East & Africa are emerging as promising markets, albeit at a relatively nascent stage, as organizations in these regions modernize their IT landscapes and embrace data-driven strategies.



    Component Analysis



    The SQL Performance Tuning Tools market by component is broadly segmented into software and services.

  9. Wikipedia SQLITE Portable DB, Huge 5M+ Rows

    • kaggle.com
    zip
    Updated Jun 29, 2024
    Cite
    christernyc (2024). Wikipedia SQLITE Portable DB, Huge 5M+ Rows [Dataset]. https://www.kaggle.com/datasets/christernyc/wikipedia-sqlite-portable-db-huge-5m-rows/code
    Explore at:
    zip (6064169983 bytes)
    Available download formats
    Dataset updated
    Jun 29, 2024
    Authors
    christernyc
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.

    I am not affiliated or partnered with Kensho in any way; I just really like this dataset because it gives my agents something easy to query.

    Key Features:

    • Contains over 5 million rows of data from English Wikipedia and Wikidata
    • Stored in a portable SQLite database format for easy integration and querying
    • Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base
    • Ideal for NLP tasks, machine learning, data analysis, and research projects

    The database consists of four main tables:

    • items: Contains information about Wikipedia items, including labels and descriptions
    • properties: Stores details about Wikidata properties, such as labels and descriptions
    • pages: Provides metadata for Wikipedia pages, including page IDs, item IDs, titles, and view counts
    • link_annotated_text: Contains the link-annotated text of Wikipedia pages, divided into sections

    This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license.

    Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes.

    By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.

    https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data

    Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings

    Usage with LIKE queries:

```python
import asyncio

import aiosqlite


class KenshoDatasetQuery:
    def __init__(self, db_file):
        self.db_file = db_file

    async def __aenter__(self):
        self.conn = await aiosqlite.connect(self.db_file)
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.conn.close()

    async def search_pages_by_title(self, title):
        query = """
        SELECT pages.page_id, pages.item_id, pages.title, pages.views,
               items.labels AS item_labels, items.description AS item_description,
               link_annotated_text.sections
        FROM pages
        JOIN items ON pages.item_id = items.id
        JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id
        WHERE pages.title LIKE ?
        """
        async with self.conn.execute(query, (f"%{title}%",)) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label_or_description(self, keyword):
        query = """
        SELECT id, labels, description
        FROM items
        WHERE labels LIKE ? OR description LIKE ?
        """
        async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label(self, label):
        query = """
        SELECT id, labels, description
        FROM items
        WHERE labels LIKE ?
        """
        async with self.conn.execute(query, (f"%{label}%",)) as cursor:
            return await cursor.fetchall()

    # The original listing is truncated here; this last method is reconstructed
    # following the same pattern as search_items_by_label_or_description.
    async def search_properties_by_label_or_description(self, keyword):
        query = """
        SELECT id, labels, description
        FROM properties
        WHERE labels LIKE ? OR description LIKE ?
        """
        async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor:
            return await cursor.fetchall()
```
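
    For quick experimentation without the async wrapper, the same LIKE-based queries can be run with the standard-library sqlite3 module. A minimal sketch against a tiny in-memory stand-in for the real database (the sample rows are invented; table and column names follow the description above):

```python
import sqlite3

# Build a tiny in-memory stand-in mimicking the pages/items tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, labels TEXT, description TEXT);
CREATE TABLE pages (page_id INTEGER PRIMARY KEY, item_id INTEGER,
                    title TEXT, views INTEGER);
INSERT INTO items VALUES (1, 'Python (programming language)', 'general-purpose language');
INSERT INTO pages VALUES (42, 1, 'Python (programming language)', 123456);
""")

# Same LIKE-based title search as the async helper above.
rows = conn.execute("""
    SELECT pages.page_id, pages.title, items.description
    FROM pages JOIN items ON pages.item_id = items.id
    WHERE pages.title LIKE ?
""", ("%Python%",)).fetchall()
print(rows)  # [(42, 'Python (programming language)', 'general-purpose language')]
```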
    
  10. SQL Query Engine Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 22, 2025
    Cite
    Growth Market Reports (2025). SQL Query Engine Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/sql-query-engine-market
    Explore at:
    csv, pptx, pdf
    Available download formats
    Dataset updated
    Aug 22, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    SQL Query Engine Market Outlook



    According to our latest research, the global SQL Query Engine market size in 2024 stands at USD 3.84 billion, reflecting robust growth driven by the increasing demand for efficient data management and analytics solutions across industries. The market is projected to expand at a CAGR of 12.1% from 2025 to 2033, reaching an estimated value of USD 10.77 billion by the end of the forecast period. This remarkable growth is underpinned by the escalating volume of structured and unstructured data, the proliferation of cloud-based applications, and the widespread adoption of advanced analytics and business intelligence tools.



    One of the primary growth factors driving the SQL Query Engine market is the exponential increase in data generation from digital transformation initiatives, IoT devices, and enterprise applications. Organizations are increasingly relying on SQL query engines to extract actionable insights from vast datasets, enabling informed decision-making and operational efficiency. The integration of SQL engines with big data platforms and cloud environments further amplifies their utility, as businesses seek scalable and high-performance solutions that can seamlessly handle complex queries across distributed data sources. This trend is particularly pronounced in industries such as BFSI, healthcare, and retail, where real-time data analysis is critical for competitive advantage and regulatory compliance.



    Another significant driver is the rapid evolution of cloud computing and the migration of enterprise workloads to cloud platforms. Cloud-based SQL query engines offer flexibility, scalability, and cost-effectiveness, making them highly attractive to organizations looking to modernize their IT infrastructure. The ability to run SQL queries on cloud-native data warehouses and integrate with various analytics tools has democratized access to advanced data capabilities, even for small and medium enterprises. Furthermore, innovations in query optimization, parallel processing, and support for hybrid and multi-cloud deployments are fostering greater adoption of SQL query engines across diverse business environments.



    The market is also benefiting from the growing emphasis on business intelligence and data-driven decision-making. Enterprises are leveraging SQL query engines to power dashboards, generate real-time reports, and facilitate self-service analytics for non-technical users. Enhanced support for structured query language, improved user interfaces, and integration with visualization tools are making it easier for business users to interact with data, driving broader usage across organizations. Additionally, the rise of data integration and analytics as core business functions is pushing vendors to continuously innovate, offering advanced features such as in-memory processing, machine learning integration, and support for semi-structured data formats.



    Regionally, North America continues to dominate the SQL Query Engine market, accounting for the largest revenue share in 2024. This is attributed to the strong presence of technology giants, early adoption of cloud technologies, and a thriving ecosystem of data-driven enterprises. However, Asia Pacific is expected to exhibit the fastest growth during the forecast period, fueled by rapid digitalization, increasing investments in cloud infrastructure, and the emergence of new business models in countries such as China, India, and Japan. Europe, Latin America, and the Middle East & Africa are also witnessing steady growth, supported by regulatory mandates for data governance and the rising importance of analytics in public and private sectors.





    Component Analysis



    The SQL Query Engine market is segmented by component into Software and Services. The software segment commands a substantial share of the market, as enterprises increasingly invest in advanced query engines to enhance their data processing and analytics capabilities. Modern SQL query engine software offers robust features such as distributed query processing.

  11. synthetic_text_to_sql

    • huggingface.co
    Cite
    Gretel.ai, synthetic_text_to_sql [Dataset]. https://huggingface.co/datasets/gretelai/synthetic_text_to_sql
    Explore at:
    Croissant
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset provided by
    Gretel.ai
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Image generated by DALL-E. See prompt for more details

      synthetic_text_to_sql
    

    gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:

    105,851 records partitioned into 100,000 train and 5,851 test records ~23M total tokens, including ~12M SQL tokens Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.

  12. Data from: The Software Heritage License Dataset (2022 Edition)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 10, 2024
    Cite
    Jesus M. Gonzalez-Barahona; Sergio Montes-Leon; Gregorio Robles; Stefano Zacchiroli (2024). The Software Heritage License Dataset (2022 Edition) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8200351
    Explore at:
    Dataset updated
    Jan 10, 2024
    Dataset provided by
    LTCI, Télécom Paris, Institut Polytechnique de Paris, Paris, France
    Universidad Rey Juan Carlos, Madrid, Spain
    Authors
    Jesus M. Gonzalez-Barahona; Sergio Montes-Leon; Gregorio Robles; Stefano Zacchiroli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all “license files” extracted from a snapshot of the Software Heritage archive taken on 2022-04-25. (Other, possibly more recent, versions of the datasets can be found at https://annex.softwareheritage.org/public/dataset/license-blobs/).

    In this context, a license file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage as a file whose name is often used to ship licenses in software projects. Some name examples are: COPYING, LICENSE, NOTICE, COPYRIGHT, etc. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that the file name was not expected to be at the project root, because project subdirectories can contain different licenses than the top-level one, and we wanted to include those too.

    Format

    The dataset is organized as follows:

    blobs.tar.zst: a Zst-compressed tarball containing deduplicated license blobs, one per file. The tarball contains 6’859’189 blobs, for a total uncompressed size on disk of 66 GiB.

    The blobs are organized in a sharded directory structure that contains files named like blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02, where:

    blobs/ is the root directory containing all license blobs

    8624bcdae55baeef00cd11d5dfcfa60f68710a02 is the SHA1 checksum of a specific license blob, a copy of the GPL3 license in this case. Each license blob is ultimately named with its SHA1:

    $ head -n 3 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
    GNU GENERAL PUBLIC LICENSE
    Version 3, 29 June 2007

    $ sha1sum blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
    8624bcdae55baeef00cd11d5dfcfa60f68710a02  blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02

    86 and 24 are, respectively, the first and second group of two hex digits in the blob SHA1
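
    The sharded layout described above is straightforward to reproduce; a minimal sketch (function names are ours):

```python
import hashlib

def blob_path(sha1_hex: str) -> str:
    # First and second pairs of hex digits become the two directory shards.
    return f"blobs/{sha1_hex[:2]}/{sha1_hex[2:4]}/{sha1_hex}"

def named_after_content(content: bytes, sha1_hex: str) -> bool:
    # Each blob file is named with the SHA1 checksum of its own bytes.
    return hashlib.sha1(content).hexdigest() == sha1_hex

print(blob_path("8624bcdae55baeef00cd11d5dfcfa60f68710a02"))
# blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
```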

    One blob is missing because its size (313 MB) prevented its inclusion (it was originally a tarball containing source code):

    swh:1:cnt:61bf63793c2ee178733b39f8456a796b72dc8bde,1340d4e2da173c92d432026ecdc54b4859fe9911,"AUTHORS"

    blobs-sample20k.tar.zst: analogous to blobs.tar.zst, but containing “only” 20’000 randomly selected license blobs

    license-blobs.csv.zst a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a license blob and is in the format SWHID,SHA1,NAME, for example:

    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"
    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"
    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"

    where:

    SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the license blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2

    SHA1: the blob SHA1, that can be used to cross-reference blobs in the blobs/ directory

    NAME: a file name given to the license blob in a given software origin. As the same license blob can have different names in different contexts, the index contains multiple entries for the same blob with different names, as is the case in the example above (yes, one of those has a typo in it, but it is an original typo from some repository!).
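
    A hedged sketch of reading the (decompressed) index with the standard csv module; the quoted NAME field is handled automatically, and the header names are an assumption based on the field order described above:

```python
import csv
import io

# Two sample data lines in the SWHID,SHA1,NAME format shown above,
# preceded by a header line (header names assumed for illustration).
sample = io.StringIO(
    'SWHID,SHA1,NAME\n'
    'swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,'
    '8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"\n'
    'swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,'
    '8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"\n'
)
reader = csv.DictReader(sample)
names = [row["NAME"] for row in reader]  # quotes are stripped by the csv parser
print(names)  # ['COPYING', 'COPYING.GPL3']
```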

    blobs-fileinfo.csv.zst a Zst-compressed CSV mapping from blobs to basic file information in the format: SHA1,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE, where:

    SHA1: blob SHA1

    MIME_TYPE: blob MIME type, as detected by libmagic

    ENCODING: blob character encoding, as detected by libmagic

    LINE_COUNT: number of lines in the blob (only for textual blobs with UTF8 encoding)

    WORD_COUNT: number of words in the blob (only for textual blobs with UTF8 encoding)

    SIZE: blob size in bytes

    blobs-scancode.csv.zst a Zst-compressed CSV mapping from blobs to software license detected in them by ScanCode, in the format: SHA1,LICENSE,SCORE, where:

    SHA1: blob SHA1

    LICENSE: license detected in the blob, as an SPDX identifier (or ScanCode identifier for non-SPDX-indexed licenses)

    SCORE: confidence score in the result, as a decimal number between 0 and 100

    There may be zero or arbitrarily many lines for each blob.

    blobs-scancode.ndjson.zst a Zst-compressed line-delimited JSON, containing a superset of the information in blobs-scancode.csv.zst. Each line is a JSON dictionary with three keys:

    sha1: blob SHA1

    licenses: output of scancode.api.get_licenses(..., min_score=0)

    copyrights: output of scancode.api.get_copyrights(...)

    There is exactly one line for each blob. licenses and copyrights keys are omitted for files not detected as plain text.

    blobs-origins.csv.zst a Zst-compressed CSV mapping of where license blobs come from. Each line in the index associates a license blob to one of its origins in the format SWHID URL, for example:

    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 https://github.com/pombreda/Artemis

    Note that a license blob can come from many different places, only an arbitrary (and somewhat random) one is listed in this mapping.

    If no origin URL is found in the Software Heritage archive, a blank is used instead. This happens when the blob's origins were still being loaded when the dataset was generated, or when the loader process crashed before completing their ingestion.

    blobs-nb-origins.csv.zst a Zst-compressed CSV mapping of how many origins of each blob are known to Software Heritage. Each line in the index associates a license blob to this count in the format SWHID NUMBER, for example:

    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 2822260

    Two blobs are missing because the computation crashes:

    swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 swh:1:cnt:8b137891791fe96927ad78e64b0aad7bded08bdc

    This issue will be fixed in a future version of the dataset

    blobs-earliest.csv.zst a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurrence(s) in the archive. Format: SWHID EARLIEST_SWHID EARLIEST_TS OCCURRENCES, where:

    SWHID: blob SWHID

    EARLIEST_SWHID: SWHID of the earliest known commit containing the blob

    EARLIEST_TS: timestamp of the earliest known commit containing the blob, as a Unix time integer

    OCCURRENCES: number of known commits containing the blob

    replication-package.tar.gz: code and scripts used to produce the dataset

    licenses-annotated-sample.tar.gz: ground truth, i.e., manually annotated random sample of license blobs, with details about the kind of information they contain.

    Changes since the 2021-03-23 dataset

    More input data, due to the SWH archive growing: more origins in supported forges and package managers; and support for more forges and package managers. See the SWH Archive Changelog for details.

    Values in the NAME column of license-blobs.csv.zst are quoted, as some file names now contain commas.

    Replication package now contains all the steps needed to reproduce all artefacts including the licenseblobs/fetch.py script.

    blobs-nb-origins.csv.zst is added.

    blobs-origins.csv.zst is now generated using the first origin returned by swh-graph’s leaves endpoint, instead of its randomwalk endpoint. This should have no impact on the result, other than a different distribution of “random” origins being picked.

    blobs-origins.csv.zst was missing ~10% of its results in previous versions of the dataset, due to errors and/or timeouts in its generation, this is now down to 0.02% (1254 of the 6859445 unique blobs). Blobs with no known origins are now present, with a blank instead of URL.

    blobs-earliest.csv.zst was missing ~10% of its results in previous versions of the dataset. It is complete now.

    blobs-scancode.csv.zst is generated with a newer scancode-toolkit version (31.2.1)

    blobs-scancode.ndjson.zst is added.

    Errata

    A file name .tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 was present in the initial version of the dataset (published on 2022-11-07). It was removed on 2022-11-09 using these two commands:

    pv blobs-fileinfo.csv.zst | zstdcat | grep -v ".tmp" | zstd -19
    pv blobs.tar.zst | zstdcat | tar --delete blobs/13/40/.tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 | zstd -19 -T12

    The total uncompressed size was announced as 84 GiB based on the physical size on ext4, but it is actually 66 GiB.

    Citation

    If you use this dataset for research purposes, please acknowledge its use by citing one or both of the following papers:

    [pdf, bib] Jesús M. González-Barahona, Sergio Raúl Montes León, Gregorio Robles, Stefano Zacchiroli. The software heritage license dataset (2022 edition). Empirical Software Engineering, Volume 28, Number 6, Article number 147 (2023).

    [pdf, bib] Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.

    References

    The dataset has been built using primarily the data sources described in the following papers:

    [pdf, bib] Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 25-29 September 2017.

    [pdf, bib] Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019.

    Errata (v2, 2024-01-09)

    licenses-annotated-sample.tar.gz: some comments not intended for publication were removed, and 4

  13. Portuguese Text2SQL database

    • kaggle.com
    Updated Jun 19, 2024
    Cite
    Eduardo M. de Morais (2024). Portuguese Text2SQL database [Dataset]. https://www.kaggle.com/datasets/emdemor/portuguese-text2sql-database
    Explore at:
    Croissant
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 19, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Eduardo M. de Morais
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset is a Portuguese-translated version of the b-mc2/sql-create-context dataset, which was constructed from the WikiSQL and Spider datasets. It contains examples of questions in Portuguese, SQL CREATE TABLE statements, and SQL queries that answer the questions using the CREATE TABLE statement as context.

    The main goal of this dataset is to assist Portuguese natural language models in generating precise and contextualized SQL queries, preventing the hallucination of column and table names, a common issue in text-to-SQL datasets. By providing only the CREATE TABLE statement as context, the dataset aims to better ground the models without the need to provide actual data rows, limiting token use and exposure to private, sensitive, or proprietary data.

    Dataset Details

    • Total Examples: 78,577
    • Columns:
      • pergunta: The question in natural language.
      • contexto: The SQL CREATE TABLE statement that provides the necessary context to answer the question.
      • resposta: The SQL query that answers the question using the provided context.
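
    A hedged sketch of what one record looks like and how it might be turned into a training prompt. The example row is invented for illustration; only the column names (pergunta, contexto, resposta) come from the description above:

```python
# Hypothetical example row, using the column names described above.
exemplo = {
    "pergunta": "Quantos clientes existem?",
    "contexto": "CREATE TABLE clientes (id INTEGER, nome TEXT)",
    "resposta": "SELECT COUNT(*) FROM clientes",
}

def to_prompt(row: dict) -> str:
    # Simple prompt template pairing the CREATE TABLE context with the question.
    return f"-- Contexto:\n{row['contexto']}\n-- Pergunta: {row['pergunta']}\n-- SQL:"

print(to_prompt(exemplo))
```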

    Translation Process

    The questions were translated into Portuguese using the facebook/nllb-200-distilled-1.3B model, ensuring that the natural language queries maintain the same meaning and context as the original English questions.

    Objective and Applications

    This dataset is ideal for training natural language models for SQL query generation, especially in scenarios where accuracy in naming columns and tables is crucial. It can be used to enhance model performance in text-to-SQL tasks, providing clear context and avoiding common hallucination errors.

    Original Projects

    @misc{b-mc2_2023_sql-create-context,
    title = {sql-create-context Dataset},
    author = {b-mc2},
    year = {2023},
    url = {https://huggingface.co/datasets/b-mc2/sql-create-context},
    note = {This dataset was created by modifying data from the following sources: \cite{zhongSeq2SQL2017, yu2018spider}.},
    }
    
    @article{zhongSeq2SQL2017,
    author = {Victor Zhong and Caiming Xiong and Richard Socher},
    title = {Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning},
    journal = {CoRR},
    volume = {abs/1709.00103},
    year = {2017}
    }
    
    @article{yu2018spider,
    title = {Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task},
    author = {Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and others},
    journal = {arXiv preprint arXiv:1809.08887},
    year = {2018}
    }
    
  14. Google Ads Transparency Center

    • console.cloud.google.com
    Updated Sep 6, 2023
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Data&hl=de (2023). Google Ads Transparency Center [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-data/google-ads-transparency-center?hl=de
    Explore at:
    Dataset updated
    Sep 6, 2023
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Google (http://google.com/)
    Description

    This dataset contains two tables: creative_stats and removed_creative_stats. The creative_stats table contains information about advertisers that served ads in the European Economic Area or Turkey: their legal name, verification status, disclosed name, and location. It also includes ad-specific information: impression ranges per region (including aggregate impressions for the European Economic Area), first shown and last shown dates, which criteria were used in audience selection, the format of the ad, the ad topic, and whether the ad is funded by the Google Ad Grants program. A link to the ad in the Google Ads Transparency Center is also provided. The removed_creative_stats table contains information about ads served in the European Economic Area that Google removed: where and why they were removed, and per-region information on when they served. It also contains a link to the Google Ads Transparency Center for the removed ad. Data for both tables updates periodically and may be delayed from what appears on the Google Ads Transparency Center website.

    About BigQuery: This data is hosted in Google BigQuery for users to easily query using SQL. Note that to use BigQuery, users must have a Google account and create a GCP project. This public dataset is included in BigQuery's 1 TB/mo of free-tier processing: each user receives 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset.

    Download Dataset: This public dataset is also hosted in Google Cloud Storage and available free to use. The raw data is provided in JSON format, sharded across multiple files to support easier download of the large dataset. A README file which describes the data structure and our Terms of Service (also listed below) is included with the dataset. You can also download the results from a custom query. Signed-out users can download the full dataset by using the gcloud CLI. To remove the login requirement, run "$ gcloud config set auth/disable_credentials True". To download the dataset, run "$ gcloud storage cp gs://ads-transparency-center/* . -R".

  15. Managed Presto Services Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Dataintelo (2025). Managed Presto Services Market Research Report 2033 [Dataset]. https://dataintelo.com/report/managed-presto-services-market
    Explore at:
    csv, pdf, pptxAvailable download formats
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Managed Presto Services Market Outlook



    According to our latest research, the global managed Presto services market size reached USD 1.37 billion in 2024, reflecting strong demand for scalable, high-performance data analytics solutions across industries. The market is expected to witness a robust compound annual growth rate (CAGR) of 19.2% from 2025 to 2033, driven by the increasing adoption of cloud-based analytics platforms, growing volumes of enterprise data, and the need for real-time business insights. By 2033, the managed Presto services market is projected to reach USD 5.94 billion, underscoring a transformative shift in how organizations leverage open-source query engines for big data analytics and business intelligence.
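    Projections of this kind follow the standard compound-growth relationship, end value = start value × (1 + CAGR)^years. A minimal sketch of that arithmetic (the `project` helper is illustrative, not taken from the report):

```python
def project(start_value: float, cagr: float, years: int) -> float:
    """Project a value forward under a constant compound annual growth rate."""
    return start_value * (1.0 + cagr) ** years

# Example: 100 growing at 10% per year for 2 years -> 121.0
print(round(project(100.0, 0.10, 2), 2))
```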




    Several key growth factors are propelling the managed Presto services market forward. The exponential rise in data generation, fueled by digital transformation, IoT proliferation, and the adoption of advanced analytics, has compelled organizations to seek more efficient, scalable, and cost-effective data processing solutions. Presto, as an open-source distributed SQL query engine, is increasingly favored for its ability to perform fast, interactive analytics on large datasets across diverse data sources. Managed Presto services further enhance this value proposition by providing enterprises with fully managed, optimized, and secure environments, reducing the operational burden on IT teams and accelerating time-to-insight. This shift is particularly pronounced among organizations lacking in-house expertise or resources to manage complex data infrastructure, making managed Presto services an attractive alternative.




    Another significant driver is the growing demand for cloud-native analytics solutions. As businesses migrate their data and analytics workloads to the cloud, managed Presto services offer seamless integration with major cloud platforms, ensuring high availability, scalability, and flexibility. The cloud deployment model enables organizations to dynamically scale resources based on demand, optimize costs, and benefit from continuous updates and security enhancements provided by managed service providers. This trend is further amplified by the increasing adoption of hybrid and multi-cloud strategies, as enterprises seek to avoid vendor lock-in and maintain agility in their data operations. The synergy between Presto's federated query capabilities and the cloud's elastic infrastructure is creating new opportunities for innovation and data-driven decision-making.




    The managed Presto services market also benefits from the rising importance of real-time analytics and business intelligence in driving competitive advantage. Organizations across industries are leveraging Presto's ability to query data where it resides, whether in data lakes, warehouses, or external sources, to derive actionable insights with minimal latency. Managed services providers are enhancing their offerings with advanced features such as automated scaling, intelligent workload management, integrated security, and comprehensive monitoring, further increasing the appeal of Presto-based solutions. These advancements are enabling enterprises to unlock the full potential of their data assets, improve operational efficiency, and respond swiftly to changing market dynamics.




    From a regional perspective, North America currently dominates the managed Presto services market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The region's leadership is attributed to the early adoption of advanced analytics technologies, a mature cloud ecosystem, and the presence of major technology vendors and hyperscale cloud providers. Europe is witnessing steady growth, driven by increasing investments in digital transformation and stringent data privacy regulations, while Asia Pacific is emerging as a high-growth market due to rapid digitalization, expanding IT infrastructure, and the proliferation of data-driven business models. Latin America and the Middle East & Africa are also expected to register notable growth rates over the forecast period, supported by rising awareness of big data analytics and government-led digital initiatives.



    Component Analysis



    The managed Presto services market is segmented by component into software and services, each playing a pivotal role in the market’s overall value proposition. The software segment encompasses the core Presto query engine, manage

  16. Long-Term Agricultural Research (LTAR) network - Meteorological Collection

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    + more versions
    Agricultural Research Service (2025). Long-Term Agricultural Research (LTAR) network - Meteorological Collection [Dataset]. https://catalog.data.gov/dataset/long-term-agricultural-research-ltar-network-meteorological-collection-7d719
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Servicehttps://www.ars.usda.gov/
    Description

    The LTAR network maintains stations for standard meteorological measurements including, generally, air temperature and humidity, shortwave (solar) irradiance, longwave (thermal) radiation, wind speed and direction, barometric pressure, and precipitation. Many sites also have extensive comparable legacy datasets. The LTAR scientific community decided that these needed to be made available to the public from a single web source in a consistent manner. To that purpose, each site sent data on a regular schedule, as frequently as hourly, to the National Agricultural Library, which has developed a web service to provide the data to the public in tabular or graphical form. This archive of the LTAR legacy database exports contains meteorological data through April 30, 2021. For current meteorological data, visit the GeoEvent Meteorology Resources page, which provides tools and dashboards to view and access data from the 18 LTAR sites across the United States.

    Resources in this dataset:

    Resource Title: Meteorological data. File Name: ltar_archive_DB.zip

    Resource Description: This is an export of the meteorological data collected by LTAR sites and ingested by the NAL LTAR application. The export consists of an SQL schema definition file for creating database tables and the data itself. The data is provided in two formats: SQL insert statements (.sql) and CSV files (.csv). Please use the format most convenient for you. Note that the SQL insert statements take much longer to run, since each row is an individual insert.

    Description of zip files: The ltar_archive*.zip files contain database exports. The schema is a .sql file; the data is exported as both SQL inserts and CSV for convenience. A README in Markdown and PDF is included in the zips.

    Database export of the schema and data for the site, site_station, and met tables as SQL insert statements:
    ltar_archive_db_sql_export_20201231.zip --> has data until 2020-12-31
    ltar_archive_db_sql_export_20210430.zip --> has data until 2021-04-30

    Database export of the schema and data for the site, site_station, and met tables as CSV:
    ltar_archive_db_csv_export_20201231.zip --> has data until 2020-12-31
    ltar_archive_db_csv_export_20210430.zip --> has data until 2021-04-30

    Raw CSV files that were sent to NAL from the LTAR sites/stations:
    ltar_rawcsv_archive.zip --> has data until 2021-04-30
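    The note above that per-row SQL inserts run slowly can be seen with a small SQLite sketch. The `met` table and column names here are hypothetical placeholders; the real schema ships in the .sql export:

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE met (site TEXT, obs_time TEXT, air_temp REAL)")

# Style 1: one INSERT statement per row, as in the .sql export.
conn.execute("INSERT INTO met VALUES ('SITE1', '2020-01-01T00:00:00', 4.2)")

# Style 2: bulk-load rows parsed from the CSV export in a single executemany()
# call -- far fewer statement round trips, which is why the CSV format loads faster.
csv_data = io.StringIO(
    "site,obs_time,air_temp\n"
    "SITE1,2020-01-01T01:00:00,3.9\n"
    "SITE1,2020-01-01T02:00:00,3.6\n"
)
rows = list(csv.DictReader(csv_data))
conn.executemany("INSERT INTO met VALUES (:site, :obs_time, :air_temp)", rows)

print(conn.execute("SELECT COUNT(*) FROM met").fetchone()[0])  # 3
```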

  17. Data from: WikiReddit: Tracing Information and Attention Flows Between...

    • zenodo.org
    bin
    Updated May 4, 2025
    Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
    Explore at:
    binAvailable download formats
    Dataset updated
    May 4, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
    Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking, which subsequently influences Wikipedia content by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

    WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
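    A minimal sketch of the two steps described above, URL extraction via regex and SHA-256 anonymization of Reddit IDs. The pattern and sample identifier are illustrative; the actual pipeline's regex is more involved:

```python
import hashlib
import re

# Illustrative pattern for Wikipedia article URLs across language subdomains.
WIKI_URL = re.compile(r"https?://([a-z\-]+)\.wikipedia\.org/wiki/([^\s)\]]+)")

text = "See https://en.wikipedia.org/wiki/Reddit for background."
match = WIKI_URL.search(text)
lang, title = match.group(1), match.group(2)
print(lang, title)  # en Reddit

# Anonymize a (hypothetical) Reddit identifier with SHA-256, as done for
# post/comment/user/subreddit IDs in the dataset.
hashed_id = hashlib.sha256("t3_abc123".encode()).hexdigest()
print(len(hashed_id))  # 64
```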

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

    Table: posts

    Column Name | Type | Description
    subreddit_id | TEXT | The unique identifier for the subreddit.
    crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost.
    post_id | TEXT | Unique identifier for the Reddit post.
    created_at | TIMESTAMP | The timestamp when the post was created.
    updated_at | TIMESTAMP | The timestamp when the post was last updated.
    language_code | TEXT | The language code of the post.
    score | INTEGER | The score (upvotes minus downvotes) of the post.
    upvote_ratio | REAL | The ratio of upvotes to total votes.
    gildings | INTEGER | Number of awards (gildings) received by the post.
    num_comments | INTEGER | Number of comments on the post.

    Table: comments

    Column Name | Type | Description
    subreddit_id | TEXT | The unique identifier for the subreddit.
    post_id | TEXT | The ID of the Reddit post the comment belongs to.
    parent_id | TEXT | The ID of the parent comment (if a reply).
    comment_id | TEXT | Unique identifier for the comment.
    created_at | TIMESTAMP | The timestamp when the comment was created.
    last_modified_at | TIMESTAMP | The timestamp when the comment was last modified.
    score | INTEGER | The score (upvotes minus downvotes) of the comment.
    upvote_ratio | REAL | The ratio of upvotes to total votes for the comment.
    gilded | INTEGER | Number of awards (gildings) received by the comment.

    Table: postlinks

    Column Name | Type | Description
    post_id | TEXT | Unique identifier for the Reddit post.
    end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL.
    end_processed_url | TEXT | The extracted URL from the Reddit post.
    final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections.
    final_status | INTEGER | HTTP status code of the final URL.
    final_url | TEXT | The final URL after redirections.
    redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0).
    in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0).

    Table: commentlinks

    Column Name | Type | Description
    comment_id | TEXT | Unique identifier for the Reddit comment.
    end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL.
    end_processed_url | TEXT | The extracted URL from the comment.
    final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections.
    final_status | INTEGER | HTTP status code of the final
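    The documented schema can be instantiated locally to prototype queries before working with the full database. A sketch using Python's sqlite3 with a subset of the posts columns (the inserted values are illustrative, not real data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE posts (
        subreddit_id TEXT,
        post_id      TEXT,
        created_at   TIMESTAMP,
        score        INTEGER,
        upvote_ratio REAL,
        num_comments INTEGER
    )
""")
conn.execute("INSERT INTO posts VALUES ('t5_2qh1i', 'p1', '2021-06-01 12:00:00', 42, 0.97, 7)")
conn.execute("INSERT INTO posts VALUES ('t5_2qh1i', 'p2', '2022-03-15 08:30:00', 10, 0.88, 2)")

# Mean score per subreddit -- the kind of aggregate the full database supports.
row = conn.execute(
    "SELECT subreddit_id, AVG(score) FROM posts GROUP BY subreddit_id"
).fetchone()
print(row)  # ('t5_2qh1i', 26.0)
```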

  18. Metadata of a Large Sonar and Stereo Camera Dataset Suitable for...

    • data.niaid.nih.gov
    Updated Jul 8, 2024
    Backe, Christian; Wehbe, Bilal; Bande, Miguel; Shah, Nimish; Cesar, Diego; Pribbernow, Max (2024). Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10373153
    Explore at:
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    German Research Center for Artificial Intelligence (DFKI)
    Kraken Robotik GmbH
    Authors
    Backe, Christian; Wehbe, Bilal; Bande, Miguel; Shah, Nimish; Cesar, Diego; Pribbernow, Max
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation

    Introduction

    This is a set of metadata describing a large dataset of synchronized sonar and stereo camera recordings that were captured between August 2021 and September 2023 during the project DeeperSense (https://robotik.dfki-bremen.de/en/research/projects/deepersense/), as training data for Sonar-to-RGB image translation. Parts of the sensor data have been published (https://zenodo.org/records/7728089, https://zenodo.org/records/10220989). Due to the size of the sensor data corpus, it is currently impractical to make the entire corpus accessible online. Instead, this metadatabase serves as a relatively compact representation, allowing interested researchers to inspect the data and select relevant portions for their particular use case, which will be made available on demand. This is an effort to comply with the FAIR principle A2 (https://www.go-fair.org/fair-principles/) that metadata shall be accessible even when the base data is not immediately accessible.

    Locations and sensors

    The sensor data was captured at four different locations, including one laboratory (Maritime Exploration Hall at DFKI RIC Bremen) and three field locations (Chalk Lake Hemmoor, Tank Wash Basin Neu-Ulm, Lake Starnberg). At all locations, a ZED camera and a Blueprint Oculus M1200d sonar were used. Additionally, a SeaVision camera was used at the Maritime Exploration Hall at DFKI RIC Bremen and at the Chalk Lake Hemmoor. The examples/ directory holds a typical output image for each sensor at each available location.

    Data volume per session

    Six data collection sessions were conducted. The table below presents an overview of the amount of data captured in each session:

    Session dates | Location | Number of datasets | Total duration of datasets [h] | Total logfile size [GB] | Number of images | Total image size [GB]
    2021-08-09 - 2021-08-12 | Maritime Exploration Hall at DFKI RIC Bremen | 52 | 10.8 | 28.8 | 389’047 | 88.1
    2022-02-07 - 2022-02-08 | Maritime Exploration Hall at DFKI RIC Bremen | 35 | 4.4 | 54.1 | 629’626 | 62.3
    2022-04-26 - 2022-04-28 | Chalk Lake Hemmoor | 52 | 8.1 | 133.6 | 1’114’281 | 97.8
    2022-06-28 - 2022-06-29 | Tank Wash Basin Neu-Ulm | 42 | 6.7 | 144.2 | 824’969 | 26.9
    2023-04-26 - 2023-04-27 | Maritime Exploration Hall at DFKI RIC Bremen | 55 | 7.4 | 141.9 | 739’613 | 9.6
    2023-09-01 - 2023-09-02 | Lake Starnberg | 19 | 2.9 | 40.1 | 217’385 | 2.3
    Total | (all locations) | 255 | 40.3 | 542.7 | 3’914’921 | 287.0

    Data and metadata structure

    Sensor data corpus

    The sensor data corpus comprises two processing stages:

    raw data streams stored in ROS bagfiles (aka logfiles),

    camera and sonar images (aka datafiles) extracted from the logfiles.

    The files are stored in a file tree hierarchy which groups them by session, dataset, and modality:

    ${session_key}/
        ${dataset_key}/
            ${logfile_name}
            ${modality_key}/
                ${datafile_name}

    A typical logfile path has this form:

    2023-09_starnberg_lake/
        2023-09-02-15-06_hydraulic_drill/
            stereo_camera-zed-2023-09-02-15-06-07.bag

    A typical datafile path has this form:

    2023-09_starnberg_lake/
        2023-09-02-15-06_hydraulic_drill/
            zed_right/
                1693660038_368077993.jpg

    All directory and file names, and their particles, are designed to serve as identifiers in the metadatabase. Their formatting, as well as the definitions of all terms, are documented in the file entities.json.
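    Given that convention, a datafile path decomposes directly into its identifier particles. A sketch, with particle names following the hierarchy shown above (the authoritative definitions live in entities.json; treating the second datafile-name particle as nanoseconds is an assumption):

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

path = PurePosixPath(
    "2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/zed_right/1693660038_368077993.jpg"
)
session_key, dataset_key, modality_key, datafile_name = path.parts

# The datafile name encodes seconds since the UNIX epoch plus a
# fractional-second particle (assumed here to be nanoseconds).
seconds, fraction = PurePosixPath(datafile_name).stem.split("_")
captured_at = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
print(session_key, modality_key, captured_at.isoformat())
# 2023-09_starnberg_lake zed_right 2023-09-02T13:07:18+00:00
```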

    Metadatabase

    The metadatabase is provided in two equivalent forms:

    as a standalone SQLite (https://www.sqlite.org/index.html) database file metadata.sqlite for users familiar with SQLite,

    as a collection of CSV files in the csv/ directory for users who prefer other tools.

    The database file has been generated from the CSV files, so each database table holds the same information as the corresponding CSV file. In addition, the metadatabase contains a series of convenience views that facilitate access to certain aggregate information.

    An entity relationship diagram of the metadatabase tables is stored in the file entity_relationship_diagram.png. Each entity, its attributes, and relations are documented in detail in the file entities.json.

    Some general design remarks:

    For convenience, timestamps are always given in both a human-readable form (ISO 8601 formatted datetime strings with explicit local time zone), and as seconds since the UNIX epoch.

    In practice, each logfile always contains a single stream, and each stream is always stored in a single logfile. Per the database schema, however, the entities stream and logfile are modeled separately, with a “many-streams-to-one-logfile” relationship. This design was chosen to be compatible with, and open to, data collections where a single logfile contains multiple streams.

    A modality is not an attribute of a sensor alone, but of a datafile, because a sensor is an attribute of a stream, and a single stream may be the source of multiple modalities (e.g. RGB vs. grayscale images from the same camera, or cartesian vs. polar projections of the same sonar output). Conversely, the same modality may originate from different sensors.

    As a usage example, the data volume per session, which is tabulated at the top of this document, can be extracted from the metadatabase with the following SQL query:

    SELECT
        PRINTF('%s - %s',
            SUBSTR(session_start, 1, 10),
            SUBSTR(session_end, 1, 10))  AS 'Session dates',
        location_name_english            AS Location,
        number_of_datasets               AS 'Number of datasets',
        total_duration_of_datasets_h     AS 'Total duration of datasets [h]',
        total_logfile_size_gb            AS 'Total logfile size [GB]',
        number_of_images                 AS 'Number of images',
        total_image_size_gb              AS 'Total image size [GB]'
    FROM location
    JOIN session USING (location_id)
    JOIN (
        SELECT
            session_id,
            COUNT(dataset_id)                        AS number_of_datasets,
            ROUND(SUM(dataset_duration) / 3600, 1)   AS total_duration_of_datasets_h,
            ROUND(SUM(total_logfile_size) / 10e9, 1) AS total_logfile_size_gb
        FROM location
        JOIN session USING (location_id)
        JOIN dataset USING (session_id)
        JOIN view_dataset_total_logfile_size USING (dataset_id)
        GROUP BY session_id
    ) USING (session_id)
    JOIN (
        SELECT
            session_id,
            COUNT(datafile_id)                  AS number_of_images,
            ROUND(SUM(datafile_size) / 10e9, 1) AS total_image_size_gb
        FROM session
        JOIN dataset USING (session_id)
        JOIN stream USING (dataset_id)
        JOIN datafile USING (stream_id)
        GROUP BY session_id
    ) USING (session_id)
    ORDER BY session_id;
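    Queries like the one above run directly against the metadata.sqlite file. A minimal sketch of the access pattern with Python's sqlite3, using small mock tables that stand in for the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # substitute "metadata.sqlite" for the real metadatabase
conn.executescript("""
    CREATE TABLE location (location_id INTEGER, location_name_english TEXT);
    CREATE TABLE session  (session_id INTEGER, location_id INTEGER);
    CREATE TABLE dataset  (dataset_id INTEGER, session_id INTEGER, dataset_duration REAL);
    INSERT INTO location VALUES (1, 'Lake Starnberg');
    INSERT INTO session  VALUES (10, 1);
    INSERT INTO dataset  VALUES (100, 10, 1800.0), (101, 10, 5400.0);
""")

# Per-session dataset count and total duration in hours, in the spirit of the
# full usage-example query.
row = conn.execute("""
    SELECT location_name_english,
           COUNT(dataset_id),
           ROUND(SUM(dataset_duration) / 3600, 1)
    FROM location
    JOIN session USING (location_id)
    JOIN dataset USING (session_id)
    GROUP BY session_id
""").fetchone()
print(row)  # ('Lake Starnberg', 2, 2.0)
```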

  19. SQL Query Audit Tools Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Dataintelo (2025). SQL Query Audit Tools Market Research Report 2033 [Dataset]. https://dataintelo.com/report/sql-query-audit-tools-market
    Explore at:
    pdf, pptx, csvAvailable download formats
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    SQL Query Audit Tools Market Outlook



    According to our latest research, the SQL Query Audit Tools market size reached USD 1.26 billion in 2024, reflecting robust adoption across multiple industries. The market is projected to expand at a CAGR of 13.2% from 2025 to 2033, culminating in a forecasted market value of USD 3.69 billion by 2033. This substantial growth trajectory is primarily driven by the escalating demand for robust database security and compliance solutions in an era marked by increasingly stringent data privacy regulations and an upsurge in cyber threats targeting sensitive business information.




    One of the most significant growth factors for the SQL Query Audit Tools market is the rising complexity and volume of enterprise data. Organizations across sectors are generating and handling massive amounts of structured and unstructured data, necessitating advanced auditing mechanisms to ensure data integrity, compliance, and security. The proliferation of digital transformation initiatives, cloud migration, and the adoption of big data analytics have further underscored the need for sophisticated tools capable of auditing SQL queries in real-time. These tools not only help organizations identify suspicious activities and unauthorized access but also play a pivotal role in maintaining regulatory compliance, especially in highly regulated industries such as banking, financial services, and healthcare.




    The growing regulatory landscape is another key driver propelling the SQL Query Audit Tools market. Governments and regulatory bodies worldwide have introduced stringent data protection laws such as GDPR, HIPAA, and CCPA, compelling organizations to implement comprehensive audit trails for all database activities. SQL query audit tools offer granular visibility into database transactions, enabling companies to demonstrate compliance and avoid hefty fines associated with non-compliance. Furthermore, as cyberattacks become more sophisticated, organizations are increasingly recognizing the value of proactive monitoring and auditing solutions that can detect anomalies, prevent data breaches, and support forensic investigations in the event of security incidents.




    Technological advancements and the integration of artificial intelligence and machine learning into SQL query audit tools are also fueling market expansion. Modern solutions are leveraging AI-driven analytics to automate anomaly detection, streamline compliance reporting, and enhance the accuracy of security alerts. Additionally, the shift towards cloud-based deployments is making these tools more accessible to small and medium enterprises (SMEs), which historically faced barriers due to high upfront costs and resource constraints. The combination of technological innovation, regulatory pressure, and the increasing importance of data governance is expected to sustain the strong growth momentum of the SQL Query Audit Tools market in the coming years.




    Regionally, North America currently dominates the SQL Query Audit Tools market, accounting for the largest share in 2024, followed by Europe and the Asia Pacific. The United States, in particular, is witnessing significant adoption driven by the presence of large enterprises, advanced IT infrastructure, and a highly regulated business environment. Europe is also experiencing robust growth, fueled by stringent data protection regulations and increasing investments in cybersecurity solutions. Meanwhile, the Asia Pacific region is poised for the fastest growth over the forecast period, supported by rapid digitalization, expanding IT and telecommunications sectors, and rising awareness about data security among enterprises in emerging economies such as China and India.



    Component Analysis



    The Component segment of the SQL Query Audit Tools market is bifurcated into software and services, each playing a critical role in the overall ecosystem. Software solutions form the backbone of the market, encompassing standalone audit tools, integrated database management platforms, and advanced analytics engines. These software offerings are designed to monitor, log, and analyze SQL queries in real-time, providing detailed audit trails and actionable insights for security, compliance, and performance optimization. The demand for feature-rich, scalable, and user-friendly software is on the rise as organizations seek to automate audit processes and minimize manual intervention.




  20. Database Performance Monitoring Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 29, 2025
    Growth Market Reports (2025). Database Performance Monitoring Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/database-performance-monitoring-market
    Explore at:
    pdf, pptx, csvAvailable download formats
    Dataset updated
    Aug 29, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Database Performance Monitoring Market Outlook



    According to our latest research, the global database performance monitoring market size reached USD 2.47 billion in 2024. The market is experiencing robust expansion, driven by the increasing complexity of database environments and the critical need for real-time data access and analytics. With a compound annual growth rate (CAGR) of 13.2% from 2025 to 2033, the market is projected to reach USD 6.41 billion by 2033. The surge in digital transformation initiatives, cloud migration, and the proliferation of data-intensive applications are among the key factors propelling this growth trajectory.




    One of the primary growth drivers for the database performance monitoring market is the exponential rise in data generation across industries. Organizations are increasingly leveraging advanced analytics, artificial intelligence, and machine learning, which require high-performing, reliable databases to deliver actionable insights in real time. As enterprises adopt multi-cloud and hybrid environments, the challenges of managing and monitoring database performance intensify, necessitating sophisticated monitoring solutions. These solutions offer proactive identification and resolution of performance bottlenecks, ensuring business continuity and optimal user experiences. The emphasis on digital agility and operational efficiency further underscores the importance of investing in robust database performance monitoring tools.




    Another significant factor contributing to market growth is the evolving regulatory landscape and the need for compliance across sectors such as BFSI, healthcare, and government. Regulatory requirements around data integrity, security, and availability have made database monitoring indispensable for organizations aiming to avoid costly downtime and potential penalties. As cyber threats become more sophisticated, database performance monitoring solutions play a crucial role in detecting anomalies, preventing data breaches, and maintaining compliance with global standards such as GDPR, HIPAA, and PCI DSS. The integration of advanced features like predictive analytics, automated troubleshooting, and real-time alerting further enhances the value proposition of these solutions, making them a vital component of modern IT infrastructure.




    The market is also being shaped by the rapid adoption of cloud-based database solutions. As enterprises migrate their workloads to public, private, and hybrid clouds, the need for cloud-native monitoring capabilities becomes paramount. Cloud-based database performance monitoring tools offer scalability, flexibility, and seamless integration with diverse cloud platforms, enabling organizations to manage complex, distributed environments efficiently. The shift towards DevOps and agile development practices has also accelerated the demand for continuous monitoring and performance optimization throughout the application lifecycle. This trend is particularly pronounced among small and medium enterprises, which are leveraging cloud-based solutions to compete with larger players and drive innovation.




    Regionally, North America continues to dominate the database performance monitoring market, accounting for the largest market share in 2024. The region's leadership is attributed to the high concentration of technology-driven enterprises, early adoption of advanced IT solutions, and substantial investments in cloud infrastructure. Europe and Asia Pacific are also witnessing significant growth, fueled by increasing digitalization, expanding IT budgets, and the emergence of new business models. In particular, Asia Pacific is expected to exhibit the highest CAGR during the forecast period, driven by rapid economic development, rising internet penetration, and a burgeoning startup ecosystem. The competitive landscape is characterized by the presence of global and regional players, each striving to enhance their offerings through innovation and strategic partnerships.



    As organizations strive to optimize their database environments, SQL Performance Tuning Tools have become indispensable. These tools are designed to enhance the efficiency of SQL queries, reduce response times, and improve overall database performance. By analyzing query execution plans and identifying bottlenecks, SQL Performance Tuning Tools enable database administrators to resolve performance issues before they affect end users.

WikiSQL (Questions and SQL Queries)

80654 hand-annotated questions and SQL queries on 24241 Wikipedia tables


Columns

File: validation.csv

| Column name | Description                                              |
|:------------|:---------------------------------------------------------|
| phase       | The phase of the data collection. (String)               |
| question    | The question asked by the user. (String)                 |
| table       | The table containing the data for the question. (String) |
| sql         | The SQL query corresponding to the question. (String)    |

Files train.csv and test.csv share the same columns as validation.csv.
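A minimal sketch of reading rows in this schema with Python's standard csv module. The inline sample row below is an invented placeholder for illustration, not actual dataset content; in practice you would pass an open handle to validation.csv, train.csv, or test.csv instead:

```python
import csv
import io

# Hypothetical row in the documented schema (phase, question, table, sql).
sample = io.StringIO(
    "phase,question,table,sql\n"
    '1,"How many singers are there?",table_1,"SELECT COUNT(*) FROM table_1"\n'
)

# DictReader maps each row to the header names, so fields can be
# accessed by column name rather than position.
rows = list(csv.DictReader(sample))
```

Because all three split files share the same columns, the same reader code works unchanged across validation, train, and test.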

Acknowledgements

If you use this dataset in your research, please credit the original authors and Huggingface Hub.
