By Huggingface Hub [source]
A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.
The data fields are the same among all splits; each file records the phase, the natural-language question, the table, and the corresponding SQL query for every example. Possible uses include:
- This dataset can be used to develop natural language interfaces for relational databases.
- This dataset can be used to develop a knowledge base of common SQL queries.
- This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries.
License
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/). No copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
Files: train.csv, validation.csv, and test.csv all share the same schema:

| Column name | Description |
|:--------------|:---------------------------------------------------------|
| phase | The phase of the data collection. (String) |
| question | The question asked by the user. (String) |
| table | The table containing the data for the question. (String) |
| sql | The SQL query corresponding to the question. (String) |
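The splits load directly with pandas. A minimal sketch, assuming the three CSV files above sit in the working directory:

```
# Load the three splits and inspect the shared schema described above.
import pandas as pd

splits = {name: pd.read_csv(f"{name}.csv") for name in ("train", "validation", "test")}

for name, df in splits.items():
    print(name, df.shape, list(df.columns))  # phase, question, table, sql

# Peek at one hand-annotated example: a question and its SQL query.
example = splits["train"].iloc[0]
print(example["question"])
print(example["sql"])
```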
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
License: Apache License v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is built for Text-to-SQL (NL → SQL) tasks, helping train models to convert natural language into SQL queries. It is ideal for fine-tuning LLMs, developing AI-powered database assistants, and improving SQL query generation accuracy.
Each row contains the following fields, which the sketch after the list assembles into a training prompt:
- 📝 Instruction – A natural language query (e.g., "Find all customers who placed an order in the last 30 days.")
- 📊 Query – The corresponding SQL statement (e.g., SELECT * FROM orders WHERE order_date >= DATE_SUB(NOW(), INTERVAL 30 DAY);)
- 🗄️ Database – Contains metadata such as:
- Table Names – The relevant tables for the query (e.g., orders, customers)
- Column Names – The specific fields used in the query (e.g., order_date, customer_id)
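A hedged sketch of that assembly; the helper name, prompt template, and lower-case key names are illustrative assumptions, not part of the dataset specification:

```
# Hypothetical helper: flatten one row (instruction, query, database
# metadata) into a single fine-tuning prompt string.
def build_prompt(row: dict) -> str:
    tables = ", ".join(row["database"]["table_names"])
    columns = ", ".join(row["database"]["column_names"])
    return (
        f"-- Tables: {tables}\n"
        f"-- Columns: {columns}\n"
        f"-- Task: {row['instruction']}\n"
        f"{row['query']}"
    )

# Invented example row mirroring the fields listed above.
row = {
    "instruction": "Find all customers who placed an order in the last 30 days.",
    "query": "SELECT * FROM orders WHERE order_date >= DATE_SUB(NOW(), INTERVAL 30 DAY);",
    "database": {
        "table_names": ["orders", "customers"],
        "column_names": ["order_date", "customer_id"],
    },
}
print(build_prompt(row))
```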
According to our latest research, the SQL Query Optimization with AI market size reached USD 1.32 billion in 2024, propelled by the rapid adoption of artificial intelligence in database management and analytics. The market is projected to grow at a robust CAGR of 22.1% from 2025 to 2033, reaching a forecasted value of USD 9.85 billion by 2033. This remarkable growth is primarily driven by the increasing need for real-time data processing, the proliferation of complex data environments, and the demand for enhanced application performance across industries.
The surge in digital transformation initiatives across various sectors is one of the most significant growth factors for the SQL Query Optimization with AI market. Enterprises are increasingly relying on data-driven decision-making, which necessitates efficient and scalable database systems. AI-powered SQL query optimization tools help organizations streamline query execution, reduce latency, and maximize resource utilization. With the explosion of big data and the adoption of cloud-based infrastructures, businesses are seeking advanced solutions that can automate query tuning, detect anomalies, and dynamically adapt to changing workloads. The integration of machine learning algorithms into SQL optimization processes is enabling predictive analytics, self-healing databases, and automated performance tuning, further fueling market expansion.
Another key driver is the escalating complexity of enterprise data ecosystems. Organizations today manage vast volumes of structured and unstructured data from multiple sources, including IoT devices, transactional systems, and external APIs. As data environments grow more intricate, manual query optimization becomes increasingly impractical and error-prone. AI-driven SQL optimization platforms address these challenges by continuously monitoring query performance, identifying bottlenecks, and suggesting optimal execution plans. This not only improves database efficiency but also reduces the burden on database administrators, allowing them to focus on higher-value tasks. The growing adoption of hybrid and multi-cloud strategies is also contributing to the demand for intelligent query optimization solutions that ensure consistent performance across diverse environments.
Furthermore, the rise of regulatory compliance requirements and data privacy concerns is pushing organizations to invest in advanced database management solutions. AI-powered SQL query optimization tools can help ensure data integrity, minimize risks, and maintain compliance with industry standards such as GDPR, HIPAA, and PCI DSS. By automating query auditing, access control, and anomaly detection, these solutions enhance security and transparency in data operations. The increasing emphasis on customer experience, operational agility, and cost optimization is prompting enterprises to adopt AI-enabled query optimization as a strategic differentiator, driving sustained growth in the market.
From a regional perspective, North America currently dominates the SQL Query Optimization with AI market, accounting for the largest revenue share due to the presence of leading technology vendors, early adoption of AI, and a mature IT infrastructure. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by rapid digitalization, expanding cloud adoption, and the emergence of data-centric business models in countries like China, India, and Japan. Europe is also experiencing steady growth, fueled by stringent data protection regulations and increasing investments in AI-driven database management solutions. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, supported by government initiatives to promote digital transformation and the growing penetration of cloud services.
The Component segment of the SQL Query Optimization with AI market is categorized into Software, Hardware, and Services. Software solutions represent the largest share of the market, as they form the backbone of AI-driven query optimization processes. These include advanced query analyzers, AI-powered database management platforms, and automated performance tuning tools that leverage machine learning algorithms to optimize SQL queries in real time. The proliferation of open-source frameworks and the integration of AI capabilities into existing database management platforms are further strengthening this segment.
According to our latest research, the global SQL Acceleration Engine market size reached USD 2.3 billion in 2024, exhibiting robust momentum driven by the rapid digital transformation across industries. The market is set to expand at a CAGR of 16.2% from 2025 to 2033, propelling the total market value to approximately USD 9.4 billion by 2033. This remarkable growth is primarily fueled by the escalating demand for real-time analytics, the proliferation of big data, and the increasing adoption of cloud-based solutions. As per our latest research, organizations worldwide are prioritizing data-driven decision-making, thereby accelerating investments in advanced SQL acceleration engines to optimize database performance and reduce query latency.
The primary growth factor underpinning the SQL Acceleration Engine market is the exponential increase in data generation from diverse sources such as IoT devices, social media, enterprise applications, and e-commerce platforms. Enterprises are grappling with the challenge of processing and analyzing massive volumes of structured and unstructured data efficiently. SQL acceleration engines play a pivotal role in enhancing the speed and efficiency of SQL queries, which is critical for delivering timely insights and maintaining a competitive edge. This surge in data-centric operations has compelled organizations to seek advanced solutions capable of handling complex queries and large datasets, thereby driving market expansion.
Another significant driver is the widespread adoption of cloud computing across various sectors. Cloud-based SQL acceleration engines offer scalability, flexibility, and cost-effectiveness, enabling organizations to seamlessly manage fluctuating workloads and data volumes. The shift towards hybrid and multi-cloud environments further amplifies the need for advanced SQL acceleration solutions that can ensure high performance and low latency regardless of deployment architecture. Additionally, the integration of artificial intelligence and machine learning into SQL acceleration engines is enhancing their capabilities, allowing for automated query optimization and intelligent workload management, which further propels market growth.
The increasing focus on real-time analytics and business intelligence is also contributing to the market’s robust growth trajectory. Modern enterprises require instant access to actionable insights to make informed decisions, streamline operations, and enhance customer experiences. SQL acceleration engines enable rapid query processing, facilitating real-time data analysis and visualization. This is particularly crucial in sectors such as BFSI, healthcare, and retail, where timely insights can significantly impact business outcomes. Furthermore, the growing emphasis on digital transformation and the adoption of advanced analytics tools are expected to sustain the demand for SQL acceleration engines in the foreseeable future.
From a regional perspective, North America dominates the SQL Acceleration Engine market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The region’s leadership is attributed to the presence of major technology providers, early adoption of advanced database solutions, and substantial investments in cloud infrastructure. Asia Pacific, on the other hand, is witnessing the fastest growth, driven by the rapid digitization of enterprises, expanding IT sector, and increasing adoption of cloud-based analytics solutions. Meanwhile, Europe continues to demonstrate steady growth, supported by stringent data regulations and a strong focus on data-driven innovation across industries.
The SQL Acceleration Engine market is segmented by component into software, hardware, and services, each playing a distinct role in the overall ecosystem. The software segment holds the largest share, driven by the continuous innovation in SQL query optimization algorithms and the integration of advanced analytics capabilities.
In the past, the majority of data analysis use cases were addressed by aggregating relational data. In recent years, a trend called "Big Data" has been evolving, with several implications for the field of data analysis. Compared to previous applications, much larger data sets are analyzed using more elaborate and diverse analysis methods such as information extraction techniques, data mining algorithms, and machine learning methods. At the same time, analysis applications include data sets with little or even no structure at all. This evolution has implications for the requirements on data processing systems. Due to the growing size of data sets and the increasing computational complexity of advanced analysis methods, data must be processed in a massively parallel fashion. The large number and diversity of data analysis techniques, as well as the lack of data structure, motivate the use of user-defined functions and data types. Many traditional database systems are not flexible enough to satisfy these requirements. Hence, there is a need for programming abstractions to define and efficiently execute complex parallel data analysis programs that support custom user-defined operations. The success of the SQL query language has shown the advantages of declarative query specification, such as the potential for optimization and ease of use. Today, most relational database management systems feature a query optimizer that compiles declarative queries into physical execution plans. Cost-based optimizers choose from billions of plan candidates the plan with the least estimated cost. However, traditional optimization techniques cannot be readily integrated into systems that aim to support novel data analysis use cases. For example, the use of user-defined functions (UDFs) can significantly limit the optimization potential of data analysis programs. Furthermore, a lack of detailed data statistics is common when large amounts of unstructured data are analyzed. This leads to imprecise optimizer cost estimates, which can cause sub-optimal plan choices.

In this thesis we address three challenges that arise in the context of specifying and optimizing data analysis programs. First, we propose a parallel programming model with declarative properties to specify data analysis tasks as data flow programs. In this model, data processing operators are composed of a system-provided second-order function and a user-defined first-order function. A cost-based optimizer compiles data flow programs specified in this abstraction into parallel data flows. The optimizer borrows techniques from relational optimizers and ports them to the domain of general-purpose parallel programming models. Second, we propose an approach to enhance the optimization of data flow programs that include UDF operators with unknown semantics. We identify operator properties and conditions to reorder neighboring UDF operators without changing the semantics of the program. We show how to automatically extract these properties from UDF operators by leveraging static code analysis techniques. Our approach is able to emulate relational optimizations such as filter and join reordering and holistic aggregation push-down, while not being limited to relational operators. Finally, we analyze the impact of changing execution conditions, such as varying predicate selectivities and memory budgets, on the performance of relational query plans.
We identify plan patterns that cause significantly varying execution performance under changing execution conditions. Plans that include such risky patterns are prone to cause problems in the presence of imprecise optimizer estimates. Based on our findings, we introduce an approach to avoid risky plan choices. Moreover, we present a method to assess the risk of a query execution plan using a machine-learned prediction model. Experiments show that the prediction model outperforms risk predictions computed from optimizer estimates.
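The UDF reordering condition can be made concrete with a toy sketch: two neighboring operators may be swapped when neither writes a field the other reads or writes. This is an illustration under the assumption that read/write sets have already been extracted by static code analysis; it is not the thesis's actual implementation:

```
# Conservative conflict test over extracted read/write field sets.
def can_reorder(op_a: dict, op_b: dict) -> bool:
    conflicts = (
        op_a["writes"] & (op_b["reads"] | op_b["writes"])
        or op_b["writes"] & (op_a["reads"] | op_a["writes"])
    )
    return not conflicts

filter_udf = {"reads": {"price"}, "writes": set()}          # selective filter
enrich_udf = {"reads": {"name"}, "writes": {"name_upper"}}  # derives a new field

# The filter touches no field the enrichment writes, so pushing the cheap,
# selective filter ahead of the enrichment preserves program semantics.
print(can_reorder(filter_udf, enrich_udf))  # True
```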
The Structured Query Language (SQL) server transformation market is experiencing robust growth, projected to reach $15 million in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 9.4% from 2025 to 2033. This expansion is fueled by several key drivers. The increasing adoption of cloud-based solutions and the rise of big data analytics are pushing organizations to adopt more efficient and scalable SQL server solutions. Furthermore, the growing demand for real-time data processing and improved data integration capabilities within large enterprises and SMEs is significantly driving market growth. The market segmentation reveals strong demand across various application areas, with large enterprises leading the way due to their greater need for robust and scalable data management infrastructure. Data integration scripts remain a prominent segment, highlighting the critical need for seamless data flow across diverse systems.

The competitive landscape is marked by established players like Oracle, IBM, and Microsoft, alongside emerging innovative companies specializing in cloud-based SQL server technologies. Geographic analysis suggests North America and Europe currently hold the largest market share, but significant growth potential exists in the Asia-Pacific region, driven by rapid digital transformation and economic growth in countries like India and China.

The restraints on market growth are primarily related to the complexities involved in migrating existing legacy systems to new SQL server solutions, along with the need for skilled professionals to manage and optimize these systems. However, the ongoing advancements in automation tools and the increased availability of training programs are mitigating these challenges.

The future trajectory of the market indicates continued growth, driven by emerging technologies such as AI-powered query optimization, enhanced security features, and the growing adoption of serverless architectures. This will lead to a wider adoption of SQL server transformation across various sectors, including finance, healthcare, and retail, as organizations seek to leverage data to gain competitive advantage and improve operational efficiency. The market is ripe for innovation and consolidation, with opportunities for both established players and new entrants to capitalize on this ongoing transformation.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
It contains the following files:
- spider-realistic.json
# The spider-realistic evaluation set
# Examples: 508
# Databases: 19
- dev.json
# The original dev split of Spider
# Examples: 1034
# Databases: 20
- tables.json
# The original DB schemas from Spider
# Databases: 166
- README.txt
- license
The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al., "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
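A minimal loading sketch; the per-example field names (question, query, db_id) follow the standard Spider JSON format, so check the Spider GitHub page above if they differ:

```
# Load the evaluation set and count examples per database.
import json
from collections import Counter

with open("spider-realistic.json", encoding="utf-8") as f:
    examples = json.load(f)

print(len(examples))  # expected: 508
db_counts = Counter(ex["db_id"] for ex in examples)
print(len(db_counts))  # expected: 19 databases
print(db_counts.most_common(5))
```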
This dataset is distributed under the CC BY-SA 4.0 license.
If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al., 2018, and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.
@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}
@inproceedings{Yu&al.18c,
year = 2018,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
booktitle = {EMNLP},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev}
}
@InProceedings{P18-1033,
author = "Finegan-Dollak, Catherine
and Kummerfeld, Jonathan K.
and Zhang, Li
and Ramanathan, Karthik
and Sadasivam, Sesh
and Zhang, Rui
and Radev, Dragomir",
title = "Improving Text-to-SQL Evaluation Methodology",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "351--360",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1033"
}
@InProceedings{data-sql-imdb-yelp,
dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
title = {SQLizer: Query Synthesis from Natural Language},
booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
month = {October},
year = {2017},
pages = {63:1--63:26},
url = {http://doi.org/10.1145/3133887},
}
@article{data-academic,
dataset = {Academic},
author = {Fei Li and H. V. Jagadish},
title = {Constructing an Interactive Natural Language Interface for Relational Databases},
journal = {Proceedings of the VLDB Endowment},
volume = {8},
number = {1},
month = {September},
year = {2014},
pages = {73--84},
url = {http://dx.doi.org/10.14778/2735461.2735468},
}
@InProceedings{data-atis-geography-scholar,
dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
title = {Learning a Neural Semantic Parser from User Feedback},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2017},
pages = {963--973},
location = {Vancouver, Canada},
url = {http://www.aclweb.org/anthology/P17-1089},
}
@inproceedings{data-geography-original,
dataset = {Geography, original},
author = {John M. Zelle and Raymond J. Mooney},
title = {Learning to Parse Database Queries Using Inductive Logic Programming},
booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
year = {1996},
pages = {1050--1055},
location = {Portland, Oregon},
url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}
@inproceedings{data-restaurants-logic,
author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
year = {2000},
pages = {133--141},
location = {Hong Kong, China},
url = {http://www.aclweb.org/anthology/W00-1317},
}
@inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
title = {Towards a Theory of Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
year = {2003},
location = {Miami, Florida, USA},
pages = {149--157},
url = {http://doi.acm.org/10.1145/604045.604070},
}
@inproceedings{data-restaurants,
author = {Alessandra Giordani and Alessandro Moschitti},
title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
year = {2012},
location = {Montpellier, France},
pages = {59--76},
url = {https://doi.org/10.1007/978-3-642-45260-4_5},
}
According to our latest research, the global SQL Performance Tuning Tools market size reached USD 1.42 billion in 2024, exhibiting robust expansion driven by the surging need for optimized database management and real-time analytics across enterprises. The market is poised to grow at a CAGR of 9.7% from 2025 to 2033, with the forecasted value expected to hit USD 3.27 billion by 2033. This growth is primarily attributed to the increasing complexity of database environments, the proliferation of data-driven applications, and the urgent demand for high availability and efficiency in mission-critical business operations. As organizations continue to digitize and scale their infrastructure, SQL performance tuning tools are becoming indispensable for ensuring seamless data processing and superior user experiences.
A significant growth factor for the SQL Performance Tuning Tools market is the exponential increase in data volumes generated by organizations worldwide. Enterprises are embracing digital transformation initiatives, leading to a surge in transactional and analytical workloads that demand high-performing databases. SQL performance tuning tools play a pivotal role in identifying, diagnosing, and resolving performance bottlenecks within SQL queries and database configurations. With the adoption of advanced analytics, artificial intelligence, and machine learning, organizations are generating and processing more data than ever before, necessitating robust tools to ensure optimal database performance. This trend is particularly pronounced in sectors such as BFSI, healthcare, and e-commerce, where data-driven decision-making and real-time insights are critical for competitive advantage.
Another key driver is the growing complexity of IT environments, particularly with the rise of hybrid and multi-cloud deployments. As enterprises migrate workloads to cloud platforms and integrate on-premises systems with cloud-based solutions, managing and tuning SQL databases becomes increasingly challenging. SQL performance tuning tools enable IT teams to monitor and optimize database performance across diverse and distributed environments, ensuring consistency, reliability, and scalability. These tools offer advanced features such as automated query optimization, real-time monitoring, and predictive analytics, which are essential for maintaining service-level agreements (SLAs) and minimizing downtime. The increasing reliance on cloud infrastructure, coupled with the need for agile and resilient database management, is expected to further propel market growth.
The expanding ecosystem of database technologies and the proliferation of open-source SQL databases are also fueling demand for performance tuning solutions. Organizations are adopting a wide range of relational and non-relational databases to support diverse workloads, leading to greater heterogeneity in database environments. This diversity introduces new challenges in performance management, as traditional tuning methods may not be effective across different platforms. SQL performance tuning tools are evolving to support a broad spectrum of database engines, providing unified visibility and optimization capabilities. As businesses strive to deliver high-quality digital experiences and minimize operational costs, the adoption of advanced tuning tools is becoming a strategic imperative.
From a regional perspective, North America continues to dominate the SQL Performance Tuning Tools market, accounting for the largest share in 2024. This leadership is driven by the presence of major technology vendors, a mature IT infrastructure, and early adoption of advanced database management solutions. Europe and Asia Pacific are also witnessing rapid growth, fueled by increasing investments in digital transformation, expanding IT services sectors, and the rise of cloud computing. The Asia Pacific region, in particular, is expected to exhibit the highest CAGR during the forecast period, supported by the proliferation of SMEs, growing e-commerce activities, and government initiatives to promote digital innovation. Meanwhile, Latin America and the Middle East & Africa are emerging as promising markets, albeit at a relatively nascent stage, as organizations in these regions modernize their IT landscapes and embrace data-driven strategies.
The SQL Performance Tuning Tools market by component is broadly segmented into software and services.
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.
I am not affiliated or partnered with Kensho in any way; I just really like this dataset because it is easy for my agents to query.
Key Features:
- Contains over 5 million rows of data from English Wikipedia and Wikidata
- Stored in a portable SQLite database format for easy integration and querying
- Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base
- Ideal for NLP tasks, machine learning, data analysis, and research projects
The database consists of four main tables; the query examples below reference pages, items, properties, and link_annotated_text.
This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license.

Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes.

By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.
https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data
Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings
Usage with LIKE queries:

```
import asyncio

import aiosqlite


class KenshoDatasetQuery:
    """Async context manager for LIKE searches over the Kensho SQLite DB."""

    def __init__(self, db_file):
        self.db_file = db_file

    async def __aenter__(self):
        self.conn = await aiosqlite.connect(self.db_file)
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.conn.close()

    async def search_pages_by_title(self, title):
        # Join pages with their Wikidata items and link-annotated text.
        query = """
            SELECT pages.page_id, pages.item_id, pages.title, pages.views,
                   items.labels AS item_labels, items.description AS item_description,
                   link_annotated_text.sections
            FROM pages
            JOIN items ON pages.item_id = items.id
            JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id
            WHERE pages.title LIKE ?
        """
        async with self.conn.execute(query, (f"%{title}%",)) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label_or_description(self, keyword):
        query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ? OR description LIKE ?
        """
        async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor:
            return await cursor.fetchall()

    async def search_items_by_label(self, label):
        query = """
            SELECT id, labels, description
            FROM items
            WHERE labels LIKE ?
        """
        async with self.conn.execute(query, (f"%{label}%",)) as cursor:
            return await cursor.fetchall()

    # async def search_properties_by_label_or_desc...
```
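A hypothetical usage example for the class above; the database file name is an assumption, so substitute the path of your local SQLite file:

```
# Open the database, run a LIKE search, and print the matching rows.
async def main():
    async with KenshoDatasetQuery("kensho_wikipedia.db") as kq:  # hypothetical path
        for row in await kq.search_pages_by_title("relational database"):
            print(row)

asyncio.run(main())
```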
According to our latest research, the global SQL Query Engine market size in 2024 stands at USD 3.84 billion, reflecting robust growth driven by the increasing demand for efficient data management and analytics solutions across industries. The market is projected to expand at a CAGR of 12.1% from 2025 to 2033, reaching an estimated value of USD 10.77 billion by the end of the forecast period. This remarkable growth is underpinned by the escalating volume of structured and unstructured data, the proliferation of cloud-based applications, and the widespread adoption of advanced analytics and business intelligence tools.
One of the primary growth factors driving the SQL Query Engine market is the exponential increase in data generation from digital transformation initiatives, IoT devices, and enterprise applications. Organizations are increasingly relying on SQL query engines to extract actionable insights from vast datasets, enabling informed decision-making and operational efficiency. The integration of SQL engines with big data platforms and cloud environments further amplifies their utility, as businesses seek scalable and high-performance solutions that can seamlessly handle complex queries across distributed data sources. This trend is particularly pronounced in industries such as BFSI, healthcare, and retail, where real-time data analysis is critical for competitive advantage and regulatory compliance.
Another significant driver is the rapid evolution of cloud computing and the migration of enterprise workloads to cloud platforms. Cloud-based SQL query engines offer flexibility, scalability, and cost-effectiveness, making them highly attractive to organizations looking to modernize their IT infrastructure. The ability to run SQL queries on cloud-native data warehouses and integrate with various analytics tools has democratized access to advanced data capabilities, even for small and medium enterprises. Furthermore, innovations in query optimization, parallel processing, and support for hybrid and multi-cloud deployments are fostering greater adoption of SQL query engines across diverse business environments.
The market is also benefiting from the growing emphasis on business intelligence and data-driven decision-making. Enterprises are leveraging SQL query engines to power dashboards, generate real-time reports, and facilitate self-service analytics for non-technical users. Enhanced support for structured query language, improved user interfaces, and integration with visualization tools are making it easier for business users to interact with data, driving broader usage across organizations. Additionally, the rise of data integration and analytics as core business functions is pushing vendors to continuously innovate, offering advanced features such as in-memory processing, machine learning integration, and support for semi-structured data formats.
Regionally, North America continues to dominate the SQL Query Engine market, accounting for the largest revenue share in 2024. This is attributed to the strong presence of technology giants, early adoption of cloud technologies, and a thriving ecosystem of data-driven enterprises. However, Asia Pacific is expected to exhibit the fastest growth during the forecast period, fueled by rapid digitalization, increasing investments in cloud infrastructure, and the emergence of new business models in countries such as China, India, and Japan. Europe, Latin America, and the Middle East & Africa are also witnessing steady growth, supported by regulatory mandates for data governance and the rising importance of analytics in public and private sectors.
The SQL Query Engine market is segmented by component into Software and Services. The software segment commands a substantial share of the market, as enterprises increasingly invest in advanced query engines to enhance their data processing and analytics capabilities. Modern SQL query engine software offers robust features such as distributed query processing.
License: Apache License v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
synthetic_text_to_sql
gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:
- 105,851 records partitioned into 100,000 train and 5,851 test records
- ~23M total tokens, including ~12M SQL tokens
- Coverage across 100 distinct…

See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
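With the Hugging Face `datasets` library installed, a minimal loading sketch (split sizes as described above):

```
# Pull the dataset from the Hub and inspect its splits and columns.
from datasets import load_dataset

ds = load_dataset("gretelai/synthetic_text_to_sql")
print(ds)                        # train (~100,000) and test (~5,851) splits
print(ds["train"].column_names)  # see the dataset page for field details
print(ds["train"][0])
```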
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all “license files” extracted from a snapshot of the Software Heritage archive taken on 2022-04-25. (Other, possibly more recent, versions of the datasets can be found at https://annex.softwareheritage.org/public/dataset/license-blobs/).
In this context, a license file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage as a file whose name is often used to ship licenses in software projects. Some name examples are: COPYING, LICENSE, NOTICE, COPYRIGHT, etc. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that the file name was not expected to be at the project root, because project subdirectories can contain different licenses than the top-level one, and we wanted to include those too.
Format
The dataset is organized as follows:
blobs.tar.zst: a Zst-compressed tarball containing deduplicated license blobs, one per file. The tarball contains 6’859’189 blobs, for a total uncompressed size on disk of 66 GiB.
The blobs are organized in a sharded directory structure that contains files named like blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02, where:
blobs/ is the root directory containing all license blobs
8624bcdae55baeef00cd11d5dfcfa60f68710a02 is the SHA1 checksum of a specific license blob, a copy of the GPL3 license in this case. Each license blob is ultimately named with its SHA1:
$ head -n 3 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007

$ sha1sum blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
8624bcdae55baeef00cd11d5dfcfa60f68710a02 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
86 and 24 are, respectively, the first and second group of two hex digits in the blob SHA1
One blob is missing, because its size (313MB) prevented its inclusion (it was originally a tarball containing source code):
swh:1:cnt:61bf63793c2ee178733b39f8456a796b72dc8bde,1340d4e2da173c92d432026ecdc54b4859fe9911,"AUTHORS"
blobs-sample20k.tar.zst: analogous to blobs.tar.zst, but containing “only” 20’000 randomly selected license blobs
license-blobs.csv.zst a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a license blob and is in the format SWHID,SHA1,NAME, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"
where:
SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the license blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2
SHA1: the blob SHA1, that can be used to cross-reference blobs in the blobs/ directory
NAME: a file name given to the license blob in a given software origin. As the same license blob can have different names in different contexts, the index contains multiple entries for the same blob with different names, as is the case in the example above (yes, one of those has a typo in it, but it's an original typo from some repository!).
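A sketch for scanning the index and locating each blob on disk without decompressing the whole CSV first; it assumes the third-party `zstandard` package, with the file layout and columns taken from the description above:

```
# Stream-decompress the index and derive each blob's sharded path.
import csv
import io

import zstandard

with open("license-blobs.csv.zst", "rb") as fh:
    reader = zstandard.ZstdDecompressor().stream_reader(fh)
    text = io.TextIOWrapper(reader, encoding="utf-8")
    rows = csv.reader(text)
    header = next(rows)  # SWHID,SHA1,NAME
    for swhid, sha1, name in rows:
        # Blobs are sharded by the first two pairs of hex digits of their SHA1.
        path = f"blobs/{sha1[:2]}/{sha1[2:4]}/{sha1}"
        print(swhid, name, path)
        break  # remove to scan the full index
```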
blobs-fileinfo.csv.zst a Zst-compressed CSV mapping from blobs to basic file information in the format: SHA1,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE, where:
SHA1: blob SHA1
MIME_TYPE: blob MIME type, as detected by libmagic
ENCODING: blob character encoding, as detected by libmagic
LINE_COUNT: number of lines in the blob (only for textual blobs with UTF8 encoding)
WORD_COUNT: number of words in the blob (only for textual blobs with UTF8 encoding)
SIZE: blob size in bytes
blobs-scancode.csv.zst a Zst-compressed CSV mapping from blobs to software license detected in them by ScanCode, in the format: SHA1,LICENSE,SCORE, where:
SHA1: blob SHA1
LICENSE: license detected in the blob, as an SPDX identifier (or ScanCode identifier for non-SPDX-indexed licenses)
SCORE: confidence score in the result, as a decimal number between 0 and 100
There may be zero or arbitrarily many lines for each blob.
blobs-scancode.ndjson.zst a Zst-compressed line-delimited JSON, containing a superset of the information in blobs-scancode.csv.zst. Each line is a JSON dictionary with three keys:
sha1: blob SHA1
licenses: output of scancode.api.get_licenses(..., min_score=0)
copyrights: output of scancode.api.get_copyrights(...)
There is exactly one line for each blob. licenses and copyrights keys are omitted for files not detected as plain text.
blobs-origins.csv.zst a Zst-compressed CSV mapping of where license blobs come from. Each line in the index associates a license blob with one of its origins, in the format SWHID URL, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 https://github.com/pombreda/Artemis
Note that a license blob can come from many different places; only an arbitrary (and somewhat random) one is listed in this mapping.
If no origin URL is found in the Software Heritage archive, a blank is used instead. This happens when the origins were either still being loaded when the dataset was generated, or the loader process crashed before completing the blob's origin's ingestion.
blobs-nb-origins.csv.zst a Zst-compressed CSV mapping of how many origins of each blob are known to Software Heritage. Each line in the index associates a license blob with this count, in the format SWHID NUMBER, for example:
swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 2822260
Two blobs are missing because the computation crashed:
swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 swh:1:cnt:8b137891791fe96927ad78e64b0aad7bded08bdc
This issue will be fixed in a future version of the dataset
blobs-earliest.csv.zst a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurrence(s) in the archive. Format: SWHID EARLIEST_SWHID EARLIEST_TS OCCURRENCES, where:
SWHID: blob SWHID
EARLIEST_SWHID: SWHID of the earliest known commit containing the blob
EARLIEST_TS: timestamp of the earliest known commit containing the blob, as a Unix time integer
OCCURRENCES: number of known commits containing the blob
replication-package.tar.gz: code and scripts used to produce the dataset
licenses-annotated-sample.tar.gz: ground truth, i.e., manually annotated random sample of license blobs, with details about the kind of information they contain.
Changes since the 2021-03-23 dataset
More input data, due to the SWH archive growing: more origins in supported forges and package managers; and support for more forges and package managers. See the SWH Archive Changelog for details.
Values in the NAME column of license-blobs.csv.zst are quoted, as some file names now contain commas.
Replication package now contains all the steps needed to reproduce all artefacts including the licenseblobs/fetch.py script.
blobs-nb-origins.csv.zst is added.
blobs-origins.csv.zst is now generated using the first origin returned by swh-graph’s leaves endpoint, instead of its randomwalk endpoint. This should have no impact on the result, other than a different distribution of “random” origins being picked.
blobs-origins.csv.zst was missing ~10% of its results in previous versions of the dataset, due to errors and/or timeouts in its generation, this is now down to 0.02% (1254 of the 6859445 unique blobs). Blobs with no known origins are now present, with a blank instead of URL.
blobs-earliest.csv.zst was missing ~10% of its results in previous versions of the dataset. It is complete now.
blobs-scancode.csv.zst is generated with a newer scancode-toolkit version (31.2.1)
blobs-scancode.ndjson.zst is added.
Errata
A file name .tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 was present in the initial version of the dataset (published on 2022-11-07). It was removed on 2022-11-09 using these two commands:
pv blobs-fileinfo.csv.zst | zstdcat | grep -v ".tmp" | zstd -19
pv blobs.tar.zst | zstdcat | tar --delete blobs/13/40/.tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 | zstd -19 -T12
The total uncompressed size was announced as 84 GiB based on the physical size on ext4, but it is actually 66 GiB.
Citation
If you use this dataset for research purposes, please acknowledge its use by citing one or both of the following papers:
[pdf, bib] Jesús M. González-Barahona, Sergio Raúl Montes León, Gregorio Robles, Stefano Zacchiroli. The software heritage license dataset (2022 edition). Empirical Software Engineering, Volume 28, Number 6, Article number 147 (2023).
[pdf, bib] Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.
References
The dataset has been built using primarily the data sources described in the following papers:
[pdf, bib] Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 25-29 September 2017.
[pdf, bib] Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019.
Errata (v2, 2024-01-09)
licenses-annotated-sample.tar.gz: some comments not intended for publication were removed, and 4
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is a Portuguese-translated version of the b-mc2/sql-create-context dataset, which was constructed from the WikiSQL and Spider datasets. It contains examples of questions in Portuguese, SQL CREATE TABLE statements, and SQL queries that answer the questions using the CREATE TABLE statement as context.
The main goal of this dataset is to assist Portuguese natural language models in generating precise and contextualized SQL queries, preventing the hallucination of column and table names, a common issue in text-to-SQL datasets. By providing only the CREATE TABLE statement as context, the dataset aims to better ground the models without the need to provide actual data rows, limiting token use and exposure to private, sensitive, or proprietary data.
The questions were translated into Portuguese using the facebook/nllb-200-distilled-1.3B model, ensuring that the natural language queries maintain the same meaning and context as the original English questions.
This dataset is ideal for training natural language models for SQL query generation, especially in scenarios where accuracy in naming columns and tables is crucial. It can be used to enhance model performance in text-to-SQL tasks, providing clear context and avoiding common hallucination errors.
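An illustrative sketch of turning one example into a grounded prompt; the field names (question, context, answer) follow the original b-mc2/sql-create-context layout and may differ in the translated release, and the example values are invented:

```
# Invented Portuguese example in the (question, context, answer) layout.
example = {
    "question": "Quantos clientes fizeram pedidos nos últimos 30 dias?",
    "context": "CREATE TABLE pedidos (cliente_id INT, data_pedido DATE)",
    "answer": (
        "SELECT COUNT(DISTINCT cliente_id) FROM pedidos "
        "WHERE data_pedido >= DATE('now', '-30 day')"
    ),
}

# Only the CREATE TABLE statement is given as context, grounding the model
# in real table and column names without exposing any data rows.
prompt = (
    f"Contexto: {example['context']}\n"
    f"Pergunta: {example['question']}\n"
    "SQL:"
)
print(prompt)
print(example["answer"])
```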
@misc{b-mc2_2023_sql-create-context,
title = {sql-create-context Dataset},
author = {b-mc2},
year = {2023},
url = {https://huggingface.co/datasets/b-mc2/sql-create-context},
note = {This dataset was created by modifying data from the following sources: \cite{zhongSeq2SQL2017, yu2018spider}.},
}
@article{zhongSeq2SQL2017,
author = {Victor Zhong and Caiming Xiong and Richard Socher},
title = {Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning},
journal = {CoRR},
volume = {abs/1709.00103},
year = {2017}
}
@article{yu2018spider,
title = {Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task},
author = {Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and others},
journal = {arXiv preprint arXiv:1809.08887},
year = {2018}
}
This dataset contains two tables: creative_stats and removed_creative_stats.

The creative_stats table contains information about advertisers that served ads in the European Economic Area or Turkey: their legal name, verification status, disclosed name, and location. It also includes ad-specific information: impression ranges per region (including aggregate impressions for the European Economic Area), first-shown and last-shown dates, which criteria were used in audience selection, the format of the ad, the ad topic, and whether the ad is funded by the Google Ad Grants program. A link to the ad in the Google Ads Transparency Center is also provided.

The removed_creative_stats table contains information about ads that served in the European Economic Area that Google removed: where and why they were removed, and per-region information on when they served. The removed_creative_stats table also contains a link to the Google Ads Transparency Center for the removed ad. Data for both tables updates periodically and may be delayed from what appears on the Google Ads Transparency Center website.

About BigQuery

This data is hosted in Google BigQuery for users to easily query using SQL. Note that to use BigQuery, users must have a Google account and create a GCP project. This public dataset is included in BigQuery's 1 TB/mo of free tier processing: each user receives 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset.

Download Dataset

This public dataset is also hosted in Google Cloud Storage and available free to use. We provide the raw data in JSON format, sharded across multiple files to support easier download of the large dataset. A README file which describes the data structure and our Terms of Service is included with the dataset. You can also download the results from a custom query.

Signed-out users can download the full dataset by using the gcloud CLI:

To remove the login requirement, run: gcloud config set auth/disable_credentials True
To download the dataset, run: gcloud storage cp gs://ads-transparency-center/* . -R
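For programmatic access, a hedged sketch using the BigQuery Python client library; the fully-qualified table path is an assumption based on the dataset's name, so confirm it in the BigQuery console before running:

```
# Run a small exploratory query against the public table.
from google.cloud import bigquery

client = bigquery.Client()  # requires a Google account and a GCP project

sql = """
    SELECT *
    FROM `bigquery-public-data.google_ads_transparency_center.creative_stats`
    LIMIT 10
"""
for row in client.query(sql).result():
    print(dict(row))
```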
According to our latest research, the global managed Presto services market size reached USD 1.37 billion in 2024, reflecting strong demand for scalable, high-performance data analytics solutions across industries. The market is expected to witness a robust compound annual growth rate (CAGR) of 19.2% from 2025 to 2033, driven by the increasing adoption of cloud-based analytics platforms, growing volumes of enterprise data, and the need for real-time business insights. By 2033, the managed Presto services market is projected to reach USD 5.94 billion, underscoring a transformative shift in how organizations leverage open-source query engines for big data analytics and business intelligence.
Several key growth factors are propelling the managed Presto services market forward. The exponential rise in data generation, fueled by digital transformation, IoT proliferation, and the adoption of advanced analytics, has compelled organizations to seek more efficient, scalable, and cost-effective data processing solutions. Presto, as an open-source distributed SQL query engine, is increasingly favored for its ability to perform fast, interactive analytics on large datasets across diverse data sources. Managed Presto services further enhance this value proposition by providing enterprises with fully managed, optimized, and secure environments, reducing the operational burden on IT teams and accelerating time-to-insight. This shift is particularly pronounced among organizations lacking in-house expertise or resources to manage complex data infrastructure, making managed Presto services an attractive alternative.
Another significant driver is the growing demand for cloud-native analytics solutions. As businesses migrate their data and analytics workloads to the cloud, managed Presto services offer seamless integration with major cloud platforms, ensuring high availability, scalability, and flexibility. The cloud deployment model enables organizations to dynamically scale resources based on demand, optimize costs, and benefit from continuous updates and security enhancements provided by managed service providers. This trend is further amplified by the increasing adoption of hybrid and multi-cloud strategies, as enterprises seek to avoid vendor lock-in and maintain agility in their data operations. The synergy between Presto's federated query capabilities and the cloud's elastic infrastructure is creating new opportunities for innovation and data-driven decision-making.
The managed Presto services market also benefits from the rising importance of real-time analytics and business intelligence in driving competitive advantage. Organizations across industries are leveraging Presto's ability to query data where it resides, whether in data lakes, warehouses, or external sources, to derive actionable insights with minimal latency. Managed services providers are enhancing their offerings with advanced features such as automated scaling, intelligent workload management, integrated security, and comprehensive monitoring, further increasing the appeal of Presto-based solutions. These advancements are enabling enterprises to unlock the full potential of their data assets, improve operational efficiency, and respond swiftly to changing market dynamics.
From a regional perspective, North America currently dominates the managed Presto services market, accounting for the largest share in 2024, followed by Europe and Asia Pacific. The region's leadership is attributed to the early adoption of advanced analytics technologies, a mature cloud ecosystem, and the presence of major technology vendors and hyperscale cloud providers. Europe is witnessing steady growth, driven by increasing investments in digital transformation and stringent data privacy regulations, while Asia Pacific is emerging as a high-growth market due to rapid digitalization, expanding IT infrastructure, and the proliferation of data-driven business models. Latin America and the Middle East & Africa are also expected to register notable growth rates over the forecast period, supported by rising awareness of big data analytics and government-led digital initiatives.
The managed Presto services market is segmented by component into software and services, each playing a pivotal role in the market’s overall value proposition. The software segment encompasses the core Presto query engine together with the management and monitoring software built around it.
The LTAR network maintains stations for standard meteorological measurements including, generally, air temperature and humidity, shortwave (solar) irradiance, longwave (thermal) radiation, wind speed and direction, barometric pressure, and precipitation. Many sites also have extensive comparable legacy datasets. The LTAR scientific community decided that these needed to be made available to the public from a single web source in a consistent manner. To that purpose, each site sent data on a regular schedule, as frequently as hourly, to the National Agricultural Library, which has developed a web service to provide the data to the public in tabular or graphical form. This archive of the LTAR legacy database exports contains meteorological data through April 30, 2021. For current meteorological data, visit the GeoEvent Meteorology Resources page, which provides tools and dashboards to view and access data from the 18 LTAR sites across the United States.

Resources in this dataset:

Resource Title: Meteorological data. File Name: ltar_archive_DB.zip

Resource Description: This is an export of the meteorological data collected by LTAR sites and ingested by the NAL LTAR application. The export consists of an SQL schema definition file for creating database tables and the data itself. The data is provided in two formats: SQL insert statements (.sql) and CSV files (.csv); use whichever format is most convenient. Note that the SQL insert statements take much longer to run, since each row is an individual insert.

Description of zip files:

The ltar_archive_*.zip files contain database exports. The schema is a .sql file; the data is exported as both SQL inserts and CSV for convenience. Each zip contains a README in Markdown and PDF form.

Database exports of the schema and data for the site, site_station, and met tables as SQL insert statements:
ltar_archive_db_sql_export_20201231.zip --> has data until 2020-12-31
ltar_archive_db_sql_export_20210430.zip --> has data until 2021-04-30

Database exports of the schema and data for the site, site_station, and met tables as CSV:
ltar_archive_db_csv_export_20201231.zip --> has data until 2020-12-31
ltar_archive_db_csv_export_20210430.zip --> has data until 2021-04-30

Raw CSV files that were sent to NAL from the LTAR sites/stations:
ltar_rawcsv_archive.zip --> has data until 2021-04-30
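Since the per-row inserts are slow, the CSV export is usually preferable for large loads. As a minimal sketch, assuming a PostgreSQL target, the met table already created from the provided schema file, and a CSV file with a header row (the file path is illustrative):

-- Hedged sketch: bulk-load one exported CSV into the met table in a single
-- pass, which is typically far faster than replaying per-row INSERTs.
-- In psql, \copy can be used instead of COPY to read a client-side file.
COPY met FROM '/path/to/met.csv' WITH (FORMAT csv, HEADER true);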
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking, which subsequently influences Wikipedia content by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles that appear in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, it can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia, and could extend that analysis to disparities in which kinds of external communities use Wikipedia, and how they use it. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine whether homogeneity within the Reddit and Wikipedia audiences shapes topic patterns, and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts

| Column Name | Type | Description |
|---|---|---|
| subreddit_id | TEXT | The unique identifier for the subreddit. |
| crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
| post_id | TEXT | Unique identifier for the Reddit post. |
| created_at | TIMESTAMP | The timestamp when the post was created. |
| updated_at | TIMESTAMP | The timestamp when the post was last updated. |
| language_code | TEXT | The language code of the post. |
| score | INTEGER | The score (upvotes minus downvotes) of the post. |
| upvote_ratio | REAL | The ratio of upvotes to total votes. |
| gildings | INTEGER | Number of awards (gildings) received by the post. |
| num_comments | INTEGER | Number of comments on the post. |
comments

| Column Name | Type | Description |
|---|---|---|
| subreddit_id | TEXT | The unique identifier for the subreddit. |
| post_id | TEXT | The ID of the Reddit post the comment belongs to. |
| parent_id | TEXT | The ID of the parent comment (if a reply). |
| comment_id | TEXT | Unique identifier for the comment. |
| created_at | TIMESTAMP | The timestamp when the comment was created. |
| last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
| score | INTEGER | The score (upvotes minus downvotes) of the comment. |
| upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
| gilded | INTEGER | Number of awards (gildings) received by the comment. |
postlinks

| Column Name | Type | Description |
|---|---|---|
| post_id | TEXT | Unique identifier for the Reddit post. |
| end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
| end_processed_url | TEXT | The extracted URL from the Reddit post. |
| final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
| final_status | INTEGER | HTTP status code of the final URL. |
| final_url | TEXT | The final URL after redirections. |
| redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
| in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |
commentlinks

| Column Name | Type | Description |
|---|---|---|
| comment_id | TEXT | Unique identifier for the Reddit comment. |
| end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
| end_processed_url | TEXT | The extracted URL from the comment. |
| final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
| final_status | INTEGER | HTTP status code of the final URL. |
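Because the dataset ships as an SQL database, cross-platform questions can be answered with ordinary joins over the tables documented above. As a minimal sketch (assuming the SQLite file and the column names listed above; the query is illustrative, not part of the published dataset), the following ranks subreddits by the number of distinct valid Wikipedia links shared in posts:

-- Hedged sketch: top ten subreddits by distinct, valid Wikipedia links
-- in posts. Uses only tables and columns from the schema above.
SELECT
    p.subreddit_id,
    COUNT(DISTINCT l.final_url) AS unique_wikipedia_links
FROM posts AS p
JOIN postlinks AS l USING (post_id)
WHERE l.final_valid = 1
GROUP BY p.subreddit_id
ORDER BY unique_wikipedia_links DESC
LIMIT 10;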
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Metadata of a Large Sonar and Stereo Camera Dataset Suitable for Sonar-to-RGB Image Translation
Introduction
This is a set of metadata describing a large dataset of synchronized sonar and stereo camera recordings that were captured between August 2021 and September 2023 during the project DeeperSense (https://robotik.dfki-bremen.de/en/research/projects/deepersense/), as training data for Sonar-to-RGB image translation. Parts of the sensor data have been published (https://zenodo.org/records/7728089, https://zenodo.org/records/10220989). Due to the size of the sensor data corpus, it is currently impractical to make the entire corpus accessible online. Instead, this metadatabase serves as a relatively compact representation, allowing interested researchers to inspect the data and select relevant portions for their particular use case, which will be made available on demand. This is an effort to comply with the FAIR principle A2 (https://www.go-fair.org/fair-principles/) that metadata shall be accessible even when the base data is not immediately accessible.
Locations and sensors
The sensor data was captured at four different locations, including one laboratory (Maritime Exploration Hall at DFKI RIC Bremen) and three field locations (Chalk Lake Hemmoor, Tank Wash Basin Neu-Ulm, Lake Starnberg). At all locations, a ZED camera and a Blueprint Oculus M1200d sonar were used. Additionally, a SeaVision camera was used at the Maritime Exploration Hall at DFKI RIC Bremen and at the Chalk Lake Hemmoor. The examples/ directory holds a typical output image for each sensor at each available location.
Data volume per session
Six data collection sessions were conducted. The table below presents an overview of the amount of data captured in each session:
| Session dates | Location | Number of datasets | Total duration of datasets [h] | Total logfile size [GB] | Number of images | Total image size [GB] |
|---|---|---|---|---|---|---|
| 2021-08-09 - 2021-08-12 | Maritime Exploration Hall at DFKI RIC Bremen | 52 | 10.8 | 28.8 | 389’047 | 88.1 |
| 2022-02-07 - 2022-02-08 | Maritime Exploration Hall at DFKI RIC Bremen | 35 | 4.4 | 54.1 | 629’626 | 62.3 |
| 2022-04-26 - 2022-04-28 | Chalk Lake Hemmoor | 52 | 8.1 | 133.6 | 1’114’281 | 97.8 |
| 2022-06-28 - 2022-06-29 | Tank Wash Basin Neu-Ulm | 42 | 6.7 | 144.2 | 824’969 | 26.9 |
| 2023-04-26 - 2023-04-27 | Maritime Exploration Hall at DFKI RIC Bremen | 55 | 7.4 | 141.9 | 739’613 | 9.6 |
| 2023-09-01 - 2023-09-02 | Lake Starnberg | 19 | 2.9 | 40.1 | 217’385 | 2.3 |
| Total | | 255 | 40.3 | 542.7 | 3’914’921 | 287.0 |
Data and metadata structure
Sensor data corpus
The sensor data corpus comprises two processing stages:
raw data streams stored in ROS bagfiles (aka logfiles),
camera and sonar images (aka datafiles) extracted from the logfiles.
The files are stored in a file tree hierarchy which groups them by session, dataset, and modality:
${session_key}/
    ${dataset_key}/
        ${logfile_name}
        ${modality_key}/
            ${datafile_name}
A typical logfile path has this form:
2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/stereo_camera-zed-2023-09-02-15-06-07.bag
A typical datafile path has this form:
2023-09_starnberg_lake/2023-09-02-15-06_hydraulic_drill/zed_right/1693660038_368077993.jpg
All directory and file names, and their particles, are designed to serve as identifiers in the metadatabase. Their formatting, as well as the definitions of all terms, are documented in the file entities.json.
Metadatabase
The metadatabase is provided in two equivalent forms:
as a standalone SQLite (https://www.sqlite.org/index.html) database file metadata.sqlite for users familiar with SQLite,
as a collection of CSV files in the csv/ directory for users who prefer other tools.
The database file has been generated from the CSV files, so each database table holds the same information as the corresponding CSV file. In addition, the metadatabase contains a series of convenience views that facilitate access to certain aggregate information.
An entity relationship diagram of the metadatabase tables is stored in the file entity_relationship_diagram.png. Each entity, its attributes, and relations are documented in detail in the file entities.json.
Some general design remarks:
For convenience, timestamps are always given in both a human-readable form (ISO 8601 formatted datetime strings with explicit local time zone), and as seconds since the UNIX epoch (a conversion sketch follows these remarks).
In practice, each logfile always contains a single stream, and each stream is stored always in a single logfile. Per database schema however, the entities stream and logfile are modeled separately, with a “many-streams-to-one-logfile” relationship. This design was chosen to be compatible with, and open for, data collections where a single logfile contains multiple streams.
A modality is not an attribute of a sensor alone, but of a datafile: a sensor is an attribute of a stream, and a single stream may be the source of multiple modalities (e.g. RGB vs. grayscale images from the same camera, or cartesian vs. polar projection of the same sonar output). Conversely, the same modality may originate from different sensors.
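As a small illustration of the dual-timestamp convention, here is a minimal sketch using SQLite's built-in datetime() function; the epoch value is taken from the example datafile name above, and the result is rendered in UTC rather than the local zone stored in the string form:

-- Hedged sketch: convert the epoch-seconds part of the example datafile name
-- 1693660038_368077993.jpg back to a human-readable timestamp (UTC).
SELECT datetime(1693660038, 'unixepoch');
-- Result: 2023-09-02 13:07:18 (15:07 CEST, matching the session start time)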
As a usage example, the data volume per session which is tabulated at the top of this document, can be extracted from the metadatabase with the following SQL query:
SELECT
    PRINTF('%s - %s',
        SUBSTR(session_start, 1, 10),
        SUBSTR(session_end, 1, 10)) AS 'Session dates',
    location_name_english AS Location,
    number_of_datasets AS 'Number of datasets',
    total_duration_of_datasets_h AS 'Total duration of datasets [h]',
    total_logfile_size_gb AS 'Total logfile size [GB]',
    number_of_images AS 'Number of images',
    total_image_size_gb AS 'Total image size [GB]'
FROM location
JOIN session USING (location_id)
JOIN (
    SELECT
        session_id,
        COUNT(dataset_id) AS number_of_datasets,
        ROUND(SUM(dataset_duration) / 3600, 1) AS total_duration_of_datasets_h,
        ROUND(SUM(total_logfile_size) / 10e9, 1) AS total_logfile_size_gb
    FROM location
    JOIN session USING (location_id)
    JOIN dataset USING (session_id)
    JOIN view_dataset_total_logfile_size USING (dataset_id)
    GROUP BY session_id
) USING (session_id)
JOIN (
    SELECT
        session_id,
        COUNT(datafile_id) AS number_of_images,
        ROUND(SUM(datafile_size) / 10e9, 1) AS total_image_size_gb
    FROM session
    JOIN dataset USING (session_id)
    JOIN stream USING (dataset_id)
    JOIN datafile USING (stream_id)
    GROUP BY session_id
) USING (session_id)
ORDER BY session_id;
https://dataintelo.com/privacy-and-policy
According to our latest research, the SQL Query Audit Tools market size reached USD 1.26 billion in 2024, reflecting robust adoption across multiple industries. The market is projected to expand at a CAGR of 13.2% from 2025 to 2033, culminating in a forecasted market value of USD 3.69 billion by 2033. This substantial growth trajectory is primarily driven by the escalating demand for robust database security and compliance solutions in an era marked by increasingly stringent data privacy regulations and an upsurge in cyber threats targeting sensitive business information.
One of the most significant growth factors for the SQL Query Audit Tools market is the rising complexity and volume of enterprise data. Organizations across sectors are generating and handling massive amounts of structured and unstructured data, necessitating advanced auditing mechanisms to ensure data integrity, compliance, and security. The proliferation of digital transformation initiatives, cloud migration, and the adoption of big data analytics have further underscored the need for sophisticated tools capable of auditing SQL queries in real-time. These tools not only help organizations identify suspicious activities and unauthorized access but also play a pivotal role in maintaining regulatory compliance, especially in highly regulated industries such as banking, financial services, and healthcare.
The growing regulatory landscape is another key driver propelling the SQL Query Audit Tools market. Governments and regulatory bodies worldwide have introduced stringent data protection laws such as GDPR, HIPAA, and CCPA, compelling organizations to implement comprehensive audit trails for all database activities. SQL query audit tools offer granular visibility into database transactions, enabling companies to demonstrate compliance and avoid hefty fines associated with non-compliance. Furthermore, as cyberattacks become more sophisticated, organizations are increasingly recognizing the value of proactive monitoring and auditing solutions that can detect anomalies, prevent data breaches, and support forensic investigations in the event of security incidents.
Technological advancements and the integration of artificial intelligence and machine learning into SQL query audit tools are also fueling market expansion. Modern solutions are leveraging AI-driven analytics to automate anomaly detection, streamline compliance reporting, and enhance the accuracy of security alerts. Additionally, the shift towards cloud-based deployments is making these tools more accessible to small and medium enterprises (SMEs), which historically faced barriers due to high upfront costs and resource constraints. The combination of technological innovation, regulatory pressure, and the increasing importance of data governance is expected to sustain the strong growth momentum of the SQL Query Audit Tools market in the coming years.
Regionally, North America currently dominates the SQL Query Audit Tools market, accounting for the largest share in 2024, followed by Europe and the Asia Pacific. The United States, in particular, is witnessing significant adoption driven by the presence of large enterprises, advanced IT infrastructure, and a highly regulated business environment. Europe is also experiencing robust growth, fueled by stringent data protection regulations and increasing investments in cybersecurity solutions. Meanwhile, the Asia Pacific region is poised for the fastest growth over the forecast period, supported by rapid digitalization, expanding IT and telecommunications sectors, and rising awareness about data security among enterprises in emerging economies such as China and India.
The Component segment of the SQL Query Audit Tools market is bifurcated into software and services, each playing a critical role in the overall ecosystem. Software solutions form the backbone of the market, encompassing standalone audit tools, integrated database management platforms, and advanced analytics engines. These software offerings are designed to monitor, log, and analyze SQL queries in real-time, providing detailed audit trails and actionable insights for security, compliance, and performance optimization. The demand for feature-rich, scalable, and user-friendly software is on the rise as organizations seek to automate audit processes and minimize manual intervention.
According to our latest research, the global database performance monitoring market size reached USD 2.47 billion in 2024. The market is experiencing robust expansion, driven by the increasing complexity of database environments and the critical need for real-time data access and analytics. With a compound annual growth rate (CAGR) of 13.2% from 2025 to 2033, the market is projected to reach USD 6.41 billion by 2033. The surge in digital transformation initiatives, cloud migration, and the proliferation of data-intensive applications are among the key factors propelling this growth trajectory.
One of the primary growth drivers for the database performance monitoring market is the exponential rise in data generation across industries. Organizations are increasingly leveraging advanced analytics, artificial intelligence, and machine learning, which require high-performing, reliable databases to deliver actionable insights in real time. As enterprises adopt multi-cloud and hybrid environments, the challenges of managing and monitoring database performance intensify, necessitating sophisticated monitoring solutions. These solutions offer proactive identification and resolution of performance bottlenecks, ensuring business continuity and optimal user experiences. The emphasis on digital agility and operational efficiency further underscores the importance of investing in robust database performance monitoring tools.
Another significant factor contributing to market growth is the evolving regulatory landscape and the need for compliance across sectors such as BFSI, healthcare, and government. Regulatory requirements around data integrity, security, and availability have made database monitoring indispensable for organizations aiming to avoid costly downtime and potential penalties. As cyber threats become more sophisticated, database performance monitoring solutions play a crucial role in detecting anomalies, preventing data breaches, and maintaining compliance with global standards such as GDPR, HIPAA, and PCI DSS. The integration of advanced features like predictive analytics, automated troubleshooting, and real-time alerting further enhances the value proposition of these solutions, making them a vital component of modern IT infrastructure.
The market is also being shaped by the rapid adoption of cloud-based database solutions. As enterprises migrate their workloads to public, private, and hybrid clouds, the need for cloud-native monitoring capabilities becomes paramount. Cloud-based database performance monitoring tools offer scalability, flexibility, and seamless integration with diverse cloud platforms, enabling organizations to manage complex, distributed environments efficiently. The shift towards DevOps and agile development practices has also accelerated the demand for continuous monitoring and performance optimization throughout the application lifecycle. This trend is particularly pronounced among small and medium enterprises, which are leveraging cloud-based solutions to compete with larger players and drive innovation.
Regionally, North America continues to dominate the database performance monitoring market, accounting for the largest market share in 2024. The region's leadership is attributed to the high concentration of technology-driven enterprises, early adoption of advanced IT solutions, and substantial investments in cloud infrastructure. Europe and Asia Pacific are also witnessing significant growth, fueled by increasing digitalization, expanding IT budgets, and the emergence of new business models. In particular, Asia Pacific is expected to exhibit the highest CAGR during the forecast period, driven by rapid economic development, rising internet penetration, and a burgeoning startup ecosystem. The competitive landscape is characterized by the presence of global and regional players, each striving to enhance their offerings through innovation and strategic partnerships.
As organizations strive to optimize their database environments, SQL Performance Tuning Tools have become indispensable. These tools are designed to enhance the efficiency of SQL queries, reduce response times, and improve overall database performance. By analyzing query execution plans and identifying bottlenecks, SQL Performance Tuning Tools enable database administrators to pinpoint inefficient queries and resolve them before they degrade the user experience.
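The execution-plan analysis described above is easiest to see with a concrete example. As a minimal sketch (PostgreSQL syntax; the orders table and the index name are hypothetical, and other engines expose similar EXPLAIN variants):

-- Hedged sketch: inspect the plan and runtime statistics for a slow query.
EXPLAIN ANALYZE
SELECT order_id, total
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC;

-- If the plan shows a sequential scan, a supporting index is a common fix:
CREATE INDEX IF NOT EXISTS idx_orders_customer_created
    ON orders (customer_id, created_at DESC);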