Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Publicly accessible databases often impose query limits or require registration. Even though I maintain public, limit-free APIs, I never wanted to host a public database because I tend to think that connection strings are a problem for the user.
I’ve decided to host several light/medium-sized databases using PostgreSQL, MySQL and SQL Server backends (in strict descending order of preference!).
Why three database backends? There are a ton of small edge cases when moving between DB back ends, so being able to test against live databases is quite valuable. With this resource you can benchmark speed, compression, and DDL types.
Please send me a tweet if you need the connection strings for your lectures or workshops. My Twitter username is @pachamaltese. See the SQL dumps on each section to have the data locally.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
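For illustration only (this example is constructed and not drawn from the dataset), the rewording works as follows: a Spider question that names columns explicitly is rephrased so the column names are no longer mentioned, while the gold SQL query is left untouched.
Spider-style question: "What are the name and age of every student?"
Spider-Realistic-style question: "List every student and how old they are."
Unchanged SQL query: SELECT name, age FROM student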
It contains the following files:
- spider-realistic.json
# The spider-realistic evaluation set
# Examples: 508
# Databases: 19
- dev.json
# The original dev split of Spider
# Examples: 1034
# Databases: 20
- tables.json
# The original DB schemas from Spider
# Databases: 166
- README.txt
- license
The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task." It is a subset of the original dataset with explicit mentions of column names removed. The SQL queries and databases are kept unchanged.
For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
This dataset is distributed under the CC BY-SA 4.0 license.
If you use the dataset, please cite the following papers including the original Spider datasets, Finegan-Dollak et al., 2018 and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.
@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}
@inproceedings{Yu&al.18c,
year = 2018,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
booktitle = {EMNLP},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
}
@InProceedings{P18-1033,
author = "Finegan-Dollak, Catherine
and Kummerfeld, Jonathan K.
and Zhang, Li
and Ramanathan, Karthik
and Sadasivam, Sesh
and Zhang, Rui
and Radev, Dragomir",
title = "Improving Text-to-SQL Evaluation Methodology",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "351--360",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1033"
}
@InProceedings{data-sql-imdb-yelp,
dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
title = {SQLizer: Query Synthesis from Natural Language},
booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
month = {October},
year = {2017},
pages = {63:1--63:26},
url = {http://doi.org/10.1145/3133887},
}
@article{data-academic,
dataset = {Academic},
author = {Fei Li and H. V. Jagadish},
title = {Constructing an Interactive Natural Language Interface for Relational Databases},
journal = {Proceedings of the VLDB Endowment},
volume = {8},
number = {1},
month = {September},
year = {2014},
pages = {73--84},
url = {http://dx.doi.org/10.14778/2735461.2735468},
}
@InProceedings{data-atis-geography-scholar,
dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
title = {Learning a Neural Semantic Parser from User Feedback},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2017},
pages = {963--973},
location = {Vancouver, Canada},
url = {http://www.aclweb.org/anthology/P17-1089},
}
@inproceedings{data-geography-original,
dataset = {Geography, original},
author = {John M. Zelle and Raymond J. Mooney},
title = {Learning to Parse Database Queries Using Inductive Logic Programming},
booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
year = {1996},
pages = {1050--1055},
location = {Portland, Oregon},
url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}
@inproceedings{data-restaurants-logic,
author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
year = {2000},
pages = {133--141},
location = {Hong Kong, China},
url = {http://www.aclweb.org/anthology/W00-1317},
}
@inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
title = {Towards a Theory of Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
year = {2003},
location = {Miami, Florida, USA},
pages = {149--157},
url = {http://doi.acm.org/10.1145/604045.604070},
}
@inproceedings{data-restaurants,
author = {Alessandra Giordani and Alessandro Moschitti},
title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
year = {2012},
location = {Montpellier, France},
pages = {59--76},
url = {https://doi.org/10.1007/978-3-642-45260-4_5},
}
Attribution 2.5 (CC BY 2.5) https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
Search API for looking up addresses and roads within the catchment. The API can search for addresses and roads together, or for either one alone. This dataset is updated weekly from VicMap Roads and Addresses, sourced via www.data.vic.gov.au.
The Search API uses a data.gov.au datastore and allows a user to take full advantage of full-text search functionality.
An sql attribute is passed to the URL to define the query against the API. Please note that the attribute must be URL encoded. The SQL statement takes the form below:
SELECT distinct display, x, y
FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a"
WHERE _full_text @@ to_tsquery(replace('[term]', ' ', ' %26 '))
LIMIT 10
The above will select the top 10 results from the API matching the input 'term', and return the display name as well as an x and y coordinate.
The full URL for the above query would be:
https://data.gov.au/api/3/action/datastore_search_sql?sql=SELECT distinct display, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('[term]', ' ', ' %26 ')) LIMIT 10
Any field in the source dataset can be returned via the API. Display, x and y are used in the example above, but any other field can be returned by altering the select component of the sql statement. See examples below.
Search data sources and LGA can also be used to filter results. When not using a filter, the API defaults to using all records. See examples below.
A filter can be applied to select for a particular source dataset using the 'src' field. The currently available datasets are as follows:
Filters can be applied to select for a specific local government area using the 'lga_code' field. LGA codes are derived from Vicmap LGA datasets. Wimmera LGAs include:
Search for the top 10 addresses and roads with the word 'darlot' in their names:
SELECT distinct display, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('darlot', ' ', ' & ')) LIMIT 10
Search for all roads with the word 'perkins' in their names:
SELECT distinct display, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('perkins', ' ', ' %26 ')) AND src=1
Search for all addresses with the word 'kalimna' in their names, within Horsham Rural City Council:
SELECT distinct display, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('kalimna', ' ', ' %26 ')) AND src=2 and lga_code=332
Search for the top 10 addresses and roads with the word 'green' in their names, returning just their display name, locality, x and y:
SELECT distinct display, locality, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('green', ' ', ' %26 ')) LIMIT 10
Search all addresses in Hindmarsh Shire:
SELECT distinct display, locality, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE lga_code=330
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the appendix of our ICSE 2018 paper "Search-Based Test Data Generation for SQL Queries".
The appendix contains:
- The queries from the three open source systems we used in the evaluation of our tool (the industry software system is not part of this appendix, due to privacy reasons)
- The results of our evaluation
- The source code of the tool (the most recent version can be found at https://github.com/SERG-Delft/evosql)
- The results of the tuning procedure we conducted before running the final evaluation
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
synthetic_text_to_sql
gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:
- 105,851 records partitioned into 100,000 train and 5,851 test records
- ~23M total tokens, including ~12M SQL tokens
- Coverage across 100 distinct…
See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
Objective: To enhance the accuracy of information retrieval from pharmacovigilance (PV) databases by employing Large Language Models (LLMs) to convert natural language queries (NLQs) into Structured Query Language (SQL) queries, leveraging a business context document.
Materials and Methods: We utilized OpenAI's GPT-4 model within a retrieval-augmented generation (RAG) framework, enriched with a business context document, to transform NLQs into executable SQL queries. Each NLQ was presented to the LLM randomly and independently to prevent memorization. The study was conducted in three phases, varying query complexity and assessing the LLM's performance both with and without the business context document.
Results: Our approach significantly improved NLQ-to-SQL accuracy, increasing from 8.3% with the database schema alone to 78.3% with the business context document. This enhancement was consistent across low, medium, and high complexity queries, indicating the critical role of contextual ...
This deposit contains the test set of NLQs used in the paper "Automating Pharmacovigilance Evidence Generation: Using Large Language Models to Produce Context-Aware SQL". Also included are the Python scripts for the LLM processing, the R code for statistical analysis of results, and a copy of the business context document and essential tables.
Automating Pharmacovigilance Evidence Generation: Using Large Language Models to Produce Context-Aware SQL
https://doi.org/10.5061/dryad.2280gb63n
NLQ_Queries.xls contains the set of test NLQs along with the results of the LLM response in each phase of the experiment. Each NLQ also contains the complexity scores computed for each.
The business context document is supplied as a PDF, together with the Python and R code used to generate our results. The essential tables used in Phase 2 and 3 of the experiment are included in the text file.
Description: Contains all NLQ queries with the results of the LLM output and the pass/fail status of each.
Column Definitions:
Below are the column names in order with a detailed description.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset created as part of the Master Thesis "Business Intelligence – Automation of Data Marts modeling and its data processing".
Lucerne University of Applied Sciences and Arts
Master of Science in Applied Information and Data Science (MScIDS)
Autumn Semester 2022
Change log Version 1.1:
The following SQL scripts were added:
Index | Type | Name
---|---|---
1 | View | pg.dictionary_table
2 | View | pg.dictionary_column
3 | View | pg.dictionary_relation
4 | View | pg.accesslayer_table
5 | View | pg.accesslayer_column
6 | View | pg.accesslayer_relation
7 | View | pg.accesslayer_fact_candidate
8 | Stored Procedure | pg.get_fact_candidate
9 | Stored Procedure | pg.get_dimension_candidate
10 | Stored Procedure | pg.get_columns
Scripts are based on Microsoft SQL Server 2017 and are compatible with a data warehouse built with Datavault Builder. The object scripts of the sample data warehouse itself are restricted and cannot be shared.
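As a minimal sketch (assuming read access to the pg schema on the SQL Server 2017 instance described above), the dictionary and access-layer views can be inspected with plain SELECT statements; the parameters of the stored procedures are not documented here, so calls to them are omitted:
-- Tables known to the data dictionary
SELECT * FROM pg.dictionary_table;
-- Relations recorded between those tables
SELECT * FROM pg.dictionary_relation;
-- Fact-table candidates derived for the access layer
SELECT * FROM pg.accesslayer_fact_candidate;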
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
McKinsey's Solve is a gamified problem-solving assessment used globally in the consulting firm’s recruitment process. This dataset simulates assessment results across geographies, education levels, and roles over a 7-year period. It aims to provide deep insights into performance trends, candidate readiness, resume quality, and cognitive task outcomes.
Inspired by McKinsey’s real-world assessment framework, this dataset was designed to enable:
- Exploratory Data Analysis (EDA)
- Recruitment trend analysis
- Gamified performance modelling
- Dashboard development in Excel / Power BI
- Resume and education impact evaluation
- Regional performance benchmarking
- Data storytelling for portfolio projects
Whether you're building dashboards or training models, this dataset offers practical and relatable data for HR analytics and consulting use cases.
This dataset includes 4,000 rows and the following columns:
- Testtaker ID: Unique identifier
- Country / Region: Geographic segmentation
- Gender / Age: Demographics
- Year: Assessment year (2018–2025)
- Highest Level of Education: From high school to PhD / MBA
- School or University Attended: Mapped to country and education level
- First-generation University Student: Yes/No
- Employment Status: Student, Employed, Unemployed
- Role Applied For and Department / Interest: Business/tech disciplines
- Past Test Taker: Indicates repeat attempts
- Prepared with Online Materials: Indicates test prep involvement
- Desired Office Location: Mapped to McKinsey's international offices
- Ecosystem / Redrock / Seawolf (%): Game performance scores
- Time Spent on Each Game (mins)
- Total Product Score: Average of the 3 game scores
- Process Score: A secondary assessment component
- Resume Score: Scored based on education prestige, role fit, and clarity
- Total Assessment Score (%): Final decision metric
- Status (Pass/Fail): Based on total score ≥ 75%
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is derived from the AdventureWorks 2014 test database published by Microsoft, and is designed to simplify and enhance data analysis workflows. The dataset consists of multiple CSV files that have been pre-joined and transformed from the original SQL database, facilitating a smoother analytical experience in Python.
The dataset includes:
* SalesOrderHeader: Integrates the sales header and sales item tables, providing a unified view of sales transactions.
* CustomerMaster: Combines customer names, countries, addresses, and other related information into a single, comprehensive file.
* VendorMaster: Combines vendor names, countries, addresses, and other related information into a single, comprehensive file.
These pre-joined CSVs aim to streamline data analysis, making it more accessible for users working in Python. The dataset can be used to showcase various Python projects or as a foundation for your own analyses.
Feel free to leverage this dataset for your data analysis projects, explore trends, and create visualizations. Whether you're showcasing your own Python projects or conducting independent analyses, this dataset is designed to support a wide range of data science tasks.
For those interested in recreating the CSV files from the SQL database, detailed documentation is included at the bottom of this section. It provides step-by-step instructions on how to replicate the CSVs from the AdventureWorks 2014 database using SQL queries.
SELECT
SalesOrderID
, CAST (OrderDate AS date) AS OrderDate
, CAST (ShipDate AS date) AS ShipDate
, CustomerID
, ShipToAddressID
, BillToAddressID
, SubTotal
, TaxAmt
, Freight
, TotalDue
FROM
Sales.SalesOrderHeader
SELECT
pa.AddressID
, pbea.BusinessEntityID
, pa.AddressLine1
, pa.City
, pa.PostalCode
, psp.[Name] AS ProvinceStateName
, pat.[Name] AS AddressType
, pea.EmailAddress
, ppp.PhoneNumber
, pp.FirstName
, pp.LastName
, sst.CountryRegionCode
, pcr.[Name] AS CountryName
, sst.[Group] AS CountryGroup
FROM
Person.[Address] AS pa
INNER JOIN
Person.BusinessEntityAddress AS pbea ON pa.AddressID = pbea.AddressID
INNER JOIN
Person.StateProvince AS psp ON pa.StateProvinceID = psp.StateProvinceID
INNER JOIN
Person.AddressType AS pat ON pbea.AddressTypeID = pat.AddressTypeID
INNER JOIN
Person.EmailAddress AS pea ON pbea.BusinessEntityID = pea.BusinessEntityID
INNER JOIN
Person.Person AS pp ON pbea.BusinessEntityID = pp.BusinessEntityID
INNER JOIN
Person.PersonPhone AS ppp ON pbea.BusinessEntityID = ppp.BusinessEntityID
INNER JOIN
Sales.SalesTerritory AS sst ON psp.TerritoryID = sst.TerritoryID
INNER JOIN
Person.CountryRegion AS pcr ON sst.CountryRegionCode = pcr.CountryRegionCode;
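The VendorMaster query is not reproduced here; the following is only a sketch of how a comparable extract could be built, assuming the standard AdventureWorks 2014 Purchasing.Vendor table and reusing the address joins from the query above (it is not the author's exact query).
SELECT
pv.BusinessEntityID
, pv.[Name] AS VendorName
, pa.AddressLine1
, pa.City
, pa.PostalCode
, psp.[Name] AS ProvinceStateName
, pcr.[Name] AS CountryName
FROM
Purchasing.Vendor AS pv
INNER JOIN
Person.BusinessEntityAddress AS pbea ON pv.BusinessEntityID = pbea.BusinessEntityID
INNER JOIN
Person.[Address] AS pa ON pbea.AddressID = pa.AddressID
INNER JOIN
Person.StateProvince AS psp ON pa.StateProvinceID = psp.StateProvinceID
INNER JOIN
Person.CountryRegion AS pcr ON psp.CountryRegionCode = pcr.CountryRegionCode;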
https://data.ferndalemi.gov/datasets/565974970d8848f2a80c6eaee4242bbc_2/license.json
This is a collection of layers created by Tian Xie (Intern in DDP) in August 2018. This collection includes Detroit Parcel Data (Parcel_collector), InfoUSA business data (BIZ_INFOUSA), and building data (Building). The building and business data have been edited by Tian during field research and have attached images.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset has SQL injection attacks as malicious flow data. The attacks carried out are SQL injection for Union Query and Blind SQL injection. To perform the attacks, the SQLmap tool has been used.
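For context, and purely as an illustration (the table and column names below are invented, not taken from the dataset), a UNION-based injection exploits unsanitised string concatenation so that attacker-supplied SQL is appended to the application's own query:
-- Application query template: SELECT name, price FROM products WHERE id = '<user input>'
-- Malicious input:            ' UNION SELECT username, password FROM users --
-- Query actually executed by the database:
SELECT name, price FROM products WHERE id = '' UNION SELECT username, password FROM users -- ';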
https://creativecommons.org/publicdomain/zero/1.0/
Dataset Description
This dataset is a collection of customer, product, sales, and location data extracted from a CRM and ERP system for a retail company. It has been cleaned and transformed through various ETL (Extract, Transform, Load) processes to ensure data consistency, accuracy, and completeness. Below is a breakdown of the dataset components:
1. Customer Information (s_crm_cust_info)
This table contains information about customers, including their unique identifiers and demographic details.
Columns:
cst_id: Customer ID (Primary Key)
cst_gndr: Gender
cst_marital_status: Marital status
cst_create_date: Customer account creation date
Cleaning Steps:
Removed duplicates and handled missing or null cst_id values.
Trimmed leading and trailing spaces in cst_gndr and cst_marital_status.
Standardized gender values and identified inconsistencies in marital status.
This table contains information about products, including product identifiers, names, costs, and lifecycle dates.
Columns:
prd_id: Product ID
prd_key: Product key
prd_nm: Product name
prd_cost: Product cost
prd_start_dt: Product start date
prd_end_dt: Product end date
Cleaning Steps:
Checked for duplicates and null values in the prd_key column.
Validated product dates to ensure prd_start_dt is earlier than prd_end_dt.
Corrected product costs to remove invalid entries (e.g., negative values).
This table contains information about sales transactions, including order dates, quantities, prices, and sales amounts.
Columns:
sls_order_dt: Sales order date
sls_due_dt: Sales due date
sls_sales: Total sales amount
sls_quantity: Number of products sold
sls_price: Product unit price
Cleaning Steps:
Validated sales order dates and corrected invalid entries.
Checked for discrepancies where sls_sales did not match sls_price * sls_quantity and corrected them.
Removed null and negative values from sls_sales, sls_quantity, and sls_price.
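A minimal sketch of the integrity check described above (only the column names are given in the description, so the sales table name here is hypothetical):
-- Flag rows where the stored sales amount disagrees with quantity * price,
-- or where any of the three values is missing or non-positive
SELECT *
FROM crm_sales_details  -- hypothetical table name
WHERE sls_sales IS NULL OR sls_quantity IS NULL OR sls_price IS NULL
OR sls_sales <= 0 OR sls_quantity <= 0 OR sls_price <= 0
OR sls_sales <> sls_quantity * sls_price;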
This table contains additional customer demographic data, including gender and birthdate.
Columns:
cid: Customer ID
gen: Gender
bdate: Birthdate
Cleaning Steps:
Checked for missing or null gender values and standardized inconsistent entries.
Removed leading/trailing spaces from gen and bdate.
Validated birthdates to ensure they were within a realistic range.
This table contains country information related to the customers' locations.
Columns:
cntry: Country
Cleaning Steps:
Standardized country names (e.g., "US" and "USA" were mapped to "United States").
Removed special characters (e.g., carriage returns) and trimmed whitespace.
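An illustrative sketch of the standardisation step above (only the cntry column is named in the description, so the location table name here is hypothetical):
SELECT
CASE
WHEN UPPER(TRIM(cntry)) IN ('US', 'USA') THEN 'United States'
ELSE TRIM(cntry)
END AS cntry_clean
FROM erp_loc_a101;  -- hypothetical table name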
This table contains product category information.
Columns:
Product category data (no significant cleaning required).
Key Features:
Customer demographics, including gender and marital status
Product details such as cost, start date, and end date
Sales data with order dates, quantities, and sales amounts
ERP-specific customer and location data
Data Cleaning Process:
This dataset underwent extensive cleaning and validation, including:
Null and Duplicate Removal: Ensuring no duplicate or missing critical data (e.g., customer IDs, product keys).
Date Validations: Ensuring correct date ranges and chronological consistency.
Data Standardization: Standardizing categorical fields (e.g., gender, country names) and fixing inconsistent values.
Sales Integrity Checks: Ensuring sales amounts match the expected product of price and quantity.
This dataset is now ready for analysis and modeling, with clean, consistent, and validated data for retail analytics, customer segmentation, product analysis, and sales forecasting.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit, a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
post_id | TEXT | Unique identifier for the Reddit post. |
created_at | TIMESTAMP | The timestamp when the post was created. |
updated_at | TIMESTAMP | The timestamp when the post was last updated. |
language_code | TEXT | The language code of the post. |
score | INTEGER | The score (upvotes minus downvotes) of the post. |
upvote_ratio | REAL | The ratio of upvotes to total votes. |
gildings | INTEGER | Number of awards (gildings) received by the post. |
num_comments | INTEGER | Number of comments on the post. |
comments
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
post_id | TEXT | The ID of the Reddit post the comment belongs to. |
parent_id | TEXT | The ID of the parent comment (if a reply). |
comment_id | TEXT | Unique identifier for the comment. |
created_at | TIMESTAMP | The timestamp when the comment was created. |
last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
score | INTEGER | The score (upvotes minus downvotes) of the comment. |
upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
gilded | INTEGER | Number of awards (gildings) received by the comment. |
postlinks
Column Name | Type | Description |
---|---|---|
post_id | TEXT | Unique identifier for the Reddit post. |
end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the Reddit post. |
final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
final_url | TEXT | The final URL after redirections. |
redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |
commentlinks
Column Name | Type | Description |
---|---|---|
comment_id | TEXT | Unique identifier for the Reddit comment. |
end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the comment. |
final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
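As a minimal sketch of how the tables above can be combined (assuming the SQL database is queried with an engine such as SQLite; table and column names are taken from the schema above):
-- Subreddits whose posts shared the most Wikipedia links that resolved to valid URLs
SELECT p.subreddit_id,
       COUNT(*) AS n_links
FROM postlinks AS pl
JOIN posts AS p ON p.post_id = pl.post_id
WHERE pl.final_valid = 1
GROUP BY p.subreddit_id
ORDER BY n_links DESC
LIMIT 20;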
This dataset contains sensitive data that has not been disclosed in the online version of the Domestic Electrical Load Survey (DELS) 1994-2014 dataset. In contrast to the DELS dataset, the DELS Secure Data contains partially anonymised survey responses with only the names of respondents and home owners removed. The DELSS contains street and postal addresses, as well as GPS level location data for households from 2000 onwards. The GPS data is obtained through an auxiliary dataset, the Site Reference database. Like the DELS, the DELSS dataset has been retrieved and anonymised from the original SQL database with the python package delretrieve.
The study had national coverage.
Households and individuals
The survey covers electrified households that received electricity either directly from Eskom or from their local municipality. Particular attention was devoted to rural and low income households, as well as surveying households electrified over a range of years, thus having had access to electricity from recent times to several decades.
Sample survey data
See sampling procedure for DELS 1994-2014
Face-to-face [f2f]
This dataset has been produced by extracting only the survey responses from the original NRS Load Research SQL database using the saveAnswers function from the delretrieve python package (https://github.com/wiebket/delretrieve: release v1.0). Full instructions on how to use delretrieve to extract data are in the README file contained in the package.
PARTIAL DE-IDENTIFICATION Partial de-identification was done in the process of extracting the data from the SQL database with the delretrieve package. Only the names of respondents and home owners have been removed from the survey responses by replacing responses with an 'a' in the dataset. Documents with full details of the variables that have been anonymised are included as external resources.
MISSING VALUES Other than partial de-identification no post-processing was done and all database records, including missing values, are stored exactly as retrieved.
See notes on data quality for DELS 1994-2014
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Building’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/e21c9e38-783a-4155-b7e3-cefe8a02136e on 26 January 2022.
--- Dataset description provided by original source is as follows ---
This is a collection of layers created by Tian Xie (Intern in DDP) in August 2018. This collection includes Detroit Parcel Data (Parcel_collector), InfoUSA business data (BIZ_INFOUSA), and building data (Building). The building and business data have been edited by Tian during field research and have attached images.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Parcel collector’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/23f2a097-00e0-44ce-9eb1-c79232471121 on 26 January 2022.
--- Dataset description provided by original source is as follows ---
This is a collection of layers created by Tian Xie (Intern in DDP) in August 2018. This collection includes Detroit Parcel Data (Parcel_collector), InfoUSA business data (BIZ_INFOUSA), and building data (Building). The building and business data have been edited by Tian during field research and have attached images.
--- Original source retains full ownership of the source dataset ---
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset description of the “ICDAR2023 Competition on Detection and Recognition of Greek Letters on Papyri”
Prof. Dr. Isabelle Marthot-Santaniello, Dr. Olga Serbaeva
2024.09.16
Introduction
The present dataset stems from the ICDAR2023 Competition on Detection and Recognition of Greek Letters on Papyri (original links to the competition are provided in the file “1b.CompetitionLinks.”)
The aim of this competition was to investigate the performance of glyph detection and recognition in a very challenging type of historical document: Greek papyri. The detection and recognition of Greek letters on papyri is a preliminary step for computational analysis of handwriting that can lead to major steps forward in our understanding of this important source of information on Antiquity. Such detection and recognition can be done manually by trained papyrologists. It is, however, a time-consuming task that needs to be automated.
We provide here the documents related to two different tasks: localisation and classification. The document images are provided by several institutions and are representative of the diversity of book hands on papyri (a millennium time span, various script styles, provenance, states of preservation, means of digitization and resolution).
How the dataset was constructed
In the frame of the D-Scribes project led by Prof. Dr. Isabelle Marthot-Santaniello (2018-2023), around 150 papyri fragments containing the Iliad were manually annotated at letter level in READ.
The editions were taken, for the most part, from papyri.info, and were simplified, i.e. the accents, editorial marks, and other additional information were removed so as to stay as close as possible to what is found on the papyri. When the text was not available on papyri.info, the relevant passage was extracted from the Perseus edition of Homer's Iliad.
From these 150-plus papyri fragments, 185 surfaces (sides of fragments) belonging to 136 different manuscripts, identified by their Trismegistos numbers (further TMs), were selected to serve as material for the competition. These 185 surfaces were separated into the "training set" and the "test set" provided for the competition as a set of images and corresponding data in JSON format.
Details on the competition summarised in "ICDAR 2023 Competition on Detection and Recognition of Greek Letters on Papyri", by Mathias Seuret, Isabelle Marthot-Santaniello, Stephen A. White, Olga Serbaeva Saraogi, Selaudin Agolli, Guillaume Carrière, Dalia Rodriguez-Salas, and Vincent Christlein; edited by G. A. Fink et al. (Eds.): ICDAR 2023, LNCS 14188, pp. 498–507, 2023. https://doi.org/10.1007/978-3-031-41679-8_29.
After the competition ended, the decision was taken to release the manually annotated data for the "test set" as well. Please find the description of each included document below.
Dataset Structure
"1. CompetitionOverview.xlsx" contains the metadata of the images used, as of 2024.09.19. The structure of the Excel file is as follows:
Excel column | Name | Content | Notes
---|---|---|---
A | TM | Trismegistos number, internationally used for papyri identification | With READ item name in ().
B | Papyri.info link | link | 
C | Fragments' Owning Institution (from papyri.info) | Institution's name | Institution that physically stores the papyri
D | Availability (of metadata, papyri.info) | link | Metadata reuse clarification
E | text ID (READ) | Number from the READ SQL database that was used to link the images and the editions. | Serves to locate the attached images and understand the JSON structure.
F | Test/Training | I.e. whether the image was originally included in the training or in the test set of the dataset. | 
G | Image Name (for orientation) | As in READ | 
H | Cedopal link | link | Contains additional metadata and includes the links to all available online images.
I | License from the Institution webpage | Either license or usage summary. | If no precise licence has been given, a summary of the reuse rights is provided with a link to the regulations in column K.
J | Image URL | link | Not all images are available online. Please contact the owning institution directly if the image is not available.
K | Information on the image usage from the institution | link | In case of any doubt, please contact the owning institution directly.
L | Notes | For the purpose of an easy overview, items with special problems, i.e. images not online or missing links, have been marked in red. | 
2a. "Training file" (containing 150 papyri images separated into 108 texts, plus HomerCompTraining.json). The images are JPGs of papyri containing Homer's Iliad. These were processed in READ: each visible letter on a given papyrus was linked to the edition of the Iliad and, through this process, to its coordinates in pixels on the HTML surface of the image. All of that information is provided in the JSON file.
The JSON file contains the “annotations” (b-boxes of each letter/sign), “categories” (Greek letters), “images” (Image IDs), and “licenses”. The links between image and bboxes is defined via the “id” in the “images” part (for example, "id": 6109). This same id is encoded as “"image_id": 6109” in the “annotations”. Alternatively, “text_id” which can be found in the “images” URL and in the file-names provided here and containing images, can be used for data linking.
Let us now describe the content of each part of the JSON file. Each "annotation" contains an "area", characterised as a "bbox" with coordinates; a "category_id", which allows one to identify which Greek letter in the categories is represented by the number; an "id", which is the unique number of the cliplet, i.e. the area; an "image_id", which links the cliplet to the surface of the image having the same id; "iscrowd" and "seg_id", which are useful for finding the information back in the READ database; and, finally, "tags".
In tags, "BaseType" was used to annotate quality as described below. "FootMarkType" (ft1, etc.) was used for clustering tests, but played no role in the competition. "BaseType" or bt-tags were assigned to the letters to mark the quality of preservation: bt-1: a well-preserved letter that should allow easy identification for both human eyes and computer vision; bt-2: a partially preserved letter that might also have some background damage (holes, additional ink, etc.), but remains readable and has one interpretation; bt-3: letters damaged to such an extent that they cannot be identified without reading an edition (these are treated as traces of ink); bt-4: letters that have some damage, but of a kind that allows multiple interpretations. For example, a missing or defaced horizontal stroke makes alpha indistinguishable from a damaged delta or lambda.
Each “category” contains “id”, this is a number references also in “annotations” and it allows to identify which Greek letter was in the bbox; ”name”, for example, “χ”; and “supercategory”, i.e. “Greek”.
Each "image" contains the following subfields: "bln_id" is an internal READ number of the HTML surface; "date_captured": null is another READ field; "file_name", e.g. "./images/homer2/txt1/P.Corn.Inv.MSS.A.101.XIII.jpg", makes it easy to link image and text (for the image in question, the JPG will be in the folder called "txt1") and is very similar in structure and function to "img_url": "./images/homer2/txt1/P.Corn.Inv.MSS.A.101.XIII.jpg"; each image has a "height" and "width" expressed in pixels. Each image has an "id", and this id is referenced in the "annotations" under "image_id". Finally, each image contains a link to a "license", expressed as a number.
Each "license" entry records the license as it was found at the time of the competition, i.e. in February 2023.
2b. "Test file" contains 34 papyri image sides separated into 31 TMs, plus HomerCompTesting.json. The JSON file here only allows one to connect the images with the "categories", "images", and "licenses"; it contains no "annotations". The structure and logic are otherwise the same as in the "Training" JSON.
2c. "Answers file" contains the "annotations" and other information for the 34 papyri of the "Testing" dataset. The structure and logic are the same as in the "Training" JSON.
"Additional files" contain lists of duplicate segment ids (multiple possible readings or tags): 6 items for "Training", 17 for "Testing" and 15 for "Answers".
"Dataset Description": this same description, included for completeness.
References
The dataset has been reused or mentioned in a number of publications (as of September 2024):
Mohammed, H., Jampour, M. (2024). "From Detection to Modelling: An End-to-End Paleographic System for Analysing Historical Handwriting Styles". In: Sfikas, G., Retsinas, G. (eds) Document Analysis Systems. DAS 2024. Lecture Notes in Computer Science, vol 14994. Springer, Cham, pp. 363–376. https://doi.org/10.1007/978-3-031-70442-0_22
De Gregorio, G., Perrin, S., Pena, R.C.G., Marthot-Santaniello, I., Mouchère, H. (2024). "NeuroPapyri: A Deep Attention Embedding Network for Handwritten Papyri Retrieval". In: Mouchère, H., Zhu, A. (eds) Document Analysis and Recognition – ICDAR 2024 Workshops. ICDAR 2024. Lecture Notes in Computer Science, vol 14936. Springer, Cham, pp. 71–86. https://doi.org/10.1007/978-3-031-70642-4_5
Vu, M. T., Beurton-Aimar, M. "PapyTwin net: a Twin network for Greek letters detection on ancient Papyri". HIP '23: 7th International Workshop on Historical Document Imaging and Processing, San Jose, CA, USA, August 2023.
When police punch, pepper spray or use other force against someone in New Jersey, they are required to fill out a form detailing what happened. NJ Advance Media filed 506 public records requests and received 72,607 forms covering 2012 through 2016. For more data collection details, see our Methodology here. Data cleaning details can be found here.
We then cleaned, analyzed and compiled the data by department to get a better look at what departments were using the most force, what type of force they were using, and who they were using it on. The result, our searchable database, can be found at NJ.com/force. But we wanted to make department-level results — our aggregate data — available in another way to the broader public.
For more details on individual columns, see the data dictionary for UOF_BY_DEPARTMENTS. We have also created sample SQL queries to make it easy for users to quickly find their town or county.
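A purely hypothetical sketch of such a query (the column name below is a placeholder; consult the UOF_BY_DEPARTMENTS data dictionary for the real ones):
SELECT *
FROM UOF_BY_DEPARTMENTS
WHERE LOWER(department_name) LIKE '%newark%';  -- department_name is a placeholder column name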
It's important to note that these forms were self-reported by police officers, sometimes filled out by hand, so even our data cleaning can't totally prevent inaccuracies from cropping up. We've also included comparisons to population data (from the Census) and arrest data (from the FBI Uniform Crime Report), to try to help give context to what you're seeing.
We have included individual incidents on each department page, but we are not publishing the form-level data freely to the public. Not only is that data extremely dirty and difficult to analyze — at least, it took us six months — but it contains private information about subjects of force, including minors and people with mental health issues. However, we are planning to make a version of that file available upon request in the future.
What are rows? What are incidents?
Every time any police officer uses force against a subject, they must fill out a form detailing what happened and what force they used. But sometimes multiple police officers used force against the same subject in the same incident. "Rows" are individual forms officers filled out, "incidents" are unique incidents based on the incident number and date.
What are the odds ratios, and how did you calculate them?
We wanted a simple way of showing readers the disparity between black and white subjects in a particular town. So we used an odds ratio, a statistical method often used in research to compare the odds of one thing happening to another. For population, the calculation was (Number of black subjects/Total black population of area)/(Number of white subjects/Total white population of area). For arrests, the calculation was (Number of black subjects/Total number of black arrests in area)/(Number of white subjects/Total number of white arrests in area). In addition, when we compared anything to arrests, we took out all incidents where the subject was an EDP (emotionally disturbed person).
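Written out, the two odds ratios described above are:
\[
\mathrm{OR}_{\text{population}} =
\frac{\text{black subjects} / \text{black population}}{\text{white subjects} / \text{white population}},
\qquad
\mathrm{OR}_{\text{arrests}} =
\frac{\text{black subjects} / \text{black arrests}}{\text{white subjects} / \text{white arrests}}
\]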
What are the NYC/LA/Chicago warning systems?
Those three departments each look at use of force to flag officers if they show concerning patterns, as way to select those that could merit more training or other action by the department. We compared our data to those three systems to see how many officers would trigger the early warning systems for each. Here are the three systems:
- In New York City, officers are flagged for review if they use higher levels of force — including a baton, Taser or firearm, but not pepper spray — or if anyone was injured or hospitalized. We calculated this number by identifying every officer who met one or more of the criteria.
- In Los Angeles, officers are compared with one another based on 14 variables, including use of force. If an officer ranks significantly higher than peers for any of the variables — technically, 3 standard deviations above the norm — supervisors are automatically notified. We calculated this number conservatively by using only use of force as a variable over the course of a calendar year.
- In Chicago, officers are flagged for review if force results in an injury or hospitalization, or if the officer uses any level of force above punches or kicks. We calculated this number by identifying every officer who met one or more of the criteria.
What are the different levels of force?
Each officer was required to include in the form what type of force they used against a subject. We cleaned and standardized the data to major categories, although officers could write-in a different type of force if they wanted to. Here are the major categories:
- Compliance hold: A compliance hold is a painful maneuver using pressure points to gain control over a suspect. It is the lowest level of force and the most commonly used. But it is often used in conjunction with other types of force.
- Takedown: This technique is used to bring a suspect to the ground and eventually onto their stomach to cuff them. It can be a leg sweep or a tackle.
- Hands/fist: Open hands or closed fist strikes/punches.
- Leg strikes: Leg strikes are any kick or knee used on a subject.
- Baton: Officers are trained to use a baton when punches or kicks are unsuccessful.
- Pepper spray: Police pepper spray, a mist derived from the resin of cayenne pepper, is considered “mechanical force” under state guidelines.
- Deadly force: The firing of an officer's service weapon, regardless of whether a subject was hit. “Warning shots” are prohibited, and officers are instructed not to shoot just to maim or subdue a suspect.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘BIZ INFOUSA’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/ffb9f9b4-8a49-4e30-b6a0-67be780fe82b on 26 January 2022.
--- Dataset description provided by original source is as follows ---
This is a collection of layers created by Tian Xie (Intern in DDP) in August 2018. This collection includes Detroit Parcel Data (Parcel_collector), InfoUSA business data (BIZ_INFOUSA), and building data (Building). The building and business data have been edited by Tian during field research and have attached images.
--- Original source retains full ownership of the source dataset ---
Small subset of the SuperCOSMOS Science Archive, useful for testing queries. The SuperCOSMOS data held in the SSA primarily originate from scans of Palomar and UK Schmidt blue, red and near-IR southern sky surveys. The ESO Schmidt R (dec < -17.5) surveys have also been scanned and provide a 1st epoch red measurement. Further details on the surveys, the scanning process and the raw parameters extracted can be found on the further information link.
The SSA is housed in a relational database running on Microsoft SQL Server 2000. Data are stored in tables which are inter-linked via reference ID numbers. In addition to the astronomical object catalogues, these tables also contain information on the plates that were scanned, survey field centres and calibration coefficients. Most user science queries will only need to access the SOURCE table or, to a lesser extent, the DETECTION table.
Detection table: cone search of detections from all plate measurements in all bands. Source table: cone search of the single band merged source catalog. Access is provided to two applications: a general ADQL query, and an asynchronous cone search where relevant/enabled.
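As an illustrative sketch only (the real SSA column names and units should be checked against the archive schema; the names below are placeholders), an ADQL query against the merged Source table might look like:
SELECT TOP 10 *
FROM Source
WHERE ra BETWEEN 180.0 AND 180.1   -- placeholder column names, assumed to be in degrees
AND dec BETWEEN -30.1 AND -30.0;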