Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Publicly accessible databases often impose query limits or require registration. Even though I maintain public, limit-free APIs, I never wanted to host a public database because I tend to think that connection strings are a problem for the user.
I’ve decided to host several light/medium-sized databases using PostgreSQL, MySQL and SQL Server backends (in strict descending order of preference!).
Why three database backends? There are a ton of small edge cases when moving between DB back ends, so being able to test against live databases is quite valuable. With this resource you can benchmark speed, compression, and DDL types.
Please send me a tweet if you need the connection strings for your lectures or workshops. My Twitter username is @pachamaltese. See the SQL dumps on each section to have the data locally.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
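For illustration only (this example is constructed and not drawn from the dataset), the rewording works as follows: a Spider question that names columns explicitly is rephrased so the column names are no longer mentioned, while the gold SQL query is left untouched.
Spider-style question: "What are the name and age of every student?"
Spider-Realistic-style question: "List every student and how old they are."
Unchanged SQL query: SELECT name, age FROM student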
It contains the following files:
- spider-realistic.json
# The spider-realistic evaluation set
# Examples: 508
# Databases: 19
- dev.json
# The original dev split of Spider
# Examples: 1034
# Databases: 20
- tables.json
# The original DB schemas from Spider
# Databases: 166
- README.txt
- license
The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task." It is a subset of the original dataset with explicit mentions of column names removed. The SQL queries and databases are kept unchanged.
For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
This dataset is distributed under the CC BY-SA 4.0 license.
If you use the dataset, please cite the following papers including the original Spider datasets, Finegan-Dollak et al., 2018 and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.
@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}
@inproceedings{Yu&al.18c,
year = 2018,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
booktitle = {EMNLP},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
}
@InProceedings{P18-1033,
author = "Finegan-Dollak, Catherine
and Kummerfeld, Jonathan K.
and Zhang, Li
and Ramanathan, Karthik
and Sadasivam, Sesh
and Zhang, Rui
and Radev, Dragomir",
title = "Improving Text-to-SQL Evaluation Methodology",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "351--360",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1033"
}
@InProceedings{data-sql-imdb-yelp,
dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
title = {SQLizer: Query Synthesis from Natural Language},
booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
month = {October},
year = {2017},
pages = {63:1--63:26},
url = {http://doi.org/10.1145/3133887},
}
@article{data-academic,
dataset = {Academic},
author = {Fei Li and H. V. Jagadish},
title = {Constructing an Interactive Natural Language Interface for Relational Databases},
journal = {Proceedings of the VLDB Endowment},
volume = {8},
number = {1},
month = {September},
year = {2014},
pages = {73--84},
url = {http://dx.doi.org/10.14778/2735461.2735468},
}
@InProceedings{data-atis-geography-scholar,
dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
title = {Learning a Neural Semantic Parser from User Feedback},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2017},
pages = {963--973},
location = {Vancouver, Canada},
url = {http://www.aclweb.org/anthology/P17-1089},
}
@inproceedings{data-geography-original,
dataset = {Geography, original},
author = {John M. Zelle and Raymond J. Mooney},
title = {Learning to Parse Database Queries Using Inductive Logic Programming},
booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
year = {1996},
pages = {1050--1055},
location = {Portland, Oregon},
url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}
@inproceedings{data-restaurants-logic,
author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
year = {2000},
pages = {133--141},
location = {Hong Kong, China},
url = {http://www.aclweb.org/anthology/W00-1317},
}
@inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
title = {Towards a Theory of Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
year = {2003},
location = {Miami, Florida, USA},
pages = {149--157},
url = {http://doi.acm.org/10.1145/604045.604070},
}
@inproceedings{data-restaurants,
author = {Alessandra Giordani and Alessandro Moschitti},
title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
year = {2012},
location = {Montpellier, France},
pages = {59--76},
url = {https://doi.org/10.1007/978-3-642-45260-4_5},
}
Attribution 2.5 (CC BY 2.5) https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
Search API for looking up addresses and roads within the catchment. The API can search for addresses and roads together, or for either one alone. This dataset is updated weekly from VicMap Roads and Addresses, sourced via www.data.vic.gov.au.
The Search API uses a data.gov.au datastore and allows a user to take full advantage of full-text search functionality.
An sql attribute is passed to the URL to define the query against the API. Please note that the attribute must be URL encoded. The SQL statement takes the form below:
SELECT distinct display, x, y
FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a"
WHERE _full_text @@ to_tsquery(replace('[term]', ' ', ' %26 '))
LIMIT 10
The above will select the top 10 results from the API matching the input 'term', and return the display name as well as an x and y coordinate.
The full URL for the above query would be:
https://data.gov.au/api/3/action/datastore_search_sql?sql=SELECT distinct display, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('[term]', ' ', ' %26 ')) LIMIT 10
Any field in the source dataset can be returned via the API. Display, x and y are used in the example above, but any other field can be returned by altering the select component of the sql statement. See examples below.
Search data sources and LGA can also be used to filter results. When not using a filter, the API defaults to using all records. See examples below.
A filter can be applied to select for a particular source dataset using the 'src' field. The currently available datasets are as follows:
Filters can be applied to select for a specific local government area using the 'lga_code' field. LGA codes are derived from Vicmap LGA datasets. Wimmera LGAs include:
Search for the top 10 addresses and roads with the word 'darlot' in their names:
SELECT distinct display, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('darlot', ' ', ' & ')) LIMIT 10
Search for all roads with the word 'perkins' in their names:
SELECT distinct display, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('perkins', ' ', ' %26 ')) AND src=1
Search for all addresses with the word 'kalimna' in their names, within Horsham Rural City Council:
SELECT distinct display, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('kalimna', ' ', ' %26 ')) AND src=2 and lga_code=332
Search for the top 10 addresses and roads with the word 'green' in their names, returning just their display name, locality, x and y:
SELECT distinct display, locality, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('green', ' ', ' %26 ')) LIMIT 10
Search all addresses in Hindmarsh Shire:
SELECT distinct display, locality, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE lga_code=330
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the appendix of our ICSE 2018 paper "Search-Based Test Data Generation for SQL Queries".
The appendix contains:
- The queries from the three open source systems we used in the evaluation of our tool (the industry software system is not part of this appendix, due to privacy reasons)
- The results of our evaluation
- The source code of the tool (the most recent version can be found at https://github.com/SERG-Delft/evosql)
- The results of the tuning procedure we conducted before running the final evaluation
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
synthetic_text_to_sql
gretelai/synthetic_text_to_sql is a rich dataset of high quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:
- 105,851 records partitioned into 100,000 train and 5,851 test records
- ~23M total tokens, including ~12M SQL tokens
- Coverage across 100 distinct…
See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
Objective: To enhance the accuracy of information retrieval from pharmacovigilance (PV) databases by employing Large Language Models (LLMs) to convert natural language queries (NLQs) into Structured Query Language (SQL) queries, leveraging a business context document.
Materials and Methods: We utilized OpenAI's GPT-4 model within a retrieval-augmented generation (RAG) framework, enriched with a business context document, to transform NLQs into executable SQL queries. Each NLQ was presented to the LLM randomly and independently to prevent memorization. The study was conducted in three phases, varying query complexity and assessing the LLM's performance both with and without the business context document.
Results: Our approach significantly improved NLQ-to-SQL accuracy, increasing from 8.3% with the database schema alone to 78.3% with the business context document. This enhancement was consistent across low, medium, and high complexity queries, indicating the critical role of contextual ...
This deposit contains the test set of NLQs used in the paper "Automating Pharmacovigilance Evidence Generation: Using Large Language Models to Produce Context-Aware SQL". Also included are the Python scripts for the LLM processing, the R code for statistical analysis of results, and a copy of the business context document and essential tables.
Automating Pharmacovigilance Evidence Generation: Using Large Language Models to Produce Context-Aware SQL
https://doi.org/10.5061/dryad.2280gb63n
NLQ_Queries.xls contains the set of test NLQs along with the results of the LLM response in each phase of the experiment. Each NLQ also contains the complexity scores computed for each.
The business context document is supplied as a PDF, together with the Python and R code used to generate our results. The essential tables used in Phase 2 and 3 of the experiment are included in the text file.
Description: Contains all NLQ queries with the results of the LLM output and the pass/fail status of each.
Column Definitions:
Below are the column names in order with a detailed description.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset created as part of the Master Thesis "Business Intelligence – Automation of Data Marts modeling and its data processing".
Lucerne University of Applied Sciences and Arts
Master of Science in Applied Information and Data Science (MScIDS)
Autumn Semester 2022
Change log Version 1.1:
The following SQL scripts were added:
Index | Type | Name
---|---|---
1 | View | pg.dictionary_table
2 | View | pg.dictionary_column
3 | View | pg.dictionary_relation
4 | View | pg.accesslayer_table
5 | View | pg.accesslayer_column
6 | View | pg.accesslayer_relation
7 | View | pg.accesslayer_fact_candidate
8 | Stored Procedure | pg.get_fact_candidate
9 | Stored Procedure | pg.get_dimension_candidate
10 | Stored Procedure | pg.get_columns
Scripts are based on Microsoft SQL Server 2017 and are compatible with a data warehouse built with Datavault Builder. The object scripts of the sample data warehouse itself are restricted and cannot be shared.
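As a minimal sketch (assuming read access to the pg schema on the SQL Server 2017 instance described above), the dictionary and access-layer views can be inspected with plain SELECT statements; the parameters of the stored procedures are not documented here, so calls to them are omitted:
-- Tables known to the data dictionary
SELECT * FROM pg.dictionary_table;
-- Relations recorded between those tables
SELECT * FROM pg.dictionary_relation;
-- Fact-table candidates derived for the access layer
SELECT * FROM pg.accesslayer_fact_candidate;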
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
McKinsey's Solve is a gamified problem-solving assessment used globally in the consulting firm’s recruitment process. This dataset simulates assessment results across geographies, education levels, and roles over a 7-year period. It aims to provide deep insights into performance trends, candidate readiness, resume quality, and cognitive task outcomes.
Inspired by McKinsey’s real-world assessment framework, this dataset was designed to enable:
- Exploratory Data Analysis (EDA)
- Recruitment trend analysis
- Gamified performance modelling
- Dashboard development in Excel / Power BI
- Resume and education impact evaluation
- Regional performance benchmarking
- Data storytelling for portfolio projects
Whether you're building dashboards or training models, this dataset offers practical and relatable data for HR analytics and consulting use cases.
This dataset includes 4,000 rows and the following columns:
- Testtaker ID: Unique identifier
- Country / Region: Geographic segmentation
- Gender / Age: Demographics
- Year: Assessment year (2018–2025)
- Highest Level of Education: From high school to PhD / MBA
- School or University Attended: Mapped to country and education level
- First-generation University Student: Yes/No
- Employment Status: Student, Employed, Unemployed
- Role Applied For and Department / Interest: Business/tech disciplines
- Past Test Taker: Indicates repeat attempts
- Prepared with Online Materials: Indicates test prep involvement
- Desired Office Location: Mapped to McKinsey's international offices
- Ecosystem / Redrock / Seawolf (%): Game performance scores
- Time Spent on Each Game (mins)
- Total Product Score: Average of the 3 game scores
- Process Score: A secondary assessment component
- Resume Score: Scored based on education prestige, role fit, and clarity
- Total Assessment Score (%): Final decision metric
- Status (Pass/Fail): Based on total score ≥ 75%
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is derived from the AdventureWorks 2014 test database published by Microsoft, and is designed to simplify and enhance data analysis workflows. The dataset consists of multiple CSV files that have been pre-joined and transformed from the original SQL database, facilitating a smoother analytical experience in Python.
The dataset includes:
* SalesOrderHeader: Integrates the sales header and sales item tables, providing a unified view of sales transactions.
* CustomerMaster: Combines customer names, countries, addresses, and other related information into a single, comprehensive file.
* VendorMaster: Combines vendor names, countries, addresses, and other related information into a single, comprehensive file.
These pre-joined CSVs aim to streamline data analysis, making it more accessible for users working in Python. The dataset can be used to showcase various Python projects or as a foundation for your own analyses.
Feel free to leverage this dataset for your data analysis projects, explore trends, and create visualizations. Whether you're showcasing your own Python projects or conducting independent analyses, this dataset is designed to support a wide range of data science tasks.
For those interested in recreating the CSV files from the SQL database, detailed documentation is included at the bottom of this section. It provides step-by-step instructions on how to replicate the CSVs from the AdventureWorks 2014 database using SQL queries.
SELECT
SalesOrderID
, CAST (OrderDate AS date) AS OrderDate
, CAST (ShipDate AS date) AS ShipDate
, CustomerID
, ShipToAddressID
, BillToAddressID
, SubTotal
, TaxAmt
, Freight
, TotalDue
FROM
Sales.SalesOrderHeader
SELECT
pa.AddressID
, pbea.BusinessEntityID
, pa.AddressLine1
, pa.City
, pa.PostalCode
, psp.[Name] AS ProvinceStateName
, pat.[Name] AS AddressType
, pea.EmailAddress
, ppp.PhoneNumber
, pp.FirstName
, pp.LastName
, sst.CountryRegionCode
, pcr.[Name] AS CountryName
, sst.[Group] AS CountryGroup
FROM
Person.[Address] AS pa
INNER JOIN
Person.BusinessEntityAddress AS pbea ON pa.AddressID = pbea.AddressID
INNER JOIN
Person.StateProvince AS psp ON pa.StateProvinceID = psp.StateProvinceID
INNER JOIN
Person.AddressType AS pat ON pbea.AddressTypeID = pat.AddressTypeID
INNER JOIN
Person.EmailAddress AS pea ON pbea.BusinessEntityID = pea.BusinessEntityID
INNER JOIN
Person.Person AS pp ON pbea.BusinessEntityID = pp.BusinessEntityID
INNER JOIN
Person.PersonPhone AS ppp ON pbea.BusinessEntityID = ppp.BusinessEntityID
INNER JOIN
Sales.SalesTerritory AS sst ON psp.TerritoryID = sst.TerritoryID
INNER JOIN
Person.CountryRegion AS pcr ON sst.CountryRegionCode = pcr.CountryRegionCode;
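The VendorMaster query is not reproduced here; the following is only a sketch of how a comparable extract could be built, assuming the standard AdventureWorks 2014 Purchasing.Vendor table and reusing the address joins from the query above (it is not the author's exact query).
SELECT
pv.BusinessEntityID
, pv.[Name] AS VendorName
, pa.AddressLine1
, pa.City
, pa.PostalCode
, psp.[Name] AS ProvinceStateName
, pcr.[Name] AS CountryName
FROM
Purchasing.Vendor AS pv
INNER JOIN
Person.BusinessEntityAddress AS pbea ON pv.BusinessEntityID = pbea.BusinessEntityID
INNER JOIN
Person.[Address] AS pa ON pbea.AddressID = pa.AddressID
INNER JOIN
Person.StateProvince AS psp ON pa.StateProvinceID = psp.StateProvinceID
INNER JOIN
Person.CountryRegion AS pcr ON psp.CountryRegionCode = pcr.CountryRegionCode;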
https://data.ferndalemi.gov/datasets/565974970d8848f2a80c6eaee4242bbc_2/license.json
This is a collection of layers created by Tian Xie (Intern in DDP) in August 2018. This collection includes Detroit Parcel Data (Parcel_collector), InfoUSA business data (BIZ_INFOUSA), and building data (Building). The building and business data have been edited by Tian during field research and have attached images.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset has SQL injection attacks as malicious flow data. The attacks carried out are SQL injection for Union Query and Blind SQL injection. To perform the attacks, the SQLmap tool has been used.
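For context, and purely as an illustration (the table and column names below are invented, not taken from the dataset), a UNION-based injection exploits unsanitised string concatenation so that attacker-supplied SQL is appended to the application's own query:
-- Application query template: SELECT name, price FROM products WHERE id = '<user input>'
-- Malicious input:            ' UNION SELECT username, password FROM users --
-- Query actually executed by the database:
SELECT name, price FROM products WHERE id = '' UNION SELECT username, password FROM users -- ';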
https://creativecommons.org/publicdomain/zero/1.0/
Dataset Description
This dataset is a collection of customer, product, sales, and location data extracted from a CRM and ERP system for a retail company. It has been cleaned and transformed through various ETL (Extract, Transform, Load) processes to ensure data consistency, accuracy, and completeness. Below is a breakdown of the dataset components:
1. Customer Information (s_crm_cust_info)
This table contains information about customers, including their unique identifiers and demographic details.
Columns:
cst_id: Customer ID (Primary Key)
cst_gndr: Gender
cst_marital_status: Marital status
cst_create_date: Customer account creation date
Cleaning Steps:
Removed duplicates and handled missing or null cst_id values.
Trimmed leading and trailing spaces in cst_gndr and cst_marital_status.
Standardized gender values and identified inconsistencies in marital status.
This table contains information about products, including product identifiers, names, costs, and lifecycle dates.
Columns:
prd_id: Product ID
prd_key: Product key
prd_nm: Product name
prd_cost: Product cost
prd_start_dt: Product start date
prd_end_dt: Product end date
Cleaning Steps:
Checked for duplicates and null values in the prd_key column.
Validated product dates to ensure prd_start_dt is earlier than prd_end_dt.
Corrected product costs to remove invalid entries (e.g., negative values).
This table contains information about sales transactions, including order dates, quantities, prices, and sales amounts.
Columns:
sls_order_dt: Sales order date
sls_due_dt: Sales due date
sls_sales: Total sales amount
sls_quantity: Number of products sold
sls_price: Product unit price
Cleaning Steps:
Validated sales order dates and corrected invalid entries.
Checked for discrepancies where sls_sales did not match sls_price * sls_quantity and corrected them.
Removed null and negative values from sls_sales, sls_quantity, and sls_price.
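A minimal sketch of the integrity check described above (only the column names are given in the description, so the sales table name here is hypothetical):
-- Flag rows where the stored sales amount disagrees with quantity * price,
-- or where any of the three values is missing or non-positive
SELECT *
FROM crm_sales_details  -- hypothetical table name
WHERE sls_sales IS NULL OR sls_quantity IS NULL OR sls_price IS NULL
OR sls_sales <= 0 OR sls_quantity <= 0 OR sls_price <= 0
OR sls_sales <> sls_quantity * sls_price;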
This table contains additional customer demographic data, including gender and birthdate.
Columns:
cid: Customer ID
gen: Gender
bdate: Birthdate
Cleaning Steps:
Checked for missing or null gender values and standardized inconsistent entries.
Removed leading/trailing spaces from gen and bdate.
Validated birthdates to ensure they were within a realistic range.
This table contains country information related to the customers' locations.
Columns:
cntry: Country
Cleaning Steps:
Standardized country names (e.g., "US" and "USA" were mapped to "United States").
Removed special characters (e.g., carriage returns) and trimmed whitespace.
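An illustrative sketch of the standardisation step above (only the cntry column is named in the description, so the location table name here is hypothetical):
SELECT
CASE
WHEN UPPER(TRIM(cntry)) IN ('US', 'USA') THEN 'United States'
ELSE TRIM(cntry)
END AS cntry_clean
FROM erp_loc_a101;  -- hypothetical table name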
This table contains product category information.
Columns:
Product category data (no significant cleaning required).
Key Features:
Customer demographics, including gender and marital status
Product details such as cost, start date, and end date
Sales data with order dates, quantities, and sales amounts
ERP-specific customer and location data
Data Cleaning Process:
This dataset underwent extensive cleaning and validation, including:
Null and Duplicate Removal: Ensuring no duplicate or missing critical data (e.g., customer IDs, product keys).
Date Validations: Ensuring correct date ranges and chronological consistency.
Data Standardization: Standardizing categorical fields (e.g., gender, country names) and fixing inconsistent values.
Sales Integrity Checks: Ensuring sales amounts match the expected product of price and quantity.
This dataset is now ready for analysis and modeling, with clean, consistent, and validated data for retail analytics, customer segmentation, product analysis, and sales forecasting.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit, a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
post_id | TEXT | Unique identifier for the Reddit post. |
created_at | TIMESTAMP | The timestamp when the post was created. |
updated_at | TIMESTAMP | The timestamp when the post was last updated. |
language_code | TEXT | The language code of the post. |
score | INTEGER | The score (upvotes minus downvotes) of the post. |
upvote_ratio | REAL | The ratio of upvotes to total votes. |
gildings | INTEGER | Number of awards (gildings) received by the post. |
num_comments | INTEGER | Number of comments on the post. |
comments
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
post_id | TEXT | The ID of the Reddit post the comment belongs to. |
parent_id | TEXT | The ID of the parent comment (if a reply). |
comment_id | TEXT | Unique identifier for the comment. |
created_at | TIMESTAMP | The timestamp when the comment was created. |
last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
score | INTEGER | The score (upvotes minus downvotes) of the comment. |
upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
gilded | INTEGER | Number of awards (gildings) received by the comment. |
postlinks
Column Name | Type | Description |
---|---|---|
post_id | TEXT | Unique identifier for the Reddit post. |
end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the Reddit post. |
final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
final_url | TEXT | The final URL after redirections. |
redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |
commentlinks
Column Name | Type | Description |
---|---|---|
comment_id | TEXT | Unique identifier for the Reddit comment. |
end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the comment. |
final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
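As a minimal sketch of how the tables above can be combined (assuming the SQL database is queried with an engine such as SQLite; table and column names are taken from the schema above):
-- Subreddits whose posts shared the most Wikipedia links that resolved to valid URLs
SELECT p.subreddit_id,
       COUNT(*) AS n_links
FROM postlinks AS pl
JOIN posts AS p ON p.post_id = pl.post_id
WHERE pl.final_valid = 1
GROUP BY p.subreddit_id
ORDER BY n_links DESC
LIMIT 20;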
This dataset contains sensitive data that has not been disclosed in the online version of the Domestic Electrical Load Survey (DELS) 1994-2014 dataset. In contrast to the DELS dataset, the DELS Secure Data contains partially anonymised survey responses with only the names of respondents and home owners removed. The DELSS contains street and postal addresses, as well as GPS level location data for households from 2000 onwards. The GPS data is obtained through an auxiliary dataset, the Site Reference database. Like the DELS, the DELSS dataset has been retrieved and anonymised from the original SQL database with the python package delretrieve.
The study had national coverage.
Households and individuals
The survey covers electrified households that received electricity either directly from Eskom or from their local municipality. Particular attention was devoted to rural and low income households, as well as surveying households electrified over a range of years, thus having had access to electricity from recent times to several decades.
Sample survey data
See sampling procedure for DELS 1994-2014
Face-to-face [f2f]
This dataset has been produced by extracting only the survey responses from the original NRS Load Research SQL database using the saveAnswers function from the delretrieve python package (https://github.com/wiebket/delretrieve: release v1.0). Full instructions on how to use delretrieve to extract data are in the README file contained in the package.
PARTIAL DE-IDENTIFICATION Partial de-identification was done in the process of extracting the data from the SQL database with the delretrieve package. Only the names of respondents and home owners have been removed from the survey responses by replacing responses with an 'a' in the dataset. Documents with full details of the variables that have been anonymised are included as external resources.
MISSING VALUES Other than partial de-identification no post-processing was done and all database records, including missing values, are stored exactly as retrieved.
See notes on data quality for DELS 1994-2014
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Building’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/e21c9e38-783a-4155-b7e3-cefe8a02136e on 26 January 2022.
--- Dataset description provided by original source is as follows ---
This is a collection of layers created by Tian Xie (Intern in DDP) in August 2018. This collection includes Detroit Parcel Data (Parcel_collector), InfoUSA business data (BIZ_INFOUSA), and building data (Building). The building and business data have been edited by Tian during field research and have attached images.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Parcel collector’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/23f2a097-00e0-44ce-9eb1-c79232471121 on 26 January 2022.
--- Dataset description provided by original source is as follows ---
This is a collection of layers created by Tian Xie (Intern in DDP) in August 2018. This collection includes Detroit Parcel Data (Parcel_collector), InfoUSA business data (BIZ_INFOUSA), and building data (Building). The building and business data have been edited by Tian during field research and have attached images.
--- Original source retains full ownership of the source dataset ---
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset description of the “ICDAR2023 Competition on Detection and Recognition of Greek Letters on Papyri”
Prof. Dr. Isabelle Marthot-Santaniello, Dr. Olga Serbaeva
2024.09.16
Introduction
The present dataset stems from the ICDAR2023 Competition on Detection and Recognition of Greek Letters on Papyri (original links to the competition are provided in the file “1b.CompetitionLinks.”)
The aim of this competition was to investigate the performance of glyph detection and recognition in a very challenging type of historical document: Greek papyri. The detection and recognition of Greek letters on papyri is a preliminary step for computational analysis of handwriting that can lead to major steps forward in our understanding of this important source of information on Antiquity. Such detection and recognition can be done manually by trained papyrologists. It is, however, a time-consuming task that needs to be automated.
We provide here the documents related to two different tasks: localisation and classification. The document images are provided by several institutions and are representative of the diversity of book hands on papyri (a millennium time span, various script styles, provenance, states of preservation, means of digitization and resolution).
How the dataset was constructed
In the frame of the D-Scribes project led by Prof. Dr. Isabelle Marthot-Santaniello (2018-2023), around 150 papyri fragments containing the Iliad were manually annotated at letter level in READ.
The editions were taken, for the most part, from papyri.info, and were simplified, i.e. the accents, editorial marks, and other additional information were removed so as to stay as close as possible to what is found on the papyri. When the text was not available on papyri.info, the relevant passage was extracted from the Perseus edition of Homer's Iliad.
From these 150-plus papyri fragments, 185 surfaces (sides of fragments) belonging to 136 different manuscripts, identified by their Trismegistos numbers (further TMs), were selected to serve as material for the competition. These 185 surfaces were separated into the "training set" and the "test set" provided for the competition as a set of images and corresponding data in JSON format.
Details on the competition summarised in "ICDAR 2023 Competition on Detection and Recognition of Greek Letters on Papyri", by Mathias Seuret, Isabelle Marthot-Santaniello, Stephen A. White, Olga Serbaeva Saraogi, Selaudin Agolli, Guillaume Carrière, Dalia Rodriguez-Salas, and Vincent Christlein; edited by G. A. Fink et al. (Eds.): ICDAR 2023, LNCS 14188, pp. 498–507, 2023. https://doi.org/10.1007/978-3-031-41679-8_29.
After the competition ended, the decision was taken to release the manually annotated data for the "test set" as well. Please find the description of each included document below.
Dataset Structure
"1. CompetitionOverview.xlsx" contains the metadata of the images used, as of 2024.09.19. The structure of the Excel file is as follows:
Excel column | Name | Content | Notes
---|---|---|---
A | TM | Trismegistos number, internationally used for papyri identification | With READ item name in ().
B | Papyri.info link | link | 
C | Fragments' Owning Institution (from papyri.info) | Institution's name | Institution that physically stores the papyri
D | Availability (of metadata, papyri.info) | link | Metadata reuse clarification
E | text ID (READ) | Number from the READ SQL database that was used to link the images and the editions. | Serves to locate the attached images and understand the JSON structure.
F | Test/Training | I.e. whether the image was originally included in the training or in the test set of the dataset. | 
G | Image Name (for orientation) | As in READ | 
H | Cedopal link | link | Contains additional metadata and includes the links to all available online images.
I | License from the Institution webpage | Either license or usage summary. | If no precise licence has been given, a summary of the reuse rights is provided with a link to the regulations in column K.
J | Image URL | link | Not all images are available online. Please contact the owning institution directly if the image is not available.
K | Information on the image usage from the institution | link | In case of any doubt, please contact the owning institution directly.
L | Notes | For the purpose of an easy overview, items with special problems, i.e. images not online or missing links, have been marked in red. | 
2a. "Training file" (containing 150 papyri images separated into 108 texts, plus HomerCompTraining.json). The images are JPGs of papyri containing Homer's Iliad. These were processed in READ: each visible letter on a given papyrus was linked to the edition of the Iliad and, through this process, to its coordinates in pixels on the HTML surface of the image. All of that information is provided in the JSON file.
The JSON file contains the “annotations” (b-boxes of each letter/sign), “categories” (Greek letters), “images” (Image IDs), and “licenses”. The links between image and bboxes is defined via the “id” in the “images” part (for example, "id": 6109). This same id is encoded as “"image_id": 6109” in the “annotations”. Alternatively, “text_id” which can be found in the “images” URL and in the file-names provided here and containing images, can be used for data linking.
Let us now describe the content of each part of the JSON file. Each "annotation" contains an "area", characterised as a "bbox" with coordinates; a "category_id", which allows one to identify which Greek letter in the categories is represented by the number; an "id", which is the unique number of the cliplet, i.e. the area; an "image_id", which links the cliplet to the surface of the image having the same id; "iscrowd" and "seg_id", which are useful for finding the information back in the READ database; and, finally, "tags".
In tags, "BaseType" was used to annotate quality as described below. "FootMarkType" (ft1, etc.) was used for clustering tests, but played no role in the competition. "BaseType" or bt-tags were assigned to the letters to mark the quality of preservation: bt-1: a well-preserved letter that should allow easy identification for both human eyes and computer vision; bt-2: a partially preserved letter that might also have some background damage (holes, additional ink, etc.), but remains readable and has one interpretation; bt-3: letters damaged to such an extent that they cannot be identified without reading an edition (these are treated as traces of ink); bt-4: letters that have some damage, but of a kind that allows multiple interpretations. For example, a missing or defaced horizontal stroke makes alpha indistinguishable from a damaged delta or lambda.
Each “category” contains “id”, this is a number references also in “annotations” and it allows to identify which Greek letter was in the bbox; ”name”, for example, “χ”; and “supercategory”, i.e. “Greek”.
Each "image" contains the following subfields: "bln_id" is an internal READ number of the HTML surface; "date_captured": null is another READ field; "file_name", e.g. "./images/homer2/txt1/P.Corn.Inv.MSS.A.101.XIII.jpg", makes it easy to link image and text (for the image in question, the JPG will be in the folder called "txt1") and is very similar in structure and function to "img_url": "./images/homer2/txt1/P.Corn.Inv.MSS.A.101.XIII.jpg"; each image has a "height" and "width" expressed in pixels. Each image has an "id", and this id is referenced in the "annotations" under "image_id". Finally, each image contains a link to a "license", expressed as a number.
Each "license" entry records the license as it was found at the time of the competition, i.e. in February 2023.
2b. "Test file" contains 34 papyri image sides separated into 31 TMs, plus HomerCompTesting.json. The JSON file here only allows one to connect the images with the "categories", "images", and "licenses"; it contains no "annotations". The structure and logic are otherwise the same as in the "Training" JSON.
2c. "Answers file" contains the "annotations" and other information for the 34 papyri of the "Testing" dataset. The structure and logic are the same as in the "Training" JSON.
"Additional files" contain lists of duplicate segment ids (multiple possible readings or tags): 6 items for "Training", 17 for "Testing" and 15 for "Answers".
"Dataset Description": this same description, included for completeness.
References
The dataset has been reused or mentioned in a number of publications (as of September 2024):
Mohammed, H., Jampour, M. (2024). "From Detection to Modelling: An End-to-End Paleographic System for Analysing Historical Handwriting Styles". In: Sfikas, G., Retsinas, G. (eds) Document Analysis Systems. DAS 2024. Lecture Notes in Computer Science, vol 14994. Springer, Cham, pp. 363–376. https://doi.org/10.1007/978-3-031-70442-0_22
De Gregorio, G., Perrin, S., Pena, R.C.G., Marthot-Santaniello, I., Mouchère, H. (2024). "NeuroPapyri: A Deep Attention Embedding Network for Handwritten Papyri Retrieval". In: Mouchère, H., Zhu, A. (eds) Document Analysis and Recognition – ICDAR 2024 Workshops. ICDAR 2024. Lecture Notes in Computer Science, vol 14936. Springer, Cham, pp. 71–86. https://doi.org/10.1007/978-3-031-70642-4_5
Vu, M. T., Beurton-Aimar, M. "PapyTwin net: a Twin network for Greek letters detection on ancient Papyri". HIP '23: 7th International Workshop on Historical Document Imaging and Processing, San Jose, CA, USA, August 2023.
When police punch, pepper spray or use other force against someone in New Jersey, they are required to fill out a form detailing what happened. NJ Advance Media filed 506 public records requests and received 72,607 forms covering 2012 through 2016. For more data collection details, see our Methodology here. Data cleaning details can be found here.
We then cleaned, analyzed and compiled the data by department to get a better look at what departments were using the most force, what type of force they were using, and who they were using it on. The result, our searchable database, can be found at NJ.com/force. But we wanted to make department-level results — our aggregate data — available in another way to the broader public.
For more details on individual columns, see the data dictionary for UOF_BY_DEPARTMENTS. We have also created sample SQL queries to make it easy for users to quickly find their town or county.
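A purely hypothetical sketch of such a query (the column name below is a placeholder; consult the UOF_BY_DEPARTMENTS data dictionary for the real ones):
SELECT *
FROM UOF_BY_DEPARTMENTS
WHERE LOWER(department_name) LIKE '%newark%';  -- department_name is a placeholder column name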
It's important to note that these forms were self-reported by police officers, sometimes filled out by hand, so even our data cleaning can't totally prevent inaccuracies from cropping up. We've also included comparisons to population data (from the Census) and arrest data (from the FBI Uniform Crime Report), to try to help give context to what you're seeing.
We have included individual incidents on each department page, but we are not publishing the form-level data freely to the public. Not only is that data extremely dirty and difficult to analyze — at least, it took us six months — but it contains private information about subjects of force, including minors and people with mental health issues. However, we are planning to make a version of that file available upon request in the future.
What are rows? What are incidents?
Every time any police officer uses force against a subject, they must fill out a form detailing what happened and what force they used. But sometimes multiple police officers used force against the same subject in the same incident. "Rows" are individual forms officers filled out, "incidents" are unique incidents based on the incident number and date.
What are the odds ratios, and how did you calculate them?
We wanted a simple way of showing readers the disparity between black and white subjects in a particular town. So we used an odds ratio, a statistical method often used in research to compare the odds of one thing happening to another. For population, the calculation was (Number of black subjects/Total black population of area)/(Number of white subjects/Total white population of area). For arrests, the calculation was (Number of black subjects/Total number of black arrests in area)/(Number of white subjects/Total number of white arrests in area). In addition, when we compared anything to arrests, we took out all incidents where the subject was an EDP (emotionally disturbed person).
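Written out, the two odds ratios described above are:
\[
\mathrm{OR}_{\text{population}} =
\frac{\text{black subjects} / \text{black population}}{\text{white subjects} / \text{white population}},
\qquad
\mathrm{OR}_{\text{arrests}} =
\frac{\text{black subjects} / \text{black arrests}}{\text{white subjects} / \text{white arrests}}
\]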
What are the NYC/LA/Chicago warning systems?
Those three departments each look at use of force to flag officers if they show concerning patterns, as way to select those that could merit more training or other action by the department. We compared our data to those three systems to see how many officers would trigger the early warning systems for each. Here are the three systems:
- In New York City, officers are flagged for review if they use higher levels of force — including a baton, Taser or firearm, but not pepper spray — or if anyone was injured or hospitalized. We calculated this number by identifying every officer who met one or more of the criteria.
- In Los Angeles, officers are compared with one another based on 14 variables, including use of force. If an officer ranks significantly higher than peers for any of the variables — technically, 3 standard deviations above the norm — supervisors are automatically notified. We calculated this number conservatively by using only use of force as a variable over the course of a calendar year.
- In Chicago, officers are flagged for review if force results in an injury or hospitalization, or if the officer uses any level of force above punches or kicks. We calculated this number by identifying every officer who met one or more of the criteria.
What are the different levels of force?
Each officer was required to include in the form what type of force they used against a subject. We cleaned and standardized the data to major categories, although officers could write-in a different type of force if they wanted to. Here are the major categories:
- Compliance hold: A compliance hold is a painful maneuver using pressure points to gain control over a suspect. It is the lowest level of force and the most commonly used. But it is often used in conjunction with other types of force.
- Takedown: This technique is used to bring a suspect to the ground and eventually onto their stomach to cuff them. It can be a leg sweep or a tackle.
- Hands/fist: Open hands or closed fist strikes/punches.
- Leg strikes: Leg strikes are any kick or knee used on a subject.
- Baton: Officers are trained to use a baton when punches or kicks are unsuccessful.
- Pepper spray: Police pepper spray, a mist derived from the resin of cayenne pepper, is considered “mechanical force” under state guidelines.
- Deadly force: The firing of an officer's service weapon, regardless of whether a subject was hit. “Warning shots” are prohibited, and officers are instructed not to shoot just to maim or subdue a suspect.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘BIZ INFOUSA’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/ffb9f9b4-8a49-4e30-b6a0-67be780fe82b on 26 January 2022.
--- Dataset description provided by original source is as follows ---
This is a collection of layers created by Tian Xie (Intern in DDP) in August 2018. This collection includes Detroit Parcel Data (Parcel_collector), InfoUSA business data (BIZ_INFOUSA), and building data (Building). The building and business data have been edited by Tian during field research and have attached images.
--- Original source retains full ownership of the source dataset ---
Small subset of the SuperCOSMOS Science Archive, useful for testing queries. The SuperCOSMOS data held in the SSA primarily originate from scans of Palomar and UK Schmidt blue, red and near-IR southern sky surveys. The ESO Schmidt R (dec < -17.5) surveys have also been scanned and provide a 1st epoch red measurement. Further details on the surveys, the scanning process and the raw parameters extracted can be found on the further information link.
The SSA is housed in a relational database running on Microsoft SQL Server 2000. Data are stored in tables which are inter-linked via reference ID numbers. In addition to the astronomical object catalogues, these tables also contain information on the plates that were scanned, survey field centres and calibration coefficients. Most user science queries will only need to access the SOURCE table or, to a lesser extent, the DETECTION table.
Detection table: cone search of detections from all plate measurements in all bands. Source table: cone search of the single band merged source catalog. Access is provided to two applications: a general ADQL query, and an asynchronous cone search where relevant/enabled.
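As an illustrative sketch only (the real SSA column names and units should be checked against the archive schema; the names below are placeholders), an ADQL query against the merged Source table might look like:
SELECT TOP 10 *
FROM Source
WHERE ra BETWEEN 180.0 AND 180.1   -- placeholder column names, assumed to be in degrees
AND dec BETWEEN -30.1 AND -30.0;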