36 datasets found
  1. SQL Databases for Students and Educators

    • zenodo.org
    • data.niaid.nih.gov
    bin, html
    Updated Oct 28, 2020
Cite
Mauricio Vargas Sepúlveda (2020). SQL Databases for Students and Educators [Dataset]. http://doi.org/10.5281/zenodo.4136985
Explore at:
Available download formats: bin, html
Dataset updated
Oct 28, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mauricio Vargas Sepúlveda
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

    Description

Publicly accessible databases often impose query limits or require registration. Even though I maintain public, limit-free APIs, I never wanted to host a public database because I assumed exposed connection strings would be a problem for users.

I've decided to host several light/medium-sized databases using PostgreSQL, MySQL and SQL Server backends (in strict descending order of preference!).

Why three database backends? Because there are a ton of small edge cases when moving between DB backends, and being able to test against live databases is quite valuable. With this resource you can benchmark speed, compression, and DDL types.

Please send me a tweet if you need the connection strings for your lectures or workshops; my Twitter username is @pachamaltese. See the SQL dumps in each section to get the data locally.
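For example, once you have a connection string, querying one of the hosted backends from Python takes only a few lines (a minimal sketch; the connection string and table name below are placeholders, since the real ones are shared on request):

    # Minimal sketch: query one of the hosted PostgreSQL backends.
    # The connection string and table name are placeholders; request
    # the real connection strings from the author (see above).
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:password@host:5432/dbname")
    df = pd.read_sql_query("SELECT * FROM some_table LIMIT 10", engine)
    print(df.head())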

  2. Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

    • zenodo.org
    bin, json, txt
    Updated Aug 16, 2021
    + more versions
Cite
Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
Explore at:
Available download formats: txt, json, bin
Dataset updated
Aug 16, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

    Description

    This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

    It contains the following files:

    - spider-realistic.json
    # The spider-realistic evaluation set
    # Examples: 508
    # Databases: 19
    - dev.json
    # The original dev split of Spider
    # Examples: 1034
    # Databases: 20
    - tables.json
    # The original DB schemas from Spider
    # Databases: 166
    - README.txt
    - license

The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al., "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mentions of column names removed. The SQL queries and databases are kept unchanged.
    For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
    For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
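For a quick look at the data, the JSON files can be loaded directly. A minimal sketch (field names follow the standard Spider format documented on the GitHub page above):

    # Minimal sketch: inspect the Spider-Realistic evaluation set.
    # Fields (question, query, db_id) follow the standard Spider format.
    import json

    with open("spider-realistic.json") as f:
        examples = json.load(f)

    print(len(examples))  # expected: 508 examples
    for ex in examples[:3]:
        print(ex["db_id"], "|", ex["question"], "->", ex["query"])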

    This dataset is distributed under the CC BY-SA 4.0 license.

If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al., 2018, and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

    @article{deng2020structure,
    title={Structure-Grounded Pretraining for Text-to-SQL},
    author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
    journal={arXiv preprint arXiv:2010.12773},
    year={2020}
    }

    @inproceedings{Yu&al.18c,
    year = 2018,
    title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
    booktitle = {EMNLP},
    author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
    }

    @InProceedings{P18-1033,
    author = "Finegan-Dollak, Catherine
    and Kummerfeld, Jonathan K.
    and Zhang, Li
    and Ramanathan, Karthik
    and Sadasivam, Sesh
    and Zhang, Rui
    and Radev, Dragomir",
    title = "Improving Text-to-SQL Evaluation Methodology",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "351--360",
    location = "Melbourne, Australia",
    url = "http://aclweb.org/anthology/P18-1033"
    }

    @InProceedings{data-sql-imdb-yelp,
    dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
    title = {SQLizer: Query Synthesis from Natural Language},
    booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
    month = {October},
    year = {2017},
    pages = {63:1--63:26},
    url = {http://doi.org/10.1145/3133887},
    }

    @article{data-academic,
    dataset = {Academic},
    author = {Fei Li and H. V. Jagadish},
    title = {Constructing an Interactive Natural Language Interface for Relational Databases},
    journal = {Proceedings of the VLDB Endowment},
    volume = {8},
    number = {1},
    month = {September},
    year = {2014},
    pages = {73--84},
    url = {http://dx.doi.org/10.14778/2735461.2735468},
    }

    @InProceedings{data-atis-geography-scholar,
    dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
    title = {Learning a Neural Semantic Parser from User Feedback},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2017},
    pages = {963--973},
    location = {Vancouver, Canada},
    url = {http://www.aclweb.org/anthology/P17-1089},
    }

@inproceedings{data-geography-original,
    dataset = {Geography, original},
    author = {John M. Zelle and Raymond J. Mooney},
    title = {Learning to Parse Database Queries Using Inductive Logic Programming},
    booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
    year = {1996},
    pages = {1050--1055},
    location = {Portland, Oregon},
    url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
    }

    @inproceedings{data-restaurants-logic,
    author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
    booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
    year = {2000},
    pages = {133--141},
    location = {Hong Kong, China},
    url = {http://www.aclweb.org/anthology/W00-1317},
    }

    @inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
    title = {Towards a Theory of Natural Language Interfaces to Databases},
    booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
    year = {2003},
    location = {Miami, Florida, USA},
    pages = {149--157},
    url = {http://doi.acm.org/10.1145/604045.604070},
    }

    @inproceedings{data-restaurants,
    author = {Alessandra Giordani and Alessandro Moschitti},
    title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
    booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
    year = {2012},
    location = {Montpellier, France},
    pages = {59--76},
    url = {https://doi.org/10.1007/978-3-642-45260-4_5},
    }

3. Wimmera CMA Search API

    • data.gov.au
    • cloud.csiss.gmu.edu
    • +1more
    csv, pdf
    Updated Aug 13, 2023
Cite
Wimmera CMA (2023). Wimmera CMA Search API [Dataset]. https://data.gov.au/data/dataset/wimmera-cma-search-api
Explore at:
Available download formats: pdf (156048), csv (32611124)
Dataset updated
Aug 13, 2023
Dataset authored and provided by
Wimmera CMA
License

Attribution 2.5 (CC BY 2.5): https://creativecommons.org/licenses/by/2.5/
License information was derived automatically

    Description

Search API for looking up addresses and roads within the catchment. The API can search for addresses and roads together, or for either individually. This dataset is updated weekly from VicMap Roads and Addresses, sourced via www.data.vic.gov.au.

    Use

The Search API uses a data.gov.au datastore, which allows a user to take full advantage of full-text search functionality.

An sql parameter is passed in the URL to define the query against the API. Note that the parameter must be URL-encoded; the ' %26 ' in the examples below is simply the URL-encoded form of the ampersand (' & '). The SQL statement takes the form below:

    SELECT distinct display, x, y
    FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a"
    WHERE _full_text @@ to_tsquery(replace('[term]', ' ', ' %26 '))
    LIMIT 10
    

    The above will select the top 10 results from the API matching the input 'term', and return the display name as well as an x and y coordinate.

    The full URL for the above query would be:

https://data.gov.au/api/3/action/datastore_search_sql?sql=SELECT display, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('[term]', ' ', ' %26 ')) LIMIT 10
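In practice it is easiest to let an HTTP library perform the URL encoding. A minimal sketch in Python using the requests library against the same CKAN datastore_search_sql endpoint (the search term 'darlot' is just an example):

    # Minimal sketch: query the Search API via the CKAN
    # datastore_search_sql endpoint; requests URL-encodes the SQL,
    # so a literal '&' can be used instead of '%26'.
    import requests

    sql = """
    SELECT DISTINCT display, x, y
    FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a"
    WHERE _full_text @@ to_tsquery(replace('darlot', ' ', ' & '))
    LIMIT 10
    """

    resp = requests.get(
        "https://data.gov.au/api/3/action/datastore_search_sql",
        params={"sql": sql},
    )
    resp.raise_for_status()
    for record in resp.json()["result"]["records"]:
        print(record["display"], record["x"], record["y"])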
    

    Fields

    Any field in the source dataset can be returned via the API. Display, x and y are used in the example above, but any other field can be returned by altering the select component of the sql statement. See examples below.

    Filters

    Search data sources and LGA can also be used to filter results. When not using a filter, the API defaults to using all records. See examples below.

    Source Dataset

    A filter can be applied to select for a particular source dataset using the 'src' field. The currently available datasets are as follows:

    • 1 for Roads
    • 2 for Address
    • 3 for Localities
    • 4 for Parcels (CREF and SPI)
    • 5 for Localities (Propnum)

    Local Government Area

Filters can be applied to select for a specific local government area using the 'lga_code' field. LGA codes are derived from Vicmap LGA datasets. Wimmera's LGAs include:

    • 332 Horsham Rural City Council
    • 330 Hindmarsh Shire Council
    • 357 Northern Grampians Shire Council
    • 371 West Wimmera Shire Council
    • 378 Yarriambiack Shire Council

    Examples

    Search for the top 10 addresses and roads with the word 'darlot' in their names:

SELECT distinct display, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('darlot', ' ', ' & ')) LIMIT 10
    


    Search for all roads with the word 'perkins' in their names:

    SELECT distinct display, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('perkins', ' ', ' %26 ')) AND src=1
    


    Search for all addresses with the word 'kalimna' in their names, within Horsham Rural City Council:

    SELECT distinct display, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('kalimna', ' ', ' %26 ')) AND src=2 and lga_code=332
    


    Search for the top 10 addresses and roads with the word 'green' in their names, returning just their display name, locality, x and y:

    SELECT distinct display, locality, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE _full_text @@ to_tsquery(replace('green', ' ', ' %26 ')) LIMIT 10
    


    Search all addresses in Hindmarsh Shire:

    SELECT distinct display, locality, x, y FROM "4bf30358-6dc6-412c-91ee-a6f15aaee62a" WHERE lga_code=330
    


4. Search-Based Test Data Generation for SQL Queries: Appendix

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
Cite
Maurício Aniche (2020). Search-Based Test Data Generation for SQL Queries: Appendix [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1166022
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Arie van Deursen
    Jeroen Castelein
    Mozhan Soltani
    Annibale Panichella
    Maurício Aniche
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

The appendix of our ICSE 2018 paper "Search-Based Test Data Generation for SQL Queries".

The appendix contains:

- The queries from the three open source systems we used in the evaluation of our tool (the industrial software system is not part of this appendix, for privacy reasons).
- The results of our evaluation.
- The source code of the tool. The most recent version can be found at https://github.com/SERG-Delft/evosql.
- The results of the tuning procedure we conducted before running the final evaluation.

5. synthetic_text_to_sql

    • huggingface.co
Cite
Gretel.ai, synthetic_text_to_sql [Dataset]. https://huggingface.co/datasets/gretelai/synthetic_text_to_sql
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset provided by
Gretel.ai
License

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

    Description


gretelai/synthetic_text_to_sql is a rich dataset of high-quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. Please see our release blogpost for more details. The dataset includes:

- 105,851 records partitioned into 100,000 train and 5,851 test records
- ~23M total tokens, including ~12M SQL tokens
- Coverage across 100 distinct… See the full description on the dataset page: https://huggingface.co/datasets/gretelai/synthetic_text_to_sql.
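The dataset can be pulled straight from the Hub with the datasets library (a minimal sketch; the split names follow the train/test partition described above):

    # Minimal sketch: load the dataset from the Hugging Face Hub.
    from datasets import load_dataset

    ds = load_dataset("gretelai/synthetic_text_to_sql")
    print(ds)              # expected splits: train (100,000) / test (5,851)
    print(ds["train"][0])  # one synthetic text-to-SQL record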

6. Data from: Automating pharmacovigilance evidence generation: Using large...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Feb 4, 2025
Cite
Jeffery Painter; Venkateswara Chalamalasetti; Raymond Kassekert; Andrew Bate (2025). Automating pharmacovigilance evidence generation: Using large language models to produce context-aware SQL [Dataset]. http://doi.org/10.5061/dryad.2280gb63n
    Explore at:
    Dataset updated
    Feb 4, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Jeffery Painter; Venkateswara Chalamalasetti; Raymond Kassekert; Andrew Bate
    Description

Objective: To enhance the accuracy of information retrieval from pharmacovigilance (PV) databases by employing Large Language Models (LLMs) to convert natural language queries (NLQs) into Structured Query Language (SQL) queries, leveraging a business context document.

Materials and Methods: We utilized OpenAI's GPT-4 model within a retrieval-augmented generation (RAG) framework, enriched with a business context document, to transform NLQs into executable SQL queries. Each NLQ was presented to the LLM randomly and independently to prevent memorization. The study was conducted in three phases, varying query complexity, and assessing the LLM's performance both with and without the business context document.

Results: Our approach significantly improved NLQ-to-SQL accuracy, increasing from 8.3% with the database schema alone to 78.3% with the business context document. This enhancement was consistent across low, medium, and high complexity queries, indicating the critical role of contextual ...

The dataset comprises the test set of NLQs used in the paper "Automating Pharmacovigilance Evidence Generation: Using Large Language Models to Produce Context-Aware SQL". Also included are the Python scripts for the LLM processing, the R code for statistical analysis of results, and a copy of the business context document and essential tables.

Automating Pharmacovigilance Evidence Generation: Using Large Language Models to Produce Context-Aware SQL

    https://doi.org/10.5061/dryad.2280gb63n

    Description of the data and file structure

    NLQ_Queries.xls contains the set of test NLQs along with the results of the LLM response in each phase of the experiment. Each NLQ also contains the complexity scores computed for each.

    The business context document is supplied as a PDF, together with the Python and R code used to generate our results. The essential tables used in Phase 2 and 3 of the experiment are included in the text file.
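The core loop the paper describes can be sketched as follows (a hedged illustration, not the authors' released scripts, which ship with the dataset; the model name and file paths are assumptions):

    # Hedged sketch of the NLQ-to-SQL setup described above, not the
    # authors' released code. File paths and model name are assumptions.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    schema = open("essential_tables.txt").read()            # hypothetical path
    business_context = open("business_context.txt").read()  # hypothetical path

    def nlq_to_sql(nlq: str) -> str:
        prompt = (
            "Translate the question into SQL for a pharmacovigilance database.\n\n"
            f"Database schema:\n{schema}\n\n"
            f"Business context:\n{business_context}\n\n"
            f"Question: {nlq}\nSQL:"
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    print(nlq_to_sql("How many serious adverse events were reported in 2023?"))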

    Files and variables

    File: NLQ_Queries.xlsx

Description: Contains all NLQ queries with the results of the LLM output and the pass/fail status of each.

    Column Definitions:

    Below are the column names in order with a detailed description.

    1. User NLQ: Plain text database query
2. Phase_1: Pass or Fail status indicator "Pass, Partial, or Fa...
7. Sample Dataset - HR Subject Areas

    • data.niaid.nih.gov
    Updated Jan 18, 2023
Cite
Weber, Marc (2023). Sample Dataset - HR Subject Areas [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7447111
    Explore at:
    Dataset updated
    Jan 18, 2023
    Dataset authored and provided by
    Weber, Marc
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset created as part of the Master Thesis "Business Intelligence – Automation of Data Marts modeling and its data processing".

    Lucerne University of Applied Sciences and Arts

    Master of Science in Applied Information and Data Science (MScIDS)

    Autumn Semester 2022

    Change log Version 1.1:

    The following SQL scripts were added:

Index  Type              Name
1      View              pg.dictionary_table
2      View              pg.dictionary_column
3      View              pg.dictionary_relation
4      View              pg.accesslayer_table
5      View              pg.accesslayer_column
6      View              pg.accesslayer_relation
7      View              pg.accesslayer_fact_candidate
8      Stored Procedure  pg.get_fact_candidate
9      Stored Procedure  pg.get_dimension_candidate
10     Stored Procedure  pg.get_columns
Scripts are based on Microsoft SQL Server 2017 and are compatible with a data warehouse built with Datavault Builder. The object scripts of the sample data warehouse itself are restricted and cannot be shared.
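Against a compatible SQL Server instance, the added views can be inspected directly (a minimal sketch via pyodbc; connection details are placeholders, and the view columns are not documented here):

    # Minimal sketch: peek at one of the added dictionary views on
    # SQL Server 2017. Connection details are placeholders.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=sampledw;UID=user;PWD=password"
    )
    for row in conn.cursor().execute("SELECT TOP 5 * FROM pg.dictionary_table"):
        print(row)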

  8. McKinsey Solve Assessment Data (2018–2025)

    • kaggle.com
    Updated May 7, 2025
Cite
Oluwademilade Adeniyi (2025). McKinsey Solve Assessment Data (2018–2025) [Dataset]. http://doi.org/10.34740/kaggle/dsv/11720554
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 7, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Oluwademilade Adeniyi
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

    Description

    McKinsey Solve Global Assessment Dataset (2018–2025)

    🧠 Context

    McKinsey's Solve is a gamified problem-solving assessment used globally in the consulting firm’s recruitment process. This dataset simulates assessment results across geographies, education levels, and roles over a 7-year period. It aims to provide deep insights into performance trends, candidate readiness, resume quality, and cognitive task outcomes.

    📌 Inspiration & Purpose

Inspired by McKinsey's real-world assessment framework, this dataset was designed to enable:

- Exploratory Data Analysis (EDA)
- Recruitment trend analysis
- Gamified performance modelling
- Dashboard development in Excel / Power BI
- Resume and education impact evaluation
- Regional performance benchmarking
- Data storytelling for portfolio projects

    Whether you're building dashboards or training models, this dataset offers practical and relatable data for HR analytics and consulting use cases.

    🔍 Dataset Source

• Data generated by Oluwademilade Adeniyi (Demibolt) with the assistance of ChatGPT by OpenAI
• Structure and logic inspired by McKinsey's public-facing Solve information, including role categories, game types (Ecosystem, Redrock, Seawolf), education tiers, and global office locations
• The entire dataset is synthetic and designed for analytical learning, ethical use, and professional development

    🧾 Dataset Structure

This dataset includes 4,000 rows and the following columns:

- Testtaker ID: Unique identifier
- Country / Region: Geographic segmentation
- Gender / Age: Demographics
- Year: Assessment year (2018–2025)
- Highest Level of Education: From high school to PhD / MBA
- School or University Attended: Mapped to country and education level
- First-generation University Student: Yes/No
- Employment Status: Student, Employed, Unemployed
- Role Applied For and Department / Interest: Business/tech disciplines
- Past Test Taker: Indicates repeat attempts
- Prepared with Online Materials: Indicates test prep involvement
- Desired Office Location: Mapped to McKinsey's international offices
- Ecosystem / Redrock / Seawolf (%): Game performance scores
- Time Spent on Each Game (mins)
- Total Product Score: Average of the 3 game scores
- Process Score: A secondary assessment component
- Resume Score: Scored based on education prestige, role fit, and clarity
- Total Assessment Score (%): Final decision metric
- Status (Pass/Fail): Based on total score ≥ 75%

    ✅ Why Use This Dataset

    • Benchmark educational and regional trends in global assessments
    • Build KPI cards, donut charts, histograms, or speedometer visuals
    • Train pass/fail classifiers or regression models
    • Segment job applicants by role, location, or game behaviour
    • Showcase portfolio skills across Excel, SQL, Power BI, Python, or R
    • Test dashboards or predictive logic in a business-relevant scenario
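For instance, regional pass rates can be benchmarked in a few lines of pandas (a minimal sketch; the file name and exact column spellings are assumptions based on the structure described above):

    # Minimal sketch: pass rate by region. File name and column
    # spellings are assumptions based on the dataset description.
    import pandas as pd

    df = pd.read_csv("mckinsey_solve_assessment.csv")
    pass_rate = (
        df.assign(passed=df["Status (Pass/Fail)"].eq("Pass"))
          .groupby("Region")["passed"]
          .mean()
          .sort_values(ascending=False)
    )
    print(pass_rate.round(3))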

    💡 Credit & Collaboration

    • Data Creator: Oluwademilade Adeniyi (Me) (LinkedIn, Twitter, GitHub, Medium)
    • Collaborator: ChatGPT by OpenAI
    • Inspired by: McKinsey & Company’s Solve Assessment
  9. AdventureWorks-2014

    • kaggle.com
    Updated Aug 18, 2024
Cite
Patrick McKown (2024). AdventureWorks-2014 [Dataset]. https://www.kaggle.com/datasets/duckduckboot/adventureworks-2014
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Aug 18, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Patrick McKown
License

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically

    Description

    About This Dataset

    This dataset is derived from the AdventureWorks 2014 test database published by Microsoft, and is designed to simplify and enhance data analysis workflows. The dataset consists of multiple CSV files that have been pre-joined and transformed from the original SQL database, facilitating a smoother analytical experience in Python.

    Dataset Composition

The dataset includes:

* SalesOrderHeader: Integrates the sales header and sales item tables, providing a unified view of sales transactions.
* CustomerMaster: Combines customer names, countries, addresses, and other related information into a single, comprehensive file.
* VendorMaster: Combines vendor names, countries, addresses, and other related information into a single, comprehensive file.

    These pre-joined CSVs aim to streamline data analysis, making it more accessible for users working in Python. The dataset can be used to showcase various Python projects or as a foundation for your own analyses.

    Usage

    Feel free to leverage this dataset for your data analysis projects, explore trends, and create visualizations. Whether you're showcasing your own Python projects or conducting independent analyses, this dataset is designed to support a wide range of data science tasks.

    Documentation

    For those interested in recreating the CSV files from the SQL database, detailed documentation is included at the bottom of this section. It provides step-by-step instructions on how to replicate the CSVs from the AdventureWorks 2014 database using SQL queries.

    AdventureWorks_SalesOrderHeader

    SELECT
      SalesOrderID
      , CAST (OrderDate AS date) AS OrderDate
      , CAST (ShipDate AS date) AS ShipDate
      , CustomerID
      , ShipToAddressID
      , BillToAddressID
      , SubTotal
      , TaxAmt
      , Freight
      , TotalDue
    FROM
      Sales.SalesOrderHeader
    

    AdventureWorks_CustomerMaster

    SELECT
      pa.AddressID
      , pbea.BusinessEntityID
      , pa.AddressLine1
      , pa.City
      , pa.PostalCode
      , psp.[Name] AS ProvinceStateName
      , pat.[Name] AS AddressType
      , pea.EmailAddress
      , ppp.PhoneNumber
      , pp.FirstName
      , pp.LastName
      , sst.CountryRegionCode
      , pcr.[Name] AS CountryName
      , sst.[Group] AS CountryGroup
    FROM 
      Person.[Address] AS pa
    INNER JOIN
      Person.BusinessEntityAddress AS pbea ON pa.AddressID = pbea.AddressID
    INNER JOIN
      Person.StateProvince AS psp ON pa.StateProvinceID = psp.StateProvinceID
    INNER JOIN
      Person.AddressType AS pat ON pbea.AddressTypeID = pat.AddressTypeID 
    INNER JOIN
      Person.EmailAddress AS pea ON pbea.BusinessEntityID = pea.BusinessEntityID
    INNER JOIN
      Person.Person AS pp ON pbea.BusinessEntityID = pp.BusinessEntityID
    INNER JOIN
      Person.PersonPhone AS ppp ON pbea.BusinessEntityID = ppp.BusinessEntityID
    INNER JOIN
      Sales.SalesTerritory AS sst ON psp.TerritoryID = sst.TerritoryID
    INNER JOIN
      Person.CountryRegion AS pcr ON sst.CountryRegionCode = pcr.CountryRegionCode;
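Once exported, the pre-joined CSVs can be combined directly in pandas (a minimal sketch; the file names are assumptions, while the column names come from the queries above):

    # Minimal sketch: join sales orders to customer details in pandas.
    # File names are assumptions; columns match the SQL above.
    import pandas as pd

    orders = pd.read_csv("AdventureWorks_SalesOrderHeader.csv",
                         parse_dates=["OrderDate", "ShipDate"])
    customers = pd.read_csv("AdventureWorks_CustomerMaster.csv")

    sales = orders.merge(customers, left_on="BillToAddressID",
                         right_on="AddressID", how="left")
    print(sales.groupby("CountryName")["TotalDue"].sum()
               .sort_values(ascending=False).head())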
    
10. Parcel collector

    • cloud.csiss.gmu.edu
    • detroitdata.org
    • +3more
    csv, esri rest +4
    Updated Sep 21, 2018
    + more versions
Cite
United States (2018). Parcel collector [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/parcel-collector
Explore at:
Available download formats: csv, esri rest, kml, html, geojson, zip
Dataset updated
Sep 21, 2018
Dataset provided by
United States
License

https://data.ferndalemi.gov/datasets/565974970d8848f2a80c6eaee4242bbc_2/license.json

    Description

This is a collection of layers created by Tian Xie (intern in DDP) in August 2018. This collection includes Detroit parcel data (Parcel_collector), InfoUSA business data (BIZ_INFOUSA), and building data (Building). The building and business data have been edited by Tian during field research and have images attached.

The original sources for these layers are:
1. Business Data: InfoUSA business database purchased by DDP in 2017
2. Building Data: Detroit Building Footprint data
3. Parcel Data: from the Detroit Open Data Portal, downloaded in May 2018.
During field research by Tian, some fields were added and some building and business records were edited.
1. For business data, Tian confirmed most of the publicly accessible businesses and deleted those which do not exist. Tian also added new businesses to the data if they were missing from the record.
2. For building data, Tian recorded the total business space for each building, non-empty business space, occupancy status, and parking adjacency status, and took pictures of every building in downtown Detroit.
Detailed field metadata:
    InfoUSA Business
    • OBJECTID_1
    • COMPANY_NA: company name
    • ADDRESS: company address
    • CITY: city
    • STATE: state
    • ZIP_CODE: zip code
• MAILING_CA: source InfoUSA
• MAILING_DE: source InfoUSA
• LOCATION_A: source InfoUSA, address
• LOCATION_1: source InfoUSA, city
• LOCATION_2: source InfoUSA, state
• LOCATION_3: source InfoUSA, zip code
• LOCATION_4: source InfoUSA
• LOCATION_5: source InfoUSA
    • COUNTY: county
    • PHONE_NUMB: phone number
    • WEB_ADDRES: website address
    • LAST_NAME: contact last name
    • FIRST_NAME: contact first name
    • CONTACT_TI: contact type
    • CONTACT_PR:
    • CONTACT_GE: contact gender
    • ACTUAL_EMP: employee number
    • EMPLOYEE_S: employee number class
    • ACTUAL_SAL: actual sale
    • SALES_VOLU: sales value
    • PRIMARY_SI: primary sales value
    • PRIMARY_1: primary classification
    • SECONDARY_: secondary classification
    • SECONDARY1
    • SECONDAR_1
    • SECONDAR_2
    • CREDIT_ALP: credit level
    • CREDIT_NUM: credit number
• HEADQUARTE: headquarters
    • YEAR_1ST_A: year open
    • OFFICE_SIZ: office size
    • SQUARE_FOO: square foot
    • FIRM_INDIV:
    • PUBLIC_PRI
    • Fleet_size
    • FRANCHISE_
    • FRANCHISE1
    • INDUSTRY_S
    • ADSIZE_IN_
    • METRO_AREA
    • INFOUSA_ID
    • LATITUDE: y
    • LONGITUDE: x
    • PARKING: parking adjacency
    • NAICS_CODE: NAICS CODE
    • NAICS_DESC: NAICS DESCRIPTION
• parcelnum*: PARCEL NUMBER
• parcelobji*: PARCEL OBJECT ID
• CHECK_*
• ACCESSIABLE*: PUBLIC ACCESSIBILITY
• PROPMANAGER*: PROPERTY MANAGER
• GlobalID
Notes: fields marked with * came from another source or from field research done by Tian Xie in August 2018
    Building
    • OBJECTID_12
    • BUILDING_I: building id
    • PARCEL_ID : parcel id
    • BUILD_TYPE: building type
    • CITY_ID:city id
    • APN: parcel number
    • RES_SQFT: Res square feet
• NONRES_SQF: non-res square feet
    • YEAR_BUILT: year built
    • YEAR_DEMO
    • HOUSING_UN: housing units
    • STORIES: # of stories
    • MEDIAN_HGT: median height
    • CONDITION: building condition
    • HAS_CONDOS: has condos or not
    • FLAG_SQFT: flag square feet
    • FLAG_YEAR_: flag year
    • FLAG_CONDI: flag condition
    • LOADD1: address number
    • HIADD1 (type: esriFieldTypeInteger, alias: HIADD1, SQL Type: sqlTypeOther, nullable: true, editable: true)
    • STREET1: street name
    • LOADD2:
    • HIADD2 (type: esriFieldTypeString, alias: HIADD2, SQL Type: sqlTypeOther, length: 80, nullable: true, editable: true)
    • STREET2 (type: esriFieldTypeString, alias: STREET2, SQL Type: sqlTypeOther, length: 80, nullable: true, editable: true)
    • ZIPCODE: zip code
    • AKA: building name
    • USE_LOCATO
    • TEMP (type: esriFieldTypeString, alias: TEMP, SQL Type: sqlTypeOther, length: 80, nullable: true, editable: true)
    • SPID (type: esriFieldTypeInteger, alias: SPID, SQL Type: sqlTypeOther, nullable: true, editable: true)
    • Zone (type: esriFieldTypeString, alias: Zone, SQL Type: sqlTypeOther, length: 60, nullable: true, editable: true)
    • F7_2SqMile (type: esriFieldTypeString, alias: F7_2SqMile, SQL Type: sqlTypeOther, length: 10, nullable: true, editable: true)
    • Shape_Leng (type: esriFieldTypeDouble, alias: Shape_Leng, SQL Type: sqlTypeOther, nullable: true, editable: true)
    • PARKING*: parking adjacency
    • OCCUPANCY*: occupied or not
    • BuildingType* : building type
    • TotalBusinessSpace*: available business space in this building
    • NonEmptySpace*: non-empty business space in this building
    • CHECK_*
    • FOLLOWUP*: need followup or not
    • GlobalID*
    • PropmMana*: property manager
Notes: fields marked with * came from another source or from field research done by Tian Xie in August 2018

11. SQL Injection Test (D3)

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Dec 28, 2021
    + more versions
Cite
Adrián Campazas (2021). SQL Injection Test (D3) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5806299
    Explore at:
    Dataset updated
    Dec 28, 2021
    Dataset provided by
    Ignacio Crespo
    Adrián Campazas
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset contains network flow data whose malicious flows are SQL injection attacks. The attacks carried out are union-query SQL injection and blind SQL injection, performed with the SQLmap tool.

  12. Cleaned Retail Customer Dataset (SQL-based ETL)

    • kaggle.com
    Updated May 3, 2025
Cite
Rizwan Bin Akbar (2025). Cleaned Retail Customer Dataset (SQL-based ETL) [Dataset]. https://www.kaggle.com/datasets/rizwanbinakbar/cleaned-retail-customer-dataset-sql-based-etl/versions/2
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 3, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rizwan Bin Akbar
License

CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Description

This dataset is a collection of customer, product, sales, and location data extracted from a CRM and ERP system for a retail company. It has been cleaned and transformed through various ETL (Extract, Transform, Load) processes to ensure data consistency, accuracy, and completeness. Below is a breakdown of the dataset components:

1. Customer Information (s_crm_cust_info)

    This table contains information about customers, including their unique identifiers and demographic details.

    Columns:
    
      cst_id: Customer ID (Primary Key)
    
      cst_gndr: Gender
    
      cst_marital_status: Marital status
    
      cst_create_date: Customer account creation date
    
    Cleaning Steps:
    
      Removed duplicates and handled missing or null cst_id values.
    
      Trimmed leading and trailing spaces in cst_gndr and cst_marital_status.
    
      Standardized gender values and identified inconsistencies in marital status.
    
2. Product Information (s_crm_prd_info / b_crm_prd_info)

    This table contains information about products, including product identifiers, names, costs, and lifecycle dates.

    Columns:
    
      prd_id: Product ID
    
      prd_key: Product key
    
      prd_nm: Product name
    
      prd_cost: Product cost
    
      prd_start_dt: Product start date
    
      prd_end_dt: Product end date
    
    Cleaning Steps:
    
      Checked for duplicates and null values in the prd_key column.
    
      Validated product dates to ensure prd_start_dt is earlier than prd_end_dt.
    
      Corrected product costs to remove invalid entries (e.g., negative values).
    
3. Sales Details (s_crm_sales_details / b_crm_sales_details)

    This table contains information about sales transactions, including order dates, quantities, prices, and sales amounts.

    Columns:
    
      sls_order_dt: Sales order date
    
      sls_due_dt: Sales due date
    
      sls_sales: Total sales amount
    
      sls_quantity: Number of products sold
    
      sls_price: Product unit price
    
    Cleaning Steps:
    
      Validated sales order dates and corrected invalid entries.
    
      Checked for discrepancies where sls_sales did not match sls_price * sls_quantity and corrected them.
    
      Removed null and negative values from sls_sales, sls_quantity, and sls_price.
    
4. ERP Customer Data (b_erp_cust_az12, s_erp_cust_az12)

    This table contains additional customer demographic data, including gender and birthdate.

    Columns:
    
      cid: Customer ID
    
      gen: Gender
    
      bdate: Birthdate
    
    Cleaning Steps:
    
      Checked for missing or null gender values and standardized inconsistent entries.
    
      Removed leading/trailing spaces from gen and bdate.
    
      Validated birthdates to ensure they were within a realistic range.
    
5. Location Information (b_erp_loc_a101)

    This table contains country information related to the customers' locations.

    Columns:
    
      cntry: Country
    
    Cleaning Steps:
    
      Standardized country names (e.g., "US" and "USA" were mapped to "United States").
    
      Removed special characters (e.g., carriage returns) and trimmed whitespace.
    
6. Product Category (b_erp_px_cat_g1v2)

    This table contains product category information.

    Columns:
    
      Product category data (no significant cleaning required).
    

    Key Features:

    Customer demographics, including gender and marital status
    
    Product details such as cost, start date, and end date
    
    Sales data with order dates, quantities, and sales amounts
    
    ERP-specific customer and location data
    

    Data Cleaning Process:

    This dataset underwent extensive cleaning and validation, including:

    Null and Duplicate Removal: Ensuring no duplicate or missing critical data (e.g., customer IDs, product keys).
    
    Date Validations: Ensuring correct date ranges and chronological consistency.
    
    Data Standardization: Standardizing categorical fields (e.g., gender, country names) and fixing inconsistent values.
    
    Sales Integrity Checks: Ensuring sales amounts match the expected product of price and quantity.
    

    This dataset is now ready for analysis and modeling, with clean, consistent, and validated data for retail analytics, customer segmentation, product analysis, and sales forecasting.
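The documented checks translate directly into code; for example, the sales integrity rule can be re-verified in pandas (a minimal sketch; the file name is an assumption, while the column names come from the description above):

    # Minimal sketch: re-run the sales integrity check described above.
    # File name is an assumption; columns match the dataset description.
    import pandas as pd

    sales = pd.read_csv("s_crm_sales_details.csv",
                        parse_dates=["sls_order_dt", "sls_due_dt"])

    # sls_sales should equal sls_price * sls_quantity.
    mismatch = sales["sls_sales"] != sales["sls_price"] * sales["sls_quantity"]
    print(f"{mismatch.sum()} rows fail the sales integrity check")

    # Drop null and non-positive values, as in the documented cleaning steps.
    cols = ["sls_sales", "sls_quantity", "sls_price"]
    sales = sales.dropna(subset=cols)
    sales = sales[(sales[cols] > 0).all(axis=1)]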

  13. Data from: WikiReddit: Tracing Information and Attention Flows Between...

    • zenodo.org
    bin
    Updated May 4, 2025
Cite
Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
Explore at:
Available download formats: bin
Dataset updated
May 4, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
    Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

    WikiReddit, a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

Table: posts

    subreddit_id (TEXT): The unique identifier for the subreddit.
    crosspost_parent_id (TEXT): The ID of the original Reddit post if this post is a crosspost.
    post_id (TEXT): Unique identifier for the Reddit post.
    created_at (TIMESTAMP): The timestamp when the post was created.
    updated_at (TIMESTAMP): The timestamp when the post was last updated.
    language_code (TEXT): The language code of the post.
    score (INTEGER): The score (upvotes minus downvotes) of the post.
    upvote_ratio (REAL): The ratio of upvotes to total votes.
    gildings (INTEGER): Number of awards (gildings) received by the post.
    num_comments (INTEGER): Number of comments on the post.

Table: comments

    subreddit_id (TEXT): The unique identifier for the subreddit.
    post_id (TEXT): The ID of the Reddit post the comment belongs to.
    parent_id (TEXT): The ID of the parent comment (if a reply).
    comment_id (TEXT): Unique identifier for the comment.
    created_at (TIMESTAMP): The timestamp when the comment was created.
    last_modified_at (TIMESTAMP): The timestamp when the comment was last modified.
    score (INTEGER): The score (upvotes minus downvotes) of the comment.
    upvote_ratio (REAL): The ratio of upvotes to total votes for the comment.
    gilded (INTEGER): Number of awards (gildings) received by the comment.

Table: postlinks

    post_id (TEXT): Unique identifier for the Reddit post.
    end_processed_valid (INTEGER): Whether the extracted URL from the post resolves to a valid URL.
    end_processed_url (TEXT): The extracted URL from the Reddit post.
    final_valid (INTEGER): Whether the final URL from the post resolves to a valid URL after redirections.
    final_status (INTEGER): HTTP status code of the final URL.
    final_url (TEXT): The final URL after redirections.
    redirected (INTEGER): Indicator of whether the posted URL was redirected (1) or not (0).
    in_title (INTEGER): Indicator of whether the link appears in the post title (1) or post body (0).

Table: commentlinks

    comment_id (TEXT): Unique identifier for the Reddit comment.
    end_processed_valid (INTEGER): Whether the extracted URL from the comment resolves to a valid URL.
    end_processed_url (TEXT): The extracted URL from the comment.
    final_valid (INTEGER): Whether the final URL from the comment resolves to a valid URL after redirections.
    final_status (INTEGER): HTTP status code of the final
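With the schema above, cross-platform questions reduce to simple joins (a minimal sketch; it assumes the database ships as a SQLite file, and the file name is a placeholder):

    # Minimal sketch: list a few valid Wikipedia links shared in posts.
    # Assumes a SQLite file; the file name is a placeholder.
    import sqlite3

    conn = sqlite3.connect("wikireddit.db")
    rows = conn.execute("""
        SELECT p.post_id, p.created_at, l.final_url
        FROM posts AS p
        JOIN postlinks AS l ON l.post_id = p.post_id
        WHERE l.final_valid = 1
        LIMIT 10
    """).fetchall()
    for post_id, created_at, url in rows:
        print(post_id, created_at, url)
    conn.close()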

  14. Domestic Electrical Load Survey Secure Data 1994-2014 - South Africa

    • datafirst.uct.ac.za
    Updated Jun 20, 2019
Cite
Eskom (2019). Domestic Electrical Load Survey Secure Data 1994-2014 - South Africa [Dataset]. https://www.datafirst.uct.ac.za/dataportal/index.php/catalog/757
    Explore at:
    Dataset updated
    Jun 20, 2019
    Dataset provided by
Eskom (http://www.eskom.co.za/)
    University of Cape Town
    Stellenbosch University
    Time period covered
    1995 - 2014
    Area covered
    South Africa
    Description

    Abstract

This dataset contains sensitive data that has not been disclosed in the online version of the Domestic Electrical Load Survey (DELS) 1994-2014 dataset. In contrast to the DELS dataset, the DELS Secure Data (DELSS) contains partially anonymised survey responses, with only the names of respondents and home owners removed. The DELSS contains street and postal addresses, as well as GPS-level location data for households from 2000 onwards. The GPS data is obtained through an auxiliary dataset, the Site Reference database. Like the DELS, the DELSS dataset has been retrieved and anonymised from the original SQL database with the Python package delretrieve.

    Geographic coverage

    The study had national coverage.

    Analysis unit

    Households and individuals

    Universe

    The survey covers electrified households that received electricity either directly from Eskom or from their local municipality. Particular attention was devoted to rural and low income households, as well as surveying households electrified over a range of years, thus having had access to electricity from recent times to several decades.

    Kind of data

    Sample survey data

    Sampling procedure

    See sampling procedure for DELS 1994-2014

    Mode of data collection

    Face-to-face [f2f]

    Cleaning operations

    This dataset has been produced by extracting only the survey responses from the original NRS Load Research SQL database using the saveAnswers function from the delretrieve python package (https://github.com/wiebket/delretrieve: release v1.0). Full instructions on how to use delretrieve to extract data are in the README file contained in the package.

    PARTIAL DE-IDENTIFICATION Partial de-identification was done in the process of extracting the data from the SQL database with the delretrieve package. Only the names of respondents and home owners have been removed from the survey responses by replacing responses with an 'a' in the dataset. Documents with full details of the variables that have been anonymised are included as external resources.

    MISSING VALUES Other than partial de-identification no post-processing was done and all database records, including missing values, are stored exactly as retrieved.

    Data appraisal

    See notes on data quality for DELS 1994-2014

15. ‘Building’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 30, 2018
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2018). ‘Building’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-building-6cf5/a7a262a1/?iid=040-219&v=presentation
    Explore at:
    Dataset updated
    Aug 30, 2018
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Building’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/e21c9e38-783a-4155-b7e3-cefe8a02136e on 26 January 2022.

    --- Dataset description provided by original source is as follows ---

    This is a collection of layers created by Tian Xie(Intern in DDP) in August, 2018. This collection includes Detroit Parcel Data(Parcel_collector), InfoUSA business data(BIZ_INFOUSA), and building data(Building). The building and business data have been edited by Tian during field research and have attached images.

    The original source for these layers are:
    1. Business Data: InfoUSA business database purchased by DDP in 2017
    2. Building Data: Detroit Building Footprint data
    3. Parcel Data: from Detroit Open Data Portal, download in May 2018.
    For field research by Tian, some fields have been added and some records in building and business have been edited.
    1. For business data, Tian confirmed most of public assessable businesses and deleted those which do not exist. Also, Tian add new Business to the business data if it did not exist on the record.
    2. For building data, Tian recorded the total business space for each building, not-empty business space, occupancy status, parking adjacency status, and took picture for every building in downtown Detroit.
    Detail field META DATA:
    InfoUSA Business
    • OBJECTID_1
    • COMPANY_NA: company name
    • ADDRESS: company address
    • CITY: city
    • STATE: state
    • ZIP_CODE: zip code
    • MAILING_CA: source InfoUSA
    • MAILING_DE source InfoUSA
    • LOCATION_A source InfoUSA: address
    • LOCATION_1 source InfoUSA: city
    • LOCATION_2 source InfoUSA: state
    • LOCATION_3 source InfoUSA: zip code
    • LOCATION_4source InfoUSA
    • LOCATION_5 source InfoUSA
    • COUNTY: county
    • PHONE_NUMB: phone number
    • WEB_ADDRES: website address
    • LAST_NAME: contact last name
    • FIRST_NAME: contact first name
    • CONTACT_TI: contact type
    • CONTACT_PR:
    • CONTACT_GE: contact gender
    • ACTUAL_EMP: employee number
    • EMPLOYEE_S: employee number class
    • ACTUAL_SAL: actual sale
    • SALES_VOLU: sales value
    • PRIMARY_SI: primary sales value
    • PRIMARY_1: primary classification
    • SECONDARY_: secondary classification
    • SECONDARY1
    • SECONDAR_1
    • SECONDAR_2
    • CREDIT_ALP: credit level
    • CREDIT_NUM: credit number
    • HEADQUARTE: headquarte
    • YEAR_1ST_A: year open
    • OFFICE_SIZ: office size
    • SQUARE_FOO: square foot
    • FIRM_INDIV:
    • PUBLIC_PRI
    • Fleet_size
    • FRANCHISE_
    • FRANCHISE1
    • INDUSTRY_S
    • ADSIZE_IN_
    • METRO_AREA
    • INFOUSA_ID
    • LATITUDE: y
    • LONGITUDE: x
    • PARKING: parking adjacency
    • NAICS_CODE: NAICS CODE
    • NAICS_DESC: NAICS DESCRIPTION
    • parcelnum*: PARCEL NUMBER
    • parcelobji* PARCEL OBJECT ID
    • CHECK_*
    • ACCESSIABLE* PUBLIC ACCESSIBILITY
    • PROPMANAGER* PROPERTY MANAGER
    • GlobalID
    Notes: field with * means it came from other source or field research done by Tian Xie in Aug, 2018
    Building
    • OBJECTID_12
    • BUILDING_I: building id
    • PARCEL_ID : parcel id
    • BUILD_TYPE: building type
    • CITY_ID:city id
    • APN: parcel number
    • RES_SQFT: Res square feet
    • NONRES_SQF non-res square feet
    • YEAR_BUILT: year built
    • YEAR_DEMO
    • HOUSING_UN: housing units
    • STORIES: # of stories
    • MEDIAN_HGT: median height
    • CONDITION: building condition
    • HAS_CONDOS: has condos or not
    • FLAG_SQFT: flag square feet
    • FLAG_YEAR_: flag year
    • FLAG_CONDI: flag condition
    • LOADD1: address number
    • HIADD1 (type: esriFieldTypeInteger, alias: HIADD1, SQL Type: sqlTypeOther, nullable: true, editable: true)
    • STREET1: street name
    • LOADD2:
    • HIADD2 (type: esriFieldTypeString, alias: HIADD2, SQL Type: sqlTypeOther, length: 80, nullable: true, editable: true)
    • STREET2 (type: esriFieldTypeString, alias: STREET2, SQL Type: sqlTypeOther, length: 80, nullable: true, editable: true)
    • ZIPCODE: zip code
    • AKA: building name
    • USE_LOCATO
    • TEMP (type: esriFieldTypeString, alias: TEMP, SQL Type: sqlTypeOther, length: 80, nullable: true, editable: true)
    • SPID (type: esriFieldTypeInteger, alias: SPID, SQL Type: sqlTypeOther, nullable: true, editable: true)
    • Zone (type: esriFieldTypeString, alias: Zone, SQL Type: sqlTypeOther, length: 60, nullable: true, editable: true)
    • F7_2SqMile (type: esriFieldTypeString, alias: F7_2SqMile, SQL Type: sqlTypeOther, length: 10, nullable: true, editable: true)
    • Shape_Leng (type: esriFieldTypeDouble, alias: Shape_Leng, SQL Type: sqlTypeOther, nullable: true, editable: true)
    • PARKING*: parking adjacency
    • OCCUPANCY*: occupied or not
    • BuildingType*: building type
    • TotalBusinessSpace*: available business space in this building
    • NonEmptySpace*: non-empty business space in this building
    • CHECK_*
    • FOLLOWUP*: need followup or not
    • GlobalID*
    • PropmMana*: property manager
    Note: fields marked with * come from another source or from field research done by Tian Xie in August 2018.
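
    As a quick illustration of how the starred field-research columns might be used, here is a minimal Python sketch that computes a per-building vacancy share from the Building layer. It assumes a hypothetical CSV export named Building.csv and numeric TotalBusinessSpace/NonEmptySpace columns; it is not part of the original dataset tooling.

      # Minimal sketch: per-building vacancy share from the Building layer.
      # "Building.csv" is a hypothetical export; the column names come from
      # the field list above.
      import pandas as pd

      buildings = pd.read_csv("Building.csv")

      # Avoid dividing by zero for buildings with no recorded business space.
      has_space = buildings["TotalBusinessSpace"] > 0
      buildings.loc[has_space, "vacancy_share"] = (
          1 - buildings.loc[has_space, "NonEmptySpace"]
          / buildings.loc[has_space, "TotalBusinessSpace"]
      )

      # AKA holds the building name, per the metadata above.
      print(buildings.loc[has_space, ["AKA", "vacancy_share"]]
            .sort_values("vacancy_share", ascending=False)
            .head(10))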

    --- Original source retains full ownership of the source dataset ---

  16. ‘Parcel collector’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 26, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Parcel collector’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-parcel-collector-b20c/fdf9e6d7/?iid=035-740&v=presentation
    Explore at:
    Dataset updated
    Jan 26, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Parcel collector’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/23f2a097-00e0-44ce-9eb1-c79232471121 on 26 January 2022.

    --- The dataset description provided by the original source is identical to the Detroit parcel, building, and business collection description reproduced above ---

    --- Original source retains full ownership of the source dataset ---

  17. Data from: "ICDAR2023 Competition on Detection and Recognition of Greek...

    • data.niaid.nih.gov
    Updated Sep 24, 2024
    Cite
    Serbaeva, Olga (2024). "ICDAR2023 Competition on Detection and Recognition of Greek Letters on Papyri" Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13825618
    Explore at:
    Dataset updated
    Sep 24, 2024
    Dataset provided by
    Agolli, Selaudin
    Seuret, Mathias
    Rodriguez-Salas, Dalia
    Christlein, Vincent
    Carrière, Guillaume
    White, Stephen
    Marthot-Santaniello, Isabelle
    Serbaeva, Olga
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset description of the “ICDAR2023 Competition on Detection and Recognition of Greek Letters on Papyri”

    Prof. Dr. Isabelle Marthot-Santaniello, Dr. Olga Serbaeva

    2024.09.16

    Introduction

    The present dataset stems from the ICDAR2023 Competition on Detection and Recognition of Greek Letters on Papyri (original links to the competition are provided in the file “1b.CompetitionLinks.”)

    The aim of this competition was to investigate the performance of glyph detection and recognition in a very challenging type of historical document: Greek papyri. The detection and recognition of Greek letters on papyri is a preliminary step for computational analysis of handwriting that can lead to major steps forward in our understanding of this important source of information on Antiquity. Such detection and recognition can be done manually by trained papyrologists; it is, however, a time-consuming task that needs automating.

    We provide here the documents related to two different tasks: localisation and classification. The document images are provided by several institutions and are representative of the diversity of book hands on papyri (a millennium time span, various script styles, provenance, states of preservation, means of digitization and resolution).

    How the dataset was constructed

    In the frame of the D-Scribes project, led by Prof. Dr. Isabelle Marthot-Santaniello (2018-2023), around 150 papyrus fragments containing the Iliad were manually annotated at letter level in READ.

    The editions were taken, for the most part, from papyri.info and were simplified, i.e. accents, editorial marks, and other additional information were removed so as to stay as close as possible to what is found on the papyri. When a text was not available on papyri.info, the relevant passage was extracted from the Homer Iliad on Perseus.

    From those 150-plus papyrus fragments, 185 surfaces (sides of fragments) belonging to 136 different manuscripts, identified by their Trismegistos numbers (hereafter TMs), were selected as material for the competition. These 185 surfaces were separated into the “training set” and the “test set” provided for the competition as a set of images and corresponding data in JSON format.

    Details on the competition are summarised in "ICDAR 2023 Competition on Detection and Recognition of Greek Letters on Papyri" by Mathias Seuret, Isabelle Marthot-Santaniello, Stephen A. White, Olga Serbaeva Saraogi, Selaudin Agolli, Guillaume Carrière, Dalia Rodriguez-Salas, and Vincent Christlein, in G. A. Fink et al. (eds.): ICDAR 2023, LNCS 14188, pp. 498–507, 2023. https://doi.org/10.1007/978-3-031-41679-8_29.

    After the competition ended, the decision was taken to release the manually annotated data for the “test set” as well. Please find the description of each included document below.

    Dataset Structure

    “1. CompetitionOverview.xlsx” contains the metadata of the images used, as of 2024.09.19. The structure of the Excel file is as follows:

    Column | Name | Content | Notes
    A | TM | Trismegistos number, internationally used for papyri identification | with the READ item name in ()
    B | Papyri.info link | link |
    C | Fragments' owning institution (from papyri.info) | institution's name | the institution that physically stores the papyri
    D | Availability (of metadata, papyri.info) | link | metadata reuse clarification
    E | text ID (READ) | number from the READ SQL database used to link the images and the editions | serves to locate the attached images and to understand the JSON structure
    F | Test/Training | whether the image was originally included in the training set or the test set |
    G | Image name (for orientation) | as in READ |
    H | Cedopal link | link | contains additional metadata and links to all available online images
    I | License from the institution webpage | either a license or a usage summary | if no precise licence is given, a summary of the reuse rights is provided with a link to the regulations in column K
    J | Image URL | link | not all images are available online; please contact the owning institution directly if an image is not available
    K | Information on image usage from the institution | link | in case of any doubt, please contact the owning institution directly
    L | Notes | items with special problems (images not online, missing links) are marked in red for an easy overview |

    2. There are three data subsets:

    2a. “Training file” (containing 150 papyri images separated into 108 texts, plus HomerCompTraining.json). The images are JPGs of papyri containing Homer's Iliad. These were processed in READ: each visible letter on a given papyrus was linked to the edition of the Iliad, and through this process each linked letter of the edition was linked to its coordinates in pixels on the HTML surface of the image. All of that information is provided in the JSON file.

    The JSON file contains “annotations” (bounding boxes of each letter/sign), “categories” (Greek letters), “images” (image IDs), and “licenses”. The link between images and bounding boxes is defined via the “id” in the “images” part (for example, "id": 6109); this same id is encoded as "image_id": 6109 in the “annotations”. Alternatively, the “text_id”, which can be found in the “images” URL and in the file names provided here, can be used for data linking.

    Let us now describe the content of each part of the JSON file. Each “annotation” contains an “area” characterised as a “bbox” with coordinates; a “category_id”, which identifies the Greek letter represented by the number; an “id”, the unique number of the cliplet (i.e. area); an “image_id”, which links the cliplet to the surface of the image with the same id; “iscrowd” and “seg_id”, which are useful for finding the information back in the READ database; and, finally, “tags”.

    In tags, “BaseType” was used to annotate quality as described below. “FootMarkType” (ft1, etc.) was used for clustering tests but played no role in the competition. “BaseType” or bt-tags were assigned to the letters to mark the quality of preservation:
    • bt-1: a well-preserved letter that allows easy identification for both human eyes and computer vision;
    • bt-2: a partially preserved letter that may also have some background damage (holes, additional ink, etc.) but remains readable and has a single interpretation;
    • bt-3: a letter damaged to such an extent that it cannot be identified without reading an edition; these are treated as traces of ink;
    • bt-4: a letter with damage of a kind that allows multiple interpretations, e.g. a missing or defaced horizontal stroke makes an alpha indistinguishable from a damaged delta or lambda.

    Each “category” contains an “id”, the number referenced in “annotations” that identifies which Greek letter was in the bbox; a “name”, for example “χ”; and a “supercategory”, i.e. “Greek”.

    Each “image” contains the following subfields: “bln_id” is an internal READ number of the HTML surface; “date_captured”: null is another READ field; “file_name”, e.g. "./images/homer2/txt1/P.Corn.Inv.MSS.A.101.XIII.jpg", makes it easy to link image and text (the JPG for this image sits in the folder called “txt1”) and is very similar in structure and function to “img_url”. Each image has a “height” and “width” expressed in pixels, an “id” referenced in the “annotations” under “image_id”, and, finally, a link to a “license”, expressed as a number.

    Each “license” lists a license as it was found at the time of the competition, i.e. in February 2023.
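
    To make the linking concrete, here is a minimal Python sketch of reading HomerCompTraining.json and joining annotations to their images and letter categories. The key names follow the description above; the file path and the exact encoding of the “tags” field are assumptions, not guaranteed by the released files.

      # Minimal sketch: join annotations to images and Greek-letter categories.
      # Keys follow the description above; the path and the exact encoding of
      # "tags" (assumed to mention the BaseType value, e.g. "bt-1") may differ.
      import json

      with open("HomerCompTraining.json", encoding="utf-8") as f:
          data = json.load(f)

      images = {img["id"]: img for img in data["images"]}
      letters = {cat["id"]: cat["name"] for cat in data["categories"]}

      # Keep only letters tagged as well preserved (BaseType bt-1).
      well_preserved = [
          ann for ann in data["annotations"]
          if "bt-1" in str(ann.get("tags", ""))
      ]

      for ann in well_preserved[:5]:
          img = images[ann["image_id"]]
          print(letters[ann["category_id"]], ann["bbox"], img["file_name"])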

    2b. “Test file”, containing 34 papyrus image sides separated into 31 TMs, plus HomerCompTesting.json. The JSON file here only connects the images with the “categories”, “images”, and “licenses”, without the “annotations”. The structure and logic are otherwise the same as in the “Training” JSON.

    2c. “Answers file”, containing the “annotations” and other information for the 34 papyri of the “Testing” dataset. The structure and logic are the same as in the “Training” JSON.

    3. “Additional files”, containing lists of duplicate segment ids (multiple possible readings or tags): 6 items for “Training”, 17 for “Testing”, and 15 for “Answers”.

    4. “Dataset Description”: this same description, included for completeness.

    References

    The dataset has been reused or mentioned in a number of publications (state: September 2024):

    Mohammed, H., Jampour, M. (2024). "From Detection to Modelling: An End-to-End Paleographic System for Analysing Historical Handwriting Styles". In: Sfikas, G., Retsinas, G. (eds) Document Analysis Systems. DAS 2024. Lecture Notes in Computer Science, vol 14994. Springer, Cham, pp. 363–376. https://doi.org/10.1007/978-3-031-70442-0_22

    De Gregorio, G., Perrin, S., Pena, R.C.G., Marthot-Santaniello, I., Mouchère, H. (2024). "NeuroPapyri: A Deep Attention Embedding Network for Handwritten Papyri Retrieval". In: Mouchère, H., Zhu, A. (eds) Document Analysis and Recognition – ICDAR 2024 Workshops. ICDAR 2024. Lecture Notes in Computer Science, vol 14936. Springer, Cham, pp. 71–86. https://doi.org/10.1007/978-3-031-70642-4_5

    Vu, M. T., Beurton-Aimar, M. "PapyTwin net: a Twin network for Greek letters detection on ancient Papyri". HIP '23: 7th International Workshop on Historical Document Imaging and Processing, San Jose, CA, USA, August

  18. Use of Force department data

    • data.world
    csv, zip
    Updated Mar 8, 2024
    Cite
    NJ Advance Data Team (2024). Use of Force department data [Dataset]. https://data.world/njdotcom/use-of-force-department-data
    Explore at:
    csv, zipAvailable download formats
    Dataset updated
    Mar 8, 2024
    Authors
    NJ Advance Data Team
    Description

    This is five years of police use of force data for all 468 New Jersey municipal police departments and the New Jersey State Police compiled by NJ Advance Media for The Force Report.

    When police punch, pepper spray or use other force against someone in New Jersey, they are required to fill out a form detailing what happened. NJ Advance Media filed 506 public records requests and received 72,607 forms covering 2012 through 2016. For more data collection details, see our Methodology here. Data cleaning details can be found here.

    We then cleaned, analyzed and compiled the data by department to get a better look at what departments were using the most force, what type of force they were using, and who they were using it on. The result, our searchable database, can be found at NJ.com/force. But we wanted to make department-level results — our aggregate data — available in another way to the broader public.

    Below you'll find two files:

    • UOF_BY_DEPARTMENTS.csv, with every department's summary data, including the State Police. (This is important to note because the State Police patrols multiple towns and may not be comparable to other departments.)
    • UOF_STATEWIDE.csv, a statewide summary of the same data.

    For more details on individual columns, see the data dictionary for UOF_BY_DEPARTMENTS. We have also created sample SQL queries to make it easy for users to quickly find their town or county.
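
    In the same spirit as those sample SQL queries, here is a minimal Python sketch for pulling one department's summary row out of UOF_BY_DEPARTMENTS.csv. The column name "department" is an assumption; check the data dictionary for the real one.

      # Minimal sketch: look up one department in UOF_BY_DEPARTMENTS.csv.
      # The "department" column name is an assumption; see the data dictionary.
      import pandas as pd

      uof = pd.read_csv("UOF_BY_DEPARTMENTS.csv")
      print(uof[uof["department"].str.contains("Newark", case=False, na=False)])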

    It's important to note that these forms were self-reported by police officers, sometimes filled out by hand, so even our data cleaning can't totally prevent inaccuracies from cropping up. We've also included comparisons to population data (from the Census) and arrest data (from the FBI Uniform Crime Report), to try to help give context to what you're seeing.

    What about the form-level data?

    We have included individual incidents on each department page, but we are not publishing the form-level data freely to the public. Not only is that data extremely dirty and difficult to analyze — at least, it took us six months — but it contains private information about subjects of force, including minors and people with mental health issues. However, we are planning to make a version of that file available upon request in the future.

    Data analysis FAQ

    What are rows? What are incidents?
    Every time any police officer uses force against a subject, they must fill out a form detailing what happened and what force they used. But sometimes multiple police officers used force against the same subject in the same incident. "Rows" are individual forms officers filled out, "incidents" are unique incidents based on the incident number and date.
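
    As a sketch of that rows-versus-incidents distinction (with hypothetical column names incident_number and date, since the form-level schema is not published here):

      # Rows are individual officer forms; incidents are unique
      # (incident number, date) pairs. Column names are hypothetical.
      import pandas as pd

      forms = pd.read_csv("forms.csv")  # hypothetical form-level file
      n_rows = len(forms)
      n_incidents = len(forms.drop_duplicates(subset=["incident_number", "date"]))
      print(n_rows, "rows,", n_incidents, "unique incidents")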

    What are the odds ratios, and how did you calculate them?
    We wanted a simple way of showing readers the disparity between black and white subjects in a particular town. So we used an odds ratio, a statistical method often used in research to compare the odds of one thing happening to another. For population, the calculation was (Number of black subjects/Total black population of area)/(Number of white subjects/Total white population of area). For arrests, the calculation was (Number of black subjects/Total number of black arrests in area)/(Number of white subjects/Total number of white arrests in area). In addition, when we compared anything to arrests, we took out all incidents where the subject was an EDP (emotionally disturbed person).
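
    As a worked illustration of the population-based formula, with made-up counts rather than actual Force Report numbers:

      # Population-based odds ratio, as defined above. Counts are illustrative.
      black_subjects, black_population = 120, 10_000
      white_subjects, white_population = 80, 30_000

      odds_ratio = (black_subjects / black_population) / (
          white_subjects / white_population
      )
      print(f"odds ratio: {odds_ratio:.1f}")  # 4.5x disparity in this example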

    What are the NYC/LA/Chicago warning systems?
    Those three departments each look at use of force to flag officers who show concerning patterns, as a way to select those who could merit more training or other action by the department. We compared our data to those three systems to see how many officers would trigger the early warning systems for each. Here are the three systems:
    • In New York City, officers are flagged for review if they use higher levels of force (including a baton, Taser or firearm, but not pepper spray) or if anyone was injured or hospitalized. We calculated this number by identifying every officer who met one or more of the criteria.
    • In Los Angeles, officers are compared with one another based on 14 variables, including use of force. If an officer ranks significantly higher than peers for any of the variables (technically, three standard deviations from the norm), supervisors are automatically notified. We calculated this number conservatively by using only use of force as a variable over the course of a calendar year.
    • In Chicago, officers are flagged for review if force results in an injury or hospitalization, or if the officer uses any level of force above punches or kicks. We calculated this number by identifying every officer who met one or more of the criteria.

    What are the different levels of force?
    Each officer was required to record on the form what type of force they used against a subject. We cleaned and standardized the data into major categories, although officers could write in a different type of force if they wanted to. Here are the major categories:
    • Compliance hold: a painful maneuver using pressure points to gain control over a suspect. It is the lowest level of force and the most commonly used, but it is often used in conjunction with other types of force.
    • Takedown: a technique used to bring a suspect to the ground and eventually onto their stomach to cuff them. It can be a leg sweep or a tackle.
    • Hands/fist: open-hand or closed-fist strikes/punches.
    • Leg strikes: any kick or knee used on a subject.
    • Baton: officers are trained to use a baton when punches or kicks are unsuccessful.
    • Pepper spray: police pepper spray, a mist derived from the resin of cayenne pepper, is considered “mechanical force” under state guidelines.
    • Deadly force: the firing of an officer's service weapon, regardless of whether a subject was hit. “Warning shots” are prohibited, and officers are instructed not to shoot just to maim or subdue a suspect.

  19. ‘BIZ INFOUSA’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 26, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘BIZ INFOUSA’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-biz-infousa-dd76/02222eb0/?iid=042-438&v=presentation
    Explore at:
    Dataset updated
    Jan 26, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘BIZ INFOUSA’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/ffb9f9b4-8a49-4e30-b6a0-67be780fe82b on 26 January 2022.

    --- The dataset description provided by the original source is identical to the Detroit parcel, building, and business collection description reproduced above ---

    --- Original source retains full ownership of the source dataset ---

  20. Personal SuperCOSMOS Science Archive (SSA) - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Oct 20, 2022
    + more versions
    Cite
    (2022). Personal SuperCOSMOS Science Archive (SSA) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/cbbfb00c-078e-51dc-bc48-2f0ccf02aee7
    Explore at:
    Dataset updated
    Oct 20, 2022
    Description

    Small subset of the SuperCOSMOS Science Archive (SSA), useful for testing queries. The SuperCOSMOS data held in the SSA primarily originate from scans of Palomar and UK Schmidt blue, red and near-IR southern-sky surveys. The ESO Schmidt R (dec -17.5) surveys have also been scanned and provide a first-epoch red measurement. Further details on the surveys, the scanning process and the raw parameters extracted can be found via the further-information link.

    The SSA is housed in a relational database running on Microsoft SQL Server 2000. Data are stored in tables that are inter-linked via reference ID numbers. In addition to the astronomical object catalogues, these tables also contain information on the plates that were scanned, survey field centres and calibration coefficients. Most user science queries will only need to access the SOURCE table or, to a lesser extent, the DETECTION table.

    Detection table: cone search of detections from all plate measurements in all bands.
    Source table: cone search of the single-band merged source catalog.
    Two applications are available: a general ADQL query, and an asynchronous cone search where relevant/enabled.
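
    Since the archive exposes a general ADQL query application, a cone search against the SOURCE table can be sketched as below. This is a sketch only: the TAP endpoint URL and the ra/dec column names are assumptions, not taken from the SSA documentation.

      # Minimal sketch: ADQL cone search against the SOURCE table via a
      # synchronous TAP endpoint. The URL and ra/dec column names are
      # assumptions; consult the SSA documentation for the real ones.
      import requests

      TAP_SYNC_URL = "https://example.org/ssa/tap/sync"  # hypothetical endpoint

      adql = """
      SELECT TOP 10 *
      FROM SOURCE
      WHERE 1 = CONTAINS(POINT('ICRS', ra, dec),
                         CIRCLE('ICRS', 180.0, -30.0, 0.1))
      """

      resp = requests.post(TAP_SYNC_URL, data={
          "REQUEST": "doQuery",  # standard TAP sync parameters
          "LANG": "ADQL",
          "FORMAT": "csv",
          "QUERY": adql,
      })
      resp.raise_for_status()
      print(resp.text[:500])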
