10 datasets found

Cyclistic_data_visualization
kaggle.com
Updated Jun 12, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mark Woychick (2021). Cyclistic_data_visualization [Dataset]. https://www.kaggle.com/markwoychick/cyclistic-data-visualization
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 12, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Mark Woychick
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

I created these files and analysis as part of working on a case study for the Google Data Analyst certificate.

Question investigated: Do annual members and casual riders use Cyclistic bikes differently? Why do we want to know?: Knowing bike usage/behavior by rider type will allow the Marketing, Analytics, and Executive team stakeholders to design, assess, and approve appropriate strategies that drive profitability.

Content

I used the script noted below to clean the files and then added some additional steps to create the visualizations to complete my analysis. The additional steps are noted in corresponding R Markdown file for this data set.

Acknowledgements

Files: most recent 1 year of data available, Divvy_Trips_2019_Q2.csv, Divvy_Trips_2019_Q3.csv, Divvy_Trips_2019_Q4.csv, Divvy_Trips_2020_Q1.csv Source: Downloaded from https://divvy-tripdata.s3.amazonaws.com/index.html

Data cleaning script: followed this script to clean and merge files https://docs.google.com/document/d/1gUs7-pu4iCHH3PTtkC1pMvHfmyQGu0hQBG5wvZOzZkA/copy

Note: Combined data set has 3,876,042 rows, so you will likely need to run R analysis on your computer (e.g., R Console) rather than in the cloud (e.g., RStudio Cloud)

Inspiration

This was my first attempt to conduct an analysis in R and create the R Markdown file. As you might guess, it was an eye-opening experience, with both exciting discoveries and aggravating moments.

One thing I have not yet been able to figure out is how to add a legend to the map. I was able to get a legend to appear on a separate (empty) map, but not on the map you will see here.

I am also interested to see what others did with this analysis - what were the findings and insights you found?
g
Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...
datasearch.gesis.org
openicpsr.org
Updated Feb 19, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017 [Dataset]. http://doi.org/10.3886/E105403V3
Explore at:
Unique identifier
https://doi.org/10.3886/E105403V3
Dataset updated
Feb 19, 2020
Dataset provided by
da|ra (Registration agency for social science and economic data)
Authors
Kaplan, Jacob
Description
For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.Version 3 release notes:Adds data in the following formats: Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Version 2 release notes:Adds data for 2017.Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and previous metals, guns), the value of property stolen and and the value for property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported). All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
f
Cleaned NHANES 1988-2018
figshare.com
txt
Updated Feb 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21743372.v9
Dataset updated
Feb 18, 2025
Dataset provided by
figshare
Authors
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The National Health and Nutrition Examination Survey (NHANES) provides data and have considerable potential to study the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables conveydemographics (281 variables),dietary consumption (324 variables),physiological functions (1,040 variables),occupation (61 variables),questionnaires (1444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),medications (29 variables),mortality information linked from the National Death Index (15 variables),survey weights (857 variables),environmental exposure biomarker measurements (598 variables), andchemical comments indicating which measurements are below or above the lower limit of detection (505 variables).csv Data Record: The curated NHANES datasets and the data dictionaries includes 23 .csv files and 1 excel file.The curated NHANES datasets involves 20 .csv formatted files, two for each module with one as the uncleaned version and the other as the cleaned version. The modules are labeled as the following: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments."dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES."dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.“dictionary_drug_codes.csv” contains the dictionary for descriptors on the drugs codes.“nhanes_inconsistencies_documentation.xlsx” is an excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.R Data Record: For researchers who want to conduct their analysis in the R programming language, only cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file which include an .RData file and an .R file.“w - nhanes_1988_2018.RData” contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.“m - nhanes_1988_2018.R” shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.“example_0 - merge_datasets_together.Rmd” demonstrates how to merge the curated NHANES datasets together.“example_1 - account_for_nhanes_design.Rmd” demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.“example_2 - calculate_summary_statistics.Rmd” demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.“example_3 - run_multiple_regressions.Rmd” demonstrates how run multiple regression models with and without adjusting for the sampling design.
H
Replication Data for: Trajectories of mental health problems in childhood...
dataverse.harvard.edu
search.dataone.org
Updated Dec 12, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lisa-Christine Girard; Martin Okolikj (2022). Replication Data for: Trajectories of mental health problems in childhood and adult voting behaviour: Evidence from the 1970s British Cohort Study [Dataset]. http://doi.org/10.7910/DVN/S6UUBF
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/S6UUBF
Dataset updated
Dec 12, 2022
Dataset provided by
Harvard Dataverse
Authors
Lisa-Christine Girard; Martin Okolikj
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This file describes the replication material for: Trajectories of mental health problems in childhood and adult voting behaviour: Evidence from the 1970s British Cohort Study. Authors: Lisa-Christine Girard & Martin Okolikj. Accepted in Political Behavior. This dataverse holds the following 4 replication files: 1. data_cleaning_traj.R - This file is designed to load, merge and clean the datasets for the estimation of trajectories along with the rescaling of the age 10 Rutter scale. This file was prepared using R-4.1.1 version. 2. traj_estimation.do - With the dataset merged from data_cleaning_traj.R, we run this file in STATA to create and estimate trajectories, to be included in the full dataset. This file was prepared using STATA 17.0 version. 3. data_cleaning.R - This is the file designed to load, merge and clean all datasets in one for preparation of the main analysis following the trajectory estimation. This file was prepared using R-4.1.1 version. 4. POBE Analysis.do - The analysis file is designed to generate the results from the tables in the published paper along with all supplementary materials. This file was prepared using STATA 17.0 version. The data can be accessed at the following address. It requires user registration under special licence conditions: http://discover.ukdataservice.ac.uk/series/?sn=200001. If you have any questions or spot any errors please contact g.lisachristine@gmail.com or martin.okolic@gmail.com.
Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race,...
search.datacite.org
openicpsr.org
Updated 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016 [Dataset]. http://doi.org/10.3886/e102263v5-10021
Explore at:
Unique identifier
https://doi.org/10.3886/e102263v5-10021
Dataset updated
2018
Dataset provided by
Inter-university Consortium for Political and Social Researchhttps://www.icpsr.umich.edu/web/pages/
DataCitehttps://www.datacite.org/
Authors
Jacob Kaplan
Description
Version 5 release notes:
Removes support for SPSS and Excel data.Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.
Adds in agencies that report 0 months of the year.Adds a column that indicates the number of months reported. This is generated summing up the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime. They may not necessarily report every crime every month. Agencies that did not report a crime with have a value of NA for every arrest column for that crime.Removes data on runaways.
Version 4 release notes:
Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these column includes the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
Version 3 release notes:
Add data for 2016.Order rows by year (descending) and ORI.Version 2 release notes:
Fix bug where Philadelphia Police Department had incorrect FIPS county code.
The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2015 into a single file. These files are quite large and may take some time to load.
All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here. https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possible incorrect) assumption that these values represent zero crimes reported. The original data does not have a value when the agency reports zero arrests other than "None/not reported." In other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.

To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, If you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrests for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.

To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.

I created 9 arrest categories myself. The categories are:
Total Male JuvenileTotal Female JuvenileTotal Male AdultTotal Female AdultTotal MaleTotal FemaleTotal JuvenileTotal AdultTotal ArrestsAll of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than using the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set need fewer columns, I include all offenses.

As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files, eight which contain different crimes and the "simple" file. Each file contains the data for all years. The eight categories each have crimes belonging to a major crime category and do not overlap in crimes other than with the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Due to Stata limiting column names to 32 characters maximum, I have abbreviated the crime names in the data. The files and their included crimes are:

Index Crimes
MurderRapeRobberyAggravated AssaultBurglaryTheftMotor Vehicle TheftArsonAlcohol CrimesDUIDrunkenness
LiquorDrug CrimesTotal DrugTotal Drug SalesTotal Drug PossessionCannabis PossessionCannabis SalesHeroin or Cocaine PossessionHeroin or Cocaine SalesOther Drug PossessionOther Drug SalesSynthetic Narcotic PossessionSynthetic Narcotic SalesGrey Collar and Property CrimesForgeryFraudStolen PropertyFinancial CrimesEmbezzlementTotal GamblingOther GamblingBookmakingNumbers LotterySex or Family CrimesOffenses Against the Family and Children
Other Sex Offenses
ProstitutionRapeViolent CrimesAggravated AssaultMurderNegligent ManslaughterRobberyWeapon Offenses
Other CrimesCurfewDisorderly ConductOther Non-trafficSuspicion
VandalismVagrancy
Simple
This data set has every crime and only the arrest categories that I created (see above).
If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.
g
Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...
datasearch.gesis.org
openicpsr.org
Updated Feb 19, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2018 [Dataset]. http://doi.org/10.3886/E105403
Explore at:
Unique identifier
https://doi.org/10.3886/E105403
Dataset updated
Feb 19, 2020
Dataset provided by
da|ra (Registration agency for social science and economic data)
Authors
Kaplan, Jacob
Description
For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.Version 4 release notes:Adds data for 2018Version 3 release notes:Adds data in the following formats: Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Version 2 release notes:Adds data for 2017.Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and previous metals, guns), the value of property stolen and and the value for property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported). All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
Superstore Sales Analysis
kaggle.com
Updated Oct 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ali Reda Elblgihy (2023). Superstore Sales Analysis [Dataset]. https://www.kaggle.com/datasets/aliredaelblgihy/superstore-sales-analysis
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 21, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ali Reda Elblgihy
Description
Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:

1- Data Import and Transformation:

Gather and import relevant sales data from various sources into Excel.

Utilize Power Query to clean, transform, and structure the data for analysis.

Merge and link different data sheets to create a cohesive dataset, ensuring that all data fields are connected logically.

2- Data Quality Assessment:

Perform data quality checks to identify and address issues like missing values, duplicates, outliers, and data inconsistencies.

Standardize data formats and ensure that all data is in a consistent, usable state.

3- Calculating COGS:

Determine the Cost of Goods Sold (COGS) for each product sold by considering factors like purchase price, shipping costs, and any additional expenses.

Apply appropriate formulas and calculations to determine COGS accurately.

4- Discount Analysis:

Analyze the discount values offered on products to understand their impact on sales and profitability.

Calculate the average discount percentage, identify trends, and visualize the data using charts or graphs.

5- Sales Metrics:

Calculate and analyze various sales metrics, such as total revenue, profit margins, and sales growth.

Utilize Excel functions to compute these metrics and create visuals for better insights.

6- Visualization:

Create visualizations, such as charts, graphs, and pivot tables, to present the data in an understandable and actionable format.

Visual representations can help identify trends, outliers, and patterns in the data.

7- Report Generation:

Compile the findings and insights into a well-structured report or dashboard, making it easy for stakeholders to understand and make informed decisions.

Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
e
Code to Create a Combined British Household Panel Survey and Understanding...
b2find.eudat.eu
Updated Jul 10, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). Code to Create a Combined British Household Panel Survey and Understanding Society Data File, 1991-2020 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/4ac10493-2c19-5dee-84a1-fab63e35bc8e
Explore at:
Dataset updated
Jul 10, 2023
Description
The Modelling Housing Career Trajectories in Great Britain project used a range of datasets to examine how people’s pathways through the British housing system have changed since the 1990s. The project made extensive use of secondary household panel survey data from the British Household Panel Survey (BHPS, 1991-2008) and its successor, Understanding Society (UKHLS, 2009-present). This collection consists of project R code written to (1) extract and then (2) combine data from across the various files of BHPS and UKHLS into one ‘master’ longitudinal dataset containing annual records for all observed individuals from 1991 to 2020. A number of key individual and household-level variables (including age, sex, country of birth, partnership status, highest educational qualification, employment status, incomes and region) are then (3) cleaned and harmonised. Users can download and adapt the deposited code as needed for their own social science applications.The Modelling Housing Career Trajectories in Great Britain project aimed to develop our understanding of how people's pathways through the housing market are changing in 21st Century Britain. To do this, one strand of empirical research used UK household panel survey data (the British Household Panel Survey and its successor, Understanding Society) to examine housing career pathways and homeownership transitions since the 1990s. This collection contains R code written to (1) extract BHPS and Understanding Society data, (2) combine various data files from the two collections and (3) produce a basic set of harmonised BHSP-Understanding Society variables to support longitudinal analysis covering the 1991-present period. Users can download and adapt the code for their own research projects. This R script was written to extract, combine and clean household panel survey data from the BHPS and UKHLS. The code was written in RStudio 2022.7.2.0 for R 4.2.2.
Automatic Number Plate Recognition Data from Bristol, England for Apr-Jun...
zenodo.org
data.niaid.nih.gov
bin, csv, pdf
Updated Jul 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fiona Crawford; Fiona Crawford (2024). Automatic Number Plate Recognition Data from Bristol, England for Apr-Jun 2019, 2020 and 2021 [Dataset]. http://doi.org/10.5281/zenodo.6723265
Explore at:
csv, pdf, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6723265
Dataset updated
Jul 16, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Fiona Crawford; Fiona Crawford
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Bristol
Description
These files contain data derived from Automatic Number Plate Recognition (ANPR) data collected by Bristol City Council as part of their continuous traffic monitoring work. The data was used in a project examining the impact of Covid-19 on travel behaviour in Bristol and the implications for air quality.

Number plate data cannot be shared, therefore the files contain anonymised vehicle identifiers. Also, specific information about the locations of the ANPR cameras cannot be provided under the data sharing agreement we have with the data owners. Figures 4 and 19 in the project report (see uploaded pdf file) indicate the approximate locations of the ANPR cameras.

The following files have been uploaded:

- a metadata file containing descriptions of variables in the vehicle and trip files

- a copy of the report which came out of this project.

- BCCANPR21_REG_descrip_withID_NOVRM.csv: This file includes vehicle information for all vehicles observed in the trip data. The anonymised identification numbers can be used to merge this data with the trip files.

- BCCANPR21_2019AprMayJun_REG_idv2.csv: Three files of this format contain trip data. Each file contains trips in April to June of the year specified in the file name. The number plate data was processed to chain together observations that were part of the same trip and this file only contains the location of the first and last ANPR camera at which this trip was recorded.

This project was funded by the TRANSITION Clean Air Network through the first round of its Discovery & Innovation Fund in 2021. TRANSITION is a UK-wide network, led by the University of Birmingham in collaboration with nine universities and over 20 cross-sector partners, aiming to optimise the air quality and health outcomes of transport decarbonisation. The network (NERC ref. NE/V002449/1) is itself funded by UK Research & Innovation through its Clean Air Strategic Priorities Fund, administered by the Natural Environment Research Council.

The number plate data used in this research was provided by the Bristol Traffic Control Service at Bristol City Council. The vehicle data was obtained through the DVLA’s Vehicle Enquiry Service API.
A Hybrid Educational Dataset
kaggle.com
Updated Jun 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Emanoel Carvalho Lopes (2025). A Hybrid Educational Dataset [Dataset]. https://www.kaggle.com/datasets/emanoelcarvalholopes/uci-oulad-sintetico-unificados/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 27, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Emanoel Carvalho Lopes
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Context

The early identification of students facing learning difficulties is one of the most critical challenges in modern education. Intervening effectively requires leveraging data to understand the complex interplay between student demographics, engagement patterns, and academic performance.

This dataset was created to serve as a high-quality, pre-processed resource for building machine learning models to tackle this very problem. It is a unique hybrid dataset, meticulously crafted by unifying three distinct sources:

The Open University Learning Analytics Dataset (OULAD): A rich dataset detailing student interactions with a Virtual Learning Environment (VLE). We have aggregated the raw, granular data (over 10 million interaction logs) into powerful features, such as total clicks, average assessment scores, and distinct days of activity for each student registration.

The UCI Student Performance Dataset: A classic educational dataset containing demographic information and final grades in Portuguese and Math subjects from two Portuguese schools.

A Synthetic Data Component: A synthetically generated portion of the data, created to balance the dataset or represent specific student profiles.

Data Unification and Pre-processing

A direct merge of these sources was not possible as the student identifiers were not shared. Instead, a strategy of intelligent concatenation was employed. The final dataset has undergone a rigorous pre-processing pipeline to make it immediately usable for machine learning tasks:

Advanced Imputation: Missing values were handled using a sophisticated iterative imputation method powered by Gaussian Mixture Models (GMM), ensuring the dataset's integrity.

One-Hot Encoding: All categorical features have been converted to a numerical format.

Feature Scaling: All numerical features have been standardized (using StandardScaler) to have a mean of 0 and a standard deviation of 1, preventing model bias from features with different scales.

The result is a clean, comprehensive dataset ready for modeling.

File Information

Instance

Each row represents a student profile, and the columns are the features and the target.

Feature

Features include aggregated online engagement metrics (e.g., clicks, distinct activities), academic performance (grades, scores), and student demographics (e.g., gender, age band). A key feature indicates the original data source (OULAD, UCI, Synthetic).

Sensitive Information

The dataset contains no Personally Identifiable Information (PII). Demographic information is presented in broad, anonymized categories.

Key Columns:

Target Variable: had_difficulty: The primary target for classification. This binary variable has been engineered from the original final_result column of the OULAD dataset. 1: The student either failed (Fail) or withdrew (Withdrawn) from the course. 0: The student passed (Pass or Distinction). Feature Groups: OULAD Aggregated Features (e.g., oulad_total_cliques, oulad_media_notas): Quantitative metrics summarizing a student's engagement and performance within the VLE. Academic Performance Features (e.g., nota_matematica_harmonizada): Harmonized grades from different data sources. Demographic Features (e.g., gender_*, age_band_*): One-hot encoded columns representing student demographics. Origin Features (e.g., origem_dado_OULAD, origem_dado_UCI): One-hot encoded columns indicating the original source of the data for each row. This allows for source-specific analysis.

(Note: All numerical feature names are post-scaling and may not directly reflect their original names. Please refer to the complete column list for details.)

Acknowledgements

This dataset would not be possible without the original data providers. Please acknowledge them in any work that uses this data:

OULAD Dataset: Kuzilek, J., Hlosta, M., and Zdrahal, Z. (2017). Open University Learning Analytics dataset. Scientific Data, 4. https://analyse.kmi.open.ac.uk/open_dataset UCI Student Performance Dataset: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS. https://archive.ics.uci.edu/ml/datasets/student+performance

Inspiration

This dataset is perfect for a variety of predictive modeling tasks. Here are a few ideas to get you started:

Can you build a classification model to predict had_difficulty with high recall? (Minimizing the number of at-risk students we fail to identify).

Which features are the most powerful predictors of student failure or withdrawal? (Feature Importance Analysis).

Can you build separate models for each data origin (origem_dado_*) and compare ...
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Mark Woychick (2021). Cyclistic_data_visualization [Dataset]. https://www.kaggle.com/markwoychick/cyclistic-data-visualization

Cyclistic_data_visualization

Visualization data set for Google Data Analyst certification, Case Study 1

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jun 12, 2021

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Mark Woychick

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Context

I created these files and analysis as part of working on a case study for the Google Data Analyst certificate.

Question investigated: Do annual members and casual riders use Cyclistic bikes differently? Why do we want to know?: Knowing bike usage/behavior by rider type will allow the Marketing, Analytics, and Executive team stakeholders to design, assess, and approve appropriate strategies that drive profitability.

Content

I used the script noted below to clean the files and then added some additional steps to create the visualizations to complete my analysis. The additional steps are noted in corresponding R Markdown file for this data set.

Acknowledgements

Files: most recent 1 year of data available, Divvy_Trips_2019_Q2.csv, Divvy_Trips_2019_Q3.csv, Divvy_Trips_2019_Q4.csv, Divvy_Trips_2020_Q1.csv Source: Downloaded from https://divvy-tripdata.s3.amazonaws.com/index.html

Data cleaning script: followed this script to clean and merge files https://docs.google.com/document/d/1gUs7-pu4iCHH3PTtkC1pMvHfmyQGu0hQBG5wvZOzZkA/copy

Note: Combined data set has 3,876,042 rows, so you will likely need to run R analysis on your computer (e.g., R Console) rather than in the cloud (e.g., RStudio Cloud)

Inspiration

This was my first attempt to conduct an analysis in R and create the R Markdown file. As you might guess, it was an eye-opening experience, with both exciting discoveries and aggravating moments.

One thing I have not yet been able to figure out is how to add a legend to the map. I was able to get a legend to appear on a separate (empty) map, but not on the map you will see here.

I am also interested to see what others did with this analysis - what were the findings and insights you found?

Clear search

Close search

Google apps

Main menu

Cyclistic_data_visualization

Context

Content

Acknowledgements

Inspiration

Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

Cleaned NHANES 1988-2018

Replication Data for: Trajectories of mental health problems in childhood...

Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race,...

Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

Superstore Sales Analysis

Code to Create a Combined British Household Panel Survey and Understanding...

Automatic Number Plate Recognition Data from Bristol, England for Apr-Jun...

A Hybrid Educational Dataset

Context

Data Unification and Pre-processing

File Information

Instance

Feature

Sensitive Information

Acknowledgements

Inspiration

Cyclistic_data_visualization

Visualization data set for Google Data Analyst certification, Case Study 1

Context

Content

Acknowledgements

Inspiration