18 datasets found

Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Explore at:
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
Messy data for data cleaning exercise - Dataset - openAFRICA
open.africa
Updated Oct 6, 2021
Cite
(2021). Messy data for data cleaning exercise - Dataset - openAFRICA [Dataset]. https://open.africa/dataset/messy-data-for-data-cleaning-exercise
Explore at:
Dataset updated
Oct 6, 2021
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A messy dataset for demonstrating "how to clean data using spreadsheet". This dataset was intentionally formatted to be messy for the purpose of demonstration. It was collated from https://openafrica.net/dataset/historic-and-projected-rainfall-and-runoff-for-4-lake-victoria-sub-regions
Navigating Stats Can Data & Scrubbing Data Clean with Excel Workshop
search.dataone.org
borealisdata.ca
Updated Jul 31, 2024
Cite
Costanzo, Lucia; Jadon, Vivek (2024). Navigating Stats Can Data & Scrubbing Data Clean with Excel Workshop [Dataset]. http://doi.org/10.5683/SP3/FF6AI9
Explore at:
Unique identifier
https://doi.org/10.5683/SP3/FF6AI9
Dataset updated
Jul 31, 2024
Dataset provided by
Borealis
Authors
Costanzo, Lucia; Jadon, Vivek
Description
Ahoy, data enthusiasts! Join us for a hands-on workshop where you will hoist your sails and navigate through the Statistics Canada website, uncovering hidden treasures in the form of data tables. With the wind at your back, you’ll master the art of downloading these invaluable Stats Can datasets while braving the occasional squall of data cleaning challenges using Excel with your trusty captains Vivek and Lucia at the helm.
Global import data of Clean,excel
volza.com
csv
Updated Mar 7, 2025
+ more versions
Cite
Volza FZ LLC (2025). Global import data of Clean,excel [Dataset]. https://www.volza.com/imports-india/india-import-data-of-clean-excel-from-italy
Explore at:
Available download formats: csv
Dataset updated
Mar 7, 2025
Dataset provided by
Volza
Authors
Volza FZ LLC
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Count of importers, Sum of import value, 2014-01-01/2021-09-30, Count of import shipments
Description
23656 Global import shipment records of Clean,excel with prices, volume & current Buyer's suppliers relationships based on actual Global export trade database.
Global exporters importers-export import data of Clean excel
volza.com
csv
Updated May 31, 2025
Cite
Volza FZ LLC (2025). Global exporters importers-export import data of Clean excel [Dataset]. https://www.volza.com/trade-data-global/global-exporters-importers-export-import-data-of-clean+excel
Explore at:
Available download formats: csv
Dataset updated
May 31, 2025
Dataset provided by
Volza
Authors
Volza FZ LLC
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Count of exporters, Count of importers, Count of shipments, Sum of export import value
Description
9130 Global exporters importers export import shipment records of Clean excel with prices, volume & current Buyer's suppliers relationships based on actual Global export trade database.
Global export data of Clean,excel
volza.com
csv
Updated Sep 7, 2025
+ more versions
Cite
Volza FZ LLC (2025). Global export data of Clean,excel [Dataset]. https://www.volza.com/exports-india/india-export-data-of-clean-excel-to-saudi-arabia
Explore at:
Available download formats: csv
Dataset updated
Sep 7, 2025
Dataset provided by
Volza
Authors
Volza FZ LLC
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Count of exporters, Sum of export value, 2014-01-01/2021-09-30, Count of export shipments
Description
9686 Global export shipment records of Clean,excel with prices, volume & current Buyer's suppliers relationships based on actual Global export trade database.
Electrification of Heat Demonstration Project: Heat Pump Performance...
datacatalogue.cessda.eu
beta.ukdataservice.ac.uk
Updated Dec 20, 2024
+ more versions
Cite
Energy Systems Catapult (2024). Electrification of Heat Demonstration Project: Heat Pump Performance Cleansed Data, 2020-2023 [Dataset]. http://doi.org/10.5255/UKDA-SN-9050-2
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-9050-2
Dataset updated
Dec 20, 2024
Authors
Energy Systems Catapult
Time period covered
Nov 1, 2020 - Sep 29, 2023
Area covered
United Kingdom
Variables measured
Families/households, Subnational
Measurement technique
Measurements and tests
Description
Abstract copyright UK Data Service and data collection copyright owner.

The heat pump monitoring datasets are a key output of the Electrification of Heat Demonstration (EoH) project, a government-funded heat pump trial assessing the feasibility of heat pumps across the UK’s diverse housing stock. These datasets are provided in both cleansed and raw form and allow analysis of the initial performance of the heat pumps installed in the trial. From the datasets, insights such as heat pump seasonal performance factor (a measure of the heat pump's efficiency), heat pump performance during the coldest day of the year, and half-hourly performance to inform peak demand can be gleaned.

For the second edition (December 2024), the data were updated to include cleaned performance data collected between November 2020 and September 2023. The only documentation currently available with the study is the Excel data dictionary. Reports and other contextual information can be found on the Energy Systems Catapult website.

The EoH project was funded by the Department of Business, Energy and Industrial Strategy. From 2023, it is covered by the new Department for Energy Security and Net Zero.

Data availability

This study comprises the open-access cleansed data from the EoH project and a summary dataset, available in four zipped files (see the 'Access Data' tab). Users must download all four zip files to obtain the full set of cleansed data and accompanying documentation.

When unzipped, the full cleansed data comprises 742 CSV files. Most of the individual CSV files are too large to open in Excel. Users should ensure they have sufficient computing facilities to analyse the data.
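Since most of the individual CSV files are reported to be too large for Excel, one option is to stream them row by row instead of loading them whole. The following is a minimal sketch using only Python's standard library; the column name used in the usage note (`heat_output`) is hypothetical and is not taken from the study's data dictionary.

```python
import csv
from typing import Iterator


def iter_rows(path: str) -> Iterator[dict]:
    """Yield one row at a time so the whole file never sits in memory."""
    with open(path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)


def column_mean(path: str, column: str) -> float:
    """Average a numeric column, skipping blank or non-numeric cells."""
    total, count = 0.0, 0
    for row in iter_rows(path):
        try:
            total += float(row[column])
        except (KeyError, TypeError, ValueError):
            # Missing column, short row, or a cell that is not a number.
            continue
        count += 1
    if count == 0:
        raise ValueError(f"no numeric values found in column {column!r}")
    return total / count
```

A call such as `column_mean("some_file.csv", "heat_output")` would then process a file of any size with roughly constant memory use; the actual column names should be taken from the Excel data dictionary.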

The UKDS also holds an accompanying study, SN 9049 Electrification of Heat Demonstration Project: Heat Pump Performance Raw Data, 2020-2023, which is available only to registered UKDS users. This contains the raw data from the EoH project. Since the data are very large, only the summary dataset is available to download; an order must be placed for FTP delivery of the remaining raw data. Other studies in the set include SN 9209, which comprises 30-minute interval heat pump performance data, and SN 9210, which includes daily heat pump performance data.

The Python code used to cleanse the raw data and then perform the analysis is accessible via the Energy Systems Catapult GitHub.

Main Topics:

Heat Pump Performance across the BEIS funded heat pump trial, The Electrification of Heat (EoH) Demonstration Project. See the documentation for data contents.
popular baby names with data cleaning
kaggle.com
Updated Jun 11, 2023
Cite
Real Sourabh Singhal (2023). popular baby names with data cleaning [Dataset]. https://www.kaggle.com/datasets/realsourabhsinghal/popular-baby-names-with-data-cleaning/versions/1
Explore at:
Dataset updated
Jun 11, 2023
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Real Sourabh Singhal
Description
A fully cleaned Excel file, intended to support accurate data analysis with proper visualization.
Data from: Designing data science workshops for data-intensive environmental...
data.niaid.nih.gov
zenodo.org
+1more
zip
Updated Dec 8, 2020
Cite
Allison Theobold; Stacey Hancock; Sara Mannheimer (2020). Designing data science workshops for data-intensive environmental science research [Dataset]. http://doi.org/10.5061/dryad.7wm37pvp7
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.7wm37pvp7
Dataset updated
Dec 8, 2020
Dataset provided by
California State Polytechnic University
Montana State University
Authors
Allison Theobold; Stacey Hancock; Sara Mannheimer
License
https://spdx.org/licenses/CC0-1.0.html
Description
Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum," the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.

Methods: Surveys from Carpentries-style workshops, the results of which are presented in the accompanying manuscript.

Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.

The surveys administered for the fall 2018, spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files. The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw. The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey. The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively. The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean. The summaries and visualizations presented in the manuscript are included in the analysis annotated RMarkdown file.
Low-Income Energy Affordability Data (LEAD) Tool
data.amerigeoss.org
datadiscoverystudio.org
+1more
csv, pdf, xls, xlsb
Updated Jul 29, 2019
Cite
United States[old] (2019). Low-Income Energy Affordability Data (LEAD) Tool [Dataset]. https://data.amerigeoss.org/vi/dataset/clean-energy-for-low-income-communities-accelerator-energy-data-profiles-2fffb
Explore at:
Available download formats: csv, xls, pdf, xlsb
Dataset updated
Jul 29, 2019
Dataset provided by
United States[old]
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ABOUT THIS TOOL:

The Better Building’s Clean Energy for Low Income Communities Accelerator (CELICA) was launched in 2016 to help state and local partners across the nation meet their goals for increasing uptake of energy efficiency and renewable energy technologies in low and moderate income communities. As a part of the Accelerator, DOE created this Low-Income Energy Affordability Data (LEAD) Tool to assist partners with understanding their LMI community characteristics. This can be utilized for low income and moderate income energy policy and program planning, as it provides interactive state, county and city level worksheets with graphs and data including number of households at different income levels and numbers of homeowners versus renters. It provides a breakdown based on fuel type, building type, and construction year. It also provides average monthly energy expenditures and energy burden (percentage of income spent on energy).

HOW TO USE:

The LEAD tool can be used to support program design and goal setting, and it can be paired with other data to improve LMI community energy benchmarking and program evaluation. Datasets are available for all 50 states, census divisions, and tract levels. You will have to enable macros in MS Excel to interact with the data. A description of each of the files and what states are included in each U.S. Census Division can be found in the file "DESCRIPTION OF FILES".

For more information, visit: https://betterbuildingsinitiative.energy.gov/accelerators/clean-energy-low-income-communities
To Estimate and Optimize the Source of Drinking Water for Metro Vancouver...
borealisdata.ca
open.library.ubc.ca
Updated Feb 28, 2019
Cite
Shahram Yarmand (2019). To Estimate and Optimize the Source of Drinking Water for Metro Vancouver until 2040 [Dataset]. http://doi.org/10.5683/SP2/6KU4I7
Explore at:
Unique identifier
https://doi.org/10.5683/SP2/6KU4I7
Dataset updated
Feb 28, 2019
Dataset provided by
Borealis
Authors
Shahram Yarmand
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Oct 2017 - Nov 2017
Area covered
Metro Vancouver
Description
The population of Metro Vancouver (20110729Regional Growth Strategy Projections Population, Housing and Employment 2006 – 2041 File) will have increased greatly by 2040, and finding a new source of reservoirs for drinking water (2015_ Water Consumption_ Statistics File) will be essential. This issue of drinking water needs to be optimized and estimated (Data Mining file) with the aim of developing the region. The three current water reservoirs for Metro Vancouver are Capilano, Seymour, and Coquitlam, from which treated water is supplied to customers. The linear optimization (LP) model (Optimization, Sensitivity Report File) illustrates the amount of drinking water for each reservoir and region. In fact, the B.C. government has a specific strategy for the growing population till 2040, which leads them toward their goal. In addition, another factor is the new water source for drinking water that needs to be estimated and monitored to anticipate the feasible water source (wells) until 2040. As such, the government will have to make a decision on how much groundwater is used. The project has two goals: (1) an optimization model for three water reservoirs, and (2) estimating the new source of water to 2040.

The data is analyzed with Trifacta Wrangler, AMPL, Excel Solver, ArcGIS, and SQL, and is visualized in Tableau:
1. Trifacta Wrangler cleans the data (Data Mining file).
2. AMPL and Excel Solver optimize drinking water consumption for Metro Vancouver (data in the Optimization and Sensitivity Report file).
3. ArcMap combines the raw data with the results of the reservoir optimization and the population estimates to 2040 in ArcGIS (GIS Map for Tableau file).
4. SQL in Tableau is used to visualize, estimate, and optimize the source of drinking water for Metro Vancouver until 2040 (export tableau data file).

Clean Label Ingredients Market Size, Share, Growth Analysis, By Form(Powder,...
skyquestt.com
Updated Jan 15, 2024
Cite
SkyQuest Technology (2024). Clean Label Ingredients Market Size, Share, Growth Analysis, By Form(Powder, Liquid, Others), By Type(Natural Colors, Natural Flavor, Fruit and Vegetable ingredient, Starch and Sweeteners), By Application(Food, Pet Food, Dairy, Non-Dairy), By Distribution Channel(B2B, B2C), By Region - Industry Forecast 2024-2031 [Dataset]. https://www.skyquestt.com/report/clean-label-ingredients-market
Explore at:
Dataset updated
Jan 15, 2024
Dataset authored and provided by
SkyQuest Technology
License
https://www.skyquestt.com/privacy/
Time period covered
2024 - 2031
Area covered
Global
Description
Global Clean label ingredients Market size was valued at USD 47.10 Billion in 2022 and is poised to grow from USD 50.17 Billion in 2023 to USD 88.03 Billion by 2031, at a CAGR of 6.5% during the forecast period (2024-2031).

Report Metric	Details
Market size value in 2022	USD 47.10 Billion
Market size value in 2023	USD 50.17 Billion
Market size value in 2031	USD 88.03 Billion
Forecast Year	2024-2031
Growth Rate (CAGR)	6.5%
Segments Covered	Form Dry, and Liquid Type Flavors, Colorants, Preservatives, Emulsifier, Stabilizer, and Thickeners (EST) & Others
Largest Market	North America
Fastest Growing Market	Asia Pacific
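
For readers who want to relate the market-size values to a growth rate, the standard compound annual growth rate formula is (end / start) ** (1 / years) - 1. The sketch below is illustrative only: the choice of base year and the number of compounding periods are assumptions, and the result depends on them, so it will not necessarily reproduce the rate quoted in the report.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures from the table above; treating 2023 -> 2031 as 8 compounding
# periods is an assumption made for this example.
rate = cagr(50.17, 88.03, 8)
print(f"{rate:.1%}")
```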

Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...
datasearch.gesis.org
openicpsr.org
Updated Feb 19, 2020
Cite
Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017 [Dataset]. http://doi.org/10.3886/E105403V3
Explore at:
Unique identifier
https://doi.org/10.3886/E105403V3
Dataset updated
Feb 19, 2020
Dataset provided by
da|ra (Registration agency for social science and economic data)
Authors
Kaplan, Jacob
Description
For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.

Version 3 release notes:
Adds data in the following formats: Excel.
Changes project name to avoid confusing this data for the ones done by NACJD.

Version 2 release notes:
Adds data for 2017.
Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.

Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and precious metals, guns), the value of property stolen and the value for property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported).

All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.

There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."

For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.

To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
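
The replacement of suspect values described above can be sketched as follows. The author's actual cleaning was done in R (see the linked repository); this Python version is only an illustration of the rule as stated, and the column names in the usage note are hypothetical.

```python
# Values the description lists as common data-entry errors.
SUSPECT_VALUES = (
    {k * 1000 for k in range(1, 11)}      # 1000, 2000, ..., 10000
    | {k * 10000 for k in range(2, 11)}   # 20000, 30000, ..., 100000
    | {99942}
)

# Column-name prefixes the description says were cleaned.
# Columns starting with "value" are deliberately left untouched.
CLEANED_PREFIXES = ("offenses", "auto")


def clean_row(row: dict) -> dict:
    """Return a copy of the row with suspect values replaced by None (NA)."""
    cleaned = {}
    for name, value in row.items():
        if name.startswith(CLEANED_PREFIXES) and value in SUSPECT_VALUES:
            cleaned[name] = None
        else:
            cleaned[name] = value
    return cleaned
```

For example, `clean_row({"offenses_x": 3000, "value_x": 3000})` would blank the first entry and keep the second, mirroring the statement that the "value" columns were not cleaned.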
‘Cardiovascular diseases dataset (clean)’ analyzed by Analyst-2
analyst-2.ai
Updated Mar 15, 2020
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Cardiovascular diseases dataset (clean)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-cardiovascular-diseases-dataset-clean-cdcb/latest
Explore at:
Dataset updated
Mar 15, 2020
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Cardiovascular diseases dataset (clean)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/aiaiaidavid/cardio-data-dv13032020 on 13 February 2022.

--- Dataset description provided by original source is as follows ---

Description of the data set

This data set is a cleaned up copy of cardio_train.csv which can be found at:

https://www.kaggle.com/sulianova/cardiovascular-disease-dataset

The original data set has been analyzed with Excel, correcting negative values, and removing outliers.

A number of features in the dataset are used to predict the presence or absence of a cardiovascular disease.

Below is a description of the features:

AGE: integer (years of age)
HEIGHT: integer (cm)
WEIGHT: integer (kg)
GENDER: categorical (1: female, 2: male)
AP_HIGH: systolic blood pressure, integer
AP_LOW: diastolic blood pressure, integer
CHOLESTEROL: categorical (1: normal, 2: above normal, 3: well above normal)
GLUCOSE: categorical (1: normal, 2: above normal, 3: well above normal)
SMOKE: categorical (0: no, 1: yes)
ALCOHOL: categorical (0: no, 1: yes)
PHYSICAL_ACTIVITY: categorical (0: no, 1: yes)

And the target variable:

CARDIO_DISEASE: categorical (0: no, 1: yes)

--- Original source retains full ownership of the source dataset ---
ENTSO-E Hydropower modelling data (PECD) in CSV format
zenodo.org
csv
Updated Aug 14, 2020
Cite
Matteo De Felice; Matteo De Felice (2020). ENTSO-E Hydropower modelling data (PECD) in CSV format [Dataset]. http://doi.org/10.5281/zenodo.3950048
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.3950048
Dataset updated
Aug 14, 2020
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Matteo De Felice; Matteo De Felice
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PECD Hydro modelling

This repository contains a more user-friendly version of the Hydro modelling data released by ENTSO-E with their latest Seasonal Outlook.

The original URLs:

The zipped file: https://eepublicdownloads.blob.core.windows.net/public-cdn-container/clean-documents/sdc-documents/seasonal/SOR2020/data/Hydro.zip

The documentation file (v 1.0): https://eepublicdownloads.blob.core.windows.net/public-cdn-container/clean-documents/sdc-documents/MAF/2019/Hydropower_Modelling_New_database_and_methodology.pdf

The original ENTSO-E hydropower dataset integrates the PECD (Pan-European Climate Database) released for the MAF 2019.

As I did for the wind & solar data, the datasets released in this repository are only a more user- and machine-readable version of the original Excel files. As an avid user of ENTSO-E data, with this repository I want to share my data wrangling efforts to make this dataset more accessible.

Data description

The zipped file contains 86 Excel files, two different files for each ENTSO-E zone.

In this repository you can find 5 CSV files:

PECD-hydro-capacities.csv: installed capacities

PECD-hydro-weekly-inflows.csv: weekly inflows for reservoir and open-loop pumping

PECD-hydro-daily-ror-generation.csv: daily run-of-river generation

PECD-hydro-weekly-reservoir-min-max-generation.csv: minimum and maximum weekly reservoir generation

PECD-hydro-weekly-reservoir-min-max-levels.csv: weekly minimum and maximum reservoir levels

Capacities

The file PECD-hydro-capacities.csv contains: run of river capacity (MW) and storage capacity (GWh), reservoir plants capacity (MW) and storage capacity (GWh), closed-loop pumping/turbining (MW) and storage capacity and open-loop pumping/turbining (MW) and storage capacity. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Run-of-River and pondage, rows from 5 to 7, columns from 2 to 5

sheet Reservoir, rows from 5 to 7, columns from 1 to 3

sheet Pump storage - Open Loop, rows from 5 to 7, columns from 1 to 3

sheet Pump storage - Closed Loop, rows from 5 to 7, columns from 1 to 3

Inflows

The file PECD-hydro-weekly-inflows.csv contains the weekly inflow (GWh) for the climatic years 1982-2017 for reservoir plants and open-loop pumping. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 13 to 66, columns from 16 to 51

sheet Pump storage - Open Loop, rows from 13 to 66, columns from 16 to 51
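
The row and column ranges quoted in these sections can be read as rectangular blocks of a worksheet. The sketch below shows that slicing on a plain list of lists, which any Excel reader could supply. It assumes the quoted row and column numbers are 1-based and inclusive, which the description does not state explicitly; `extract_block` is a hypothetical helper, not part of the repository.

```python
from typing import List, Sequence


def extract_block(
    grid: Sequence[Sequence[object]],
    first_row: int,
    last_row: int,
    first_col: int,
    last_col: int,
) -> List[list]:
    """Return the cells in the inclusive, 1-based row/column range."""
    if first_row < 1 or first_col < 1:
        raise ValueError("row and column numbers are 1-based")
    if last_row < first_row or last_col < first_col:
        raise ValueError("range end must not precede range start")
    # Convert 1-based inclusive bounds to Python's 0-based half-open slices.
    return [
        list(row[first_col - 1:last_col])
        for row in grid[first_row - 1:last_row]
    ]
```

Under that assumption, "rows from 13 to 66, columns from 16 to 51" selects a block of 54 rows by 36 columns, and 36 is also the number of climatic years in the stated 1982-2017 range.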

Daily run-of-river

The file PECD-hydro-daily-ror-generation.csv contains the daily run-of-river generation (GWh). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Run-of-River and pondage, rows from 13 to 378, columns from 15 to 51

Minimum and maximum reservoir generation

The file PECD-hydro-weekly-reservoir-min-max-generation.csv contains the minimum and maximum generation (MW, weekly) for reservoir-based plants for the climatic years 1982-2017. The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 13 to 66, columns from 196 to 231

sheet Reservoir, rows from 13 to 66, columns from 232 to 267

Minimum/Maximum reservoir levels

The file PECD-hydro-weekly-reservoir-min-max-levels.csv contains the minimum/maximum reservoir levels at beginning of each week (scaled coefficient from 0 to 1). The data is extracted from the Excel files with the name starting with PEMM from the following sections:

sheet Reservoir, rows from 14 to 66, column 12

sheet Reservoir, rows from 14 to 66, column 13

CHANGELOG

[2020/07/17] Added maximum generation for the reservoir
Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race,...
search.datacite.org
openicpsr.org
Updated 2018
Cite
Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016 [Dataset]. http://doi.org/10.3886/e102263v5-10021
Explore at:
Unique identifier
https://doi.org/10.3886/e102263v5-10021
Dataset updated
2018
Dataset provided by
Inter-university Consortium for Political and Social Research: https://www.icpsr.umich.edu/web/pages/
DataCite: https://www.datacite.org/
Authors
Jacob Kaplan
Description
Version 5 release notes:
Removes support for SPSS and Excel data.
Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.
Adds in agencies that report 0 months of the year.
Adds a column that indicates the number of months reported. This is generated by summing up the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime. They may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime.
Removes data on runaways.
Version 4 release notes:
Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
Version 3 release notes:
Add data for 2016.
Order rows by year (descending) and ORI.
Version 2 release notes:
Fix bug where Philadelphia Police Department had incorrect FIPS county code.
The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2015 into a single file. These files are quite large and may take some time to load.
All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possibly incorrect) assumption that these values represent zero crimes reported. The original data does not have a value when the agency reports zero arrests other than "None/not reported." In other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.

To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, if you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrested for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.

To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS codes (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros; if you open the file in Excel, it will automatically delete those leading zeros.
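
One way to avoid losing the leading zeros is to read FIPS codes as text and, where they have already been stripped, pad them back to their standard widths (2 digits for state, 3 for county). The sketch below uses Python's csv module, which reads every field as a string. The column names and sample rows are hypothetical and may not match the actual files.

```python
import csv
import io

# Hypothetical sample; the second row has already lost its leading zeros.
sample = io.StringIO(
    "agency,fips_state_code,fips_county_code\n"
    "Example PD,06,037\n"
    "Other PD,1,3\n"
)


def pad_fips(value, width):
    """Left-pad a FIPS code with zeros to its standard width."""
    return value.strip().zfill(width)


rows = []
for row in csv.DictReader(sample):
    row["fips_state_code"] = pad_fips(row["fips_state_code"], 2)
    row["fips_county_code"] = pad_fips(row["fips_county_code"], 3)
    rows.append(row)

print(rows)
```

Keeping the codes as strings also matters when merging with other data, since "06" and 6 would not match as join keys.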

I created 9 arrest categories myself. The categories are:
Total Male Juvenile
Total Female Juvenile
Total Male Adult
Total Female Adult
Total Male
Total Female
Total Juvenile
Total Adult
Total Arrests

All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than on the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses.
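
The nine totals can be derived from the sex-age cells roughly as sketched below. Each cell is represented here as a (sex, is_juvenile, count) tuple, which is an assumption made for illustration. The real files use one column per sex-age category, and the key names in the result are hypothetical rather than the actual column names.

```python
def nine_totals(cells):
    """Build the nine summary categories from sex-age cells.

    cells: iterable of (sex, is_juvenile, count), one per sex-age column,
    where sex is "male" or "female".
    """
    totals = {
        "total_male_juvenile": 0,
        "total_female_juvenile": 0,
        "total_male_adult": 0,
        "total_female_adult": 0,
    }
    for sex, is_juvenile, count in cells:
        group = "juvenile" if is_juvenile else "adult"
        totals[f"total_{sex}_{group}"] += count

    # The remaining five categories are sums of the four base categories.
    totals["total_male"] = totals["total_male_juvenile"] + totals["total_male_adult"]
    totals["total_female"] = totals["total_female_juvenile"] + totals["total_female_adult"]
    totals["total_juvenile"] = totals["total_male_juvenile"] + totals["total_female_juvenile"]
    totals["total_adult"] = totals["total_male_adult"] + totals["total_female_adult"]
    totals["total_arrests"] = totals["total_male"] + totals["total_female"]
    return totals


# Hypothetical counts for one agency-year and one crime.
example = nine_totals([
    ("male", True, 3), ("male", False, 10),
    ("female", True, 1), ("female", False, 4),
])
print(example)
```

Summing from the sex-age cells means the totals do not depend on whether an agency reported the race breakdown.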

As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files: eight that contain different crimes, plus the "simple" file. Each file contains the data for all years. Each of the eight files contains crimes belonging to one major crime category, and the files do not overlap in crimes other than with the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Because Stata limits column names to a maximum of 32 characters, I have abbreviated the crime names in the data. The files and their included crimes are:

Index Crimes: Murder, Rape, Robbery, Aggravated Assault, Burglary, Theft, Motor Vehicle Theft, Arson
Alcohol Crimes: DUI, Drunkenness, Liquor
Drug Crimes: Total Drug, Total Drug Sales, Total Drug Possession, Cannabis Possession, Cannabis Sales, Heroin or Cocaine Possession, Heroin or Cocaine Sales, Other Drug Possession, Other Drug Sales, Synthetic Narcotic Possession, Synthetic Narcotic Sales
Grey Collar and Property Crimes: Forgery, Fraud, Stolen Property
Financial Crimes: Embezzlement, Total Gambling, Other Gambling, Bookmaking, Numbers Lottery
Sex or Family Crimes: Offenses Against the Family and Children, Other Sex Offenses, Prostitution, Rape
Violent Crimes: Aggravated Assault, Murder, Negligent Manslaughter, Robbery, Weapon Offenses
Other Crimes: Curfew, Disorderly Conduct, Other Non-traffic, Suspicion, Vandalism, Vagrancy
Simple: This data set has every crime and only the arrest categories that I created (see above).
Hourly data of a building located at the Campus of University of Girona....
explore.openaire.eu
zenodo.org
Updated Sep 26, 2019
Joaquim Massana I Raurich; Carles Pous; Llorenç Burgas I Nadal; Joaquim Melendez; Joan Colomer (2019). Hourly data of a building located at the Campus of University of Girona. Data were collected from 2011 to 2014. [Dataset]. http://doi.org/10.5281/zenodo.3461727
Unique identifier
https://doi.org/10.5281/zenodo.3461727
Dataset updated
Sep 26, 2019
Authors
Joaquim Massana I Raurich; Carles Pous; Llorenç Burgas I Nadal; Joaquim Melendez; Joan Colomer
Area covered
Girona
Description
This dataset supplements the journal paper "Short-term load forecasting for non-residential buildings contrasting artificial occupancy attributes". Authors: J. Massana, C. Pous et al. Journal: Energy and Buildings, 2015, vol. 130, p. 519-531. The paper is accessible at https://doi.org/10.1016/j.enbuild.2016.08.081
Description: Each Excel file contains hourly data of one building located at the Campus of the University of Girona. Data were collected from 2011 to 2014.
Column information for the Excel files:
- Hora: hour of the day (0, 1... 23).
- Dia: day of the month (1, 2... 31).
- Mes: month (1, 2... 12).
- Any: year (2011... 2014).
- Dia_set: day of the week (1, 2... 7).
- Temp: outdoor temperature (°C).
- Perfil_dia: daily profile (school day, non-school day, examination day, school-leaving examination day, August day, holiday and weekend day and, finally, Easter and Christmas holiday).
- Indicador X.X: occupancy indicators, as described in the paper.
- Cons: electrical consumption (kWh).
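
Since the date and time are split across four columns, a single timestamp per row is often convenient for time-series work. The sketch below combines Any, Mes, Dia and Hora with Python's datetime. The sample row, including its Cons value, is invented for illustration, and reading the Excel files themselves is left out.

```python
from datetime import datetime


def row_timestamp(row):
    """Combine the Any, Mes, Dia and Hora columns into one datetime."""
    return datetime(int(row["Any"]), int(row["Mes"]), int(row["Dia"]), int(row["Hora"]))


# Hypothetical row; the values are not taken from the dataset.
sample_row = {"Hora": 13, "Dia": 5, "Mes": 3, "Any": 2012, "Cons": 41.7}
ts = row_timestamp(sample_row)
print(ts.isoformat())
```

Because Hora runs from 0 to 23, it maps directly onto the hour argument of datetime without any offset.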
Data from: Low-cost, local production of a safe and effective disinfectant...
data.niaid.nih.gov
datadryad.org
zip
Updated Jun 21, 2024
Andrea Naranjo-Soledad; Logan Smesrud; Siva RS Bandaru; Dana Hernandez; Meire Mehare; Sara Mahmoud; Vijay Matange; Bakul Rao; Chandana N; Paige Balcom; David Omole; Cesar Alvarez-Mejia; Varinia Lopez-Ramirez; Ashok Gadgil (2024). Low-cost, local production of a safe and effective disinfectant for resource-constrained communities [Dataset]. http://doi.org/10.5061/dryad.2547d7wz5
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.2547d7wz5
Dataset updated
Jun 21, 2024
Dataset provided by
University of California, Berkeley
Tecnológico Nacional de México
Indian Institute of Technology Bombay
Gulu University
Indian Institute of Technology Jodhpur
VINYS Architects
Covenant University
Authors
Andrea Naranjo-Soledad; Logan Smesrud; Siva RS Bandaru; Dana Hernandez; Meire Mehare; Sara Mahmoud; Vijay Matange; Bakul Rao; Chandana N; Paige Balcom; David Omole; Cesar Alvarez-Mejia; Varinia Lopez-Ramirez; Ashok Gadgil
License
CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Description
Improved hygiene depends on the accessibility and availability of effective disinfectant solutions. These disinfectant solutions are unavailable to many communities worldwide due to resource limitations, among other constraints. Safe and effective chlorine-based disinfectants can be produced via simple electrolysis of salt water, providing a low-cost and reliable option for on-site, local production of disinfectant solutions to improve sanitation and hygiene. This study reports on a system (herein called “Electro-Clean”) that can produce concentrated solutions of hypochlorous acid (HOCl) using readily available, low-cost materials. With just table salt, water, graphite welding rods, and a DC power supply, the Electro-Clean system can safely produce HOCl solutions (~1.5 liters) of up to 0.1% free chlorine (i.e., 1000 ppm) in less than two hours at low potential (5 V DC) and modest current (~5 A). The Electro-Clean system components underwent rigorous testing of free chlorine production and durability, described here, and the system has been verified to work in multiple locations around the world, including microbiological tests conducted in India and Mexico to confirm the biocidal efficacy of the Electro-Clean solution as a surface disinfectant. Cost estimates are provided for making HOCl locally with this method in the USA, India, and Mexico. Findings indicate that Electro-Clean is an affordable alternative to off-the-shelf commercial chlorinator systems in terms of first costs (or capital costs), and cost-competitive relative to the unit cost of the disinfectant produced. By minimizing dependence on supply chains and allowing for local production, the Electro-Clean system has the potential to improve public health by addressing the need for disinfectant solutions in resource-constrained communities.
Methods
We conducted chemical experiments in a laboratory setting, performing each experiment in triplicate unless otherwise specified. The dataset presented here includes the raw data from these experiments. We used Excel to record the data and calculate the average and standard deviation. The file names correspond to the figures or tables in the manuscript or the Supporting Information Appendices. Detailed descriptions of the experimental methods can be found in the main manuscript.
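
The summary statistics mentioned in the Methods can be reproduced outside Excel. The sketch below computes the mean and standard deviation of one triplicate and converts a percentage concentration to ppm, matching the 0.1% = 1000 ppm equivalence stated above. The three readings are hypothetical, and the sample (n - 1) standard deviation is an assumption, since the description does not say which Excel function was used.

```python
import statistics


def percent_to_ppm(percent):
    """Convert a percentage concentration to parts per million (1% = 10,000 ppm)."""
    return percent * 10_000


# Hypothetical free chlorine readings (ppm) from one triplicate experiment.
triplicate = [980, 1000, 1020]

average = statistics.mean(triplicate)
# Sample standard deviation (n - 1 denominator); an assumption, see above.
spread = statistics.stdev(triplicate)

print(average, spread, percent_to_ppm(0.1))
```

If the population form (n denominator) was used instead, statistics.pstdev is the matching call.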
Data Cleaning Sample: 154 scholarly articles cite this dataset.
