55 datasets found

B
Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
R code
figshare.com
txt
Updated Jun 5, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christine Dodge (2017). R code [Dataset]. http://doi.org/10.6084/m9.figshare.5021297.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5021297.v1
Dataset updated
Jun 5, 2017
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Christine Dodge
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
R code used for each data set to perform negative binomial regression, calculate overdispersion statistic, generate summary statistics, remove outliers
d
Replication Data for: Race, gender, and the politics of incivility
search.dataone.org
dataverse.harvard.edu
Updated Nov 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gubitz, Sam (2023). Replication Data for: Race, gender, and the politics of incivility [Dataset]. http://doi.org/10.7910/DVN/ODPNI8
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/ODPNI8
Dataset updated
Nov 22, 2023
Dataset provided by
Harvard Dataverse
Authors
Gubitz, Sam
Description
Use the project file first, then open the cleaning R file to clean the raw data. Then use the R file called OLS analysis to analyze the cleaned data, which was outputted as a .rds file.
A dataset for temporal analysis of files related to the JFK case
zenodo.org
data.niaid.nih.gov
csv
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Markus Luczak-Roesch; Markus Luczak-Roesch (2020). A dataset for temporal analysis of files related to the JFK case [Dataset]. http://doi.org/10.5281/zenodo.1098568
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.1098568
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Markus Luczak-Roesch; Markus Luczak-Roesch
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains the content of the subset of all files with a correct publication date from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release). This content was extracted from the source PDF files using the R OCR libraries tesseract and pdftools.

The code to derive the dataset is given as follows:

### BEGIN R DATA PROCESSING SCRIPT

library(tesseract)
library(pdftools)

pdfs <- list.files("/home/STAFF/luczakma/RProjects/JFK/data/files/")

meta <- read.csv2("/home/STAFF/luczakma/RProjects/JFK/data/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv",header = T,sep = ',')

meta$Doc.Date <- as.character(meta$Doc.Date)

meta.clean <- meta[-which(meta$Doc.Date=="" | grepl("/0000",meta$Doc.Date)),]
for(i in 1:nrow(meta.clean)){
meta.clean$Doc.Date[i] <- gsub("00","01",meta.clean$Doc.Date[i])

if(nchar(meta.clean$Doc.Date[i])<10){
meta.clean$Doc.Date[i]<-format(strptime(meta.clean$Doc.Date[i],format = "%d/%m/%y"),"%m/%d/%Y")
}

}

meta.clean$Doc.Date <- strptime(meta.clean$Doc.Date,format = "%m/%d/%Y")

meta.clean <- meta.clean[order(meta.clean$Doc.Date),]

docs <- data.frame(content=character(0),dpub=character(0),stringsAsFactors = F)
for(i in 1:nrow(meta.clean)){
#for(i in 1:3){
pdf_prop <- pdftools::pdf_info(paste0("/home/STAFF/luczakma/RProjects/JFK/data/files/",tolower(gsub("\\s+"," ",gsub(" ","",meta.clean$File.Name[i])))))
tmp_files <- c()
for(k in 1:pdf_prop$pages){
tmp_files <- c(tmp_files,paste0("/home/STAFF/luczakma/RProjects/JFK/data/tmp/",k))
}

img_file <- pdftools::pdf_convert(paste0("/home/STAFF/luczakma/RProjects/JFK/data/files/",tolower(gsub("\\s+"," ",gsub(" ","",meta.clean$File.Name[i])))), format = 'tiff', pages = NULL, dpi = 700,filenames = tmp_files)

txt <- ""

for(j in 1:length(img_file)){
extract <- ocr(img_file[j], engine = tesseract("eng"))
#unlink(img_file)
txt <- paste(txt,extract,collapse = " ")
}

docs <- rbind(docs,data.frame(content=iconv(tolower(gsub("\\s+"," ",gsub("[[:punct:]]|[ ]"," ",txt))),to="UTF-8"),dpub=format(meta.clean$Doc.Date[i],"%Y/%m/%d"),stringsAsFactors = F),stringsAsFactors = F)
}

### END R DATA PROCESSING SCRIPT
Renewable Heat Premium Payment Scheme: Heat Pump Monitoring: Cleaned Data,...
datacatalogue.cessda.eu
beta.ukdataservice.ac.uk
Updated Nov 28, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lowe, R., University College London (2024). Renewable Heat Premium Payment Scheme: Heat Pump Monitoring: Cleaned Data, 2013-2015 [Dataset]. http://doi.org/10.5255/UKDA-SN-8151-1
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-8151-1
Dataset updated
Nov 28, 2024
Dataset provided by
Department of Energy and Climate Changehttp://www.gov.uk/decc
UCL Energy Institute
Authors
Lowe, R., University College London
Area covered
Great Britain
Variables measured
Families/households, National
Measurement technique
Physical measurements
Description
Abstract copyright UK Data Service and data collection copyright owner.

The need to develop the UK supply for domestic heat pumps (HPs) and to evaluate the empirical performance of HP systems in the field has led to the establishment of two major UK field trials of HPs (the first one took place between 2008 and 2012). These data were generated from the second field trial, established by the Department for Energy and Climate Change (DECC) in conjunction with the Renewable Heat Premium Payment (RHPP) grant scheme, which ran from 2011-2014 (note that the data included here cover the period October 2013-March 2015). Please note that this study contains the cleaned data - a raw version is available under SN 7955. See the RAPID-HPC statement below for information on data quality.

The RHPP policy provided subsidies for private householders, Registered Social Landlords and communities to install renewable heat measures in residential properties. Eligible measures included air and ground-source heat pumps, biomass boilers and solar thermal panels. Around 14,000 heat pumps were installed via this scheme. DECC (now BEIS) funded a detailed monitoring campaign, which covered 700 heat pumps (around 5% of the total). The aim of this monitoring campaign was to provide data to enable an assessment of the efficiencies of the heat pumps and to gain greater insight into their performance. The RHPP scheme was administered by the Energy Savings Trust (EST) who engaged the Buildings Research Establishment (BRE) to run the meter installation and data collection phases of the monitoring program. They collected data from 31 October 2013 to 31 March 2015. RHPP heat pumps were installed between 2009 and 2014. Since the start of the RHPP Scheme, the installation requirements set by MCS standards and processes have been updated. Further information about the RHPP scheme (which has now closed), including statistics, can be found on the Gov.uk Renewable Heat Premium Payment scheme statistics webpage.

DECC contracted the RAPID-HPC to analyse these data. The data provided to RAPID-HPC included physical monitoring data, and metadata describing the features of the heat pump installations and the dwellings in which they were installed. As the analysis has progressed, limitations with the underlying data have been identified. See RAPID-HPC's statement (below). (See also SN 7955 for a raw version of the data.)

RAPID-HPC's Statement on Data Anomalies and Interpretation, February 2016 (covered in the Detailed Analysis of Data Report and the spreadsheets included with this study)
The work of the RAPID-HPC consisted of cleaning the data, selection of sites and data for analysis, analysis, and the development of conclusions and interpretations. The monitoring data and contextual information are imperfect. Discussion of the data limitations are provided in the DECC Detailed analysis of data from heat pumps installed via the Renewable Heat Premium Payment Scheme report on the gov.uk website which is essential to the understanding of this data. RAPID-HPC has used a top-down rules-based approach to identifying data anomalies to clean the data. The advantages of this approach are that it is transparent and replicable and enables analysis of the very large (over 0.5 billion data points) dataset as a whole. It is important to note that the data was collected from domestic heat pumps installed via the RHPP policy. [RAPID-HPC] have not assessed the degree to which the heat pumps assessed are representative of the general sample of domestic heat pumps in the UK. Therefore, results from any analysis undertaken using these data should not be assumed to be representative of any other sample of heat pumps.

Downloading the data - using suitable zip software
Users should note that the download zip file for this study is over 1GB in size. The standard Windows system zip compression software is not able to unzip a file of this size completely and may mean that problems are encountered with some of the data or documentation files. Therefore, it is recommended that users install one of the following software packages in order to unzip the file:
7zip (free open source software for Windows, check 7zip website for Linux/Unix);
WinZip (free to try for Windows and Mac);
iZip (free software for Mac).

Main Topics:

The data cover a number of technical parameters from monitoring heat pump systems (such as flow temperature, heat output) in timeseries form (2-minutely data) for each of the monitored sites. Heat pump efficiencies (such as Seasonal Performance Factor) can be calculated from the variables present.
Bitter Creek Analysis Pedigree Data
catalog.data.gov
s.cnmilf.com
Updated Sep 25, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2022). Bitter Creek Analysis Pedigree Data [Dataset]. https://catalog.data.gov/dataset/bitter-creek-analysis-pedigree-data
Explore at:
Dataset updated
Sep 25, 2022
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
These data sets contain raw and processed data used in for analyses, figures, and tables in the Region 8 Memo: Characterization of chloride and conductivity levels in the Bitter Creek Watershed, WY. However, these data may be used for other analyses alone or in combination with other or new data. These data were used to assess whether chloride levels are naturally high in streams in the Bitter Creek, WY watershed and how chloride concentrations expected to protect 95 percent of aquatic genera in these streams compare to Wyoming’s chloride criteria applicable to the Bitter Creek watershed. Owing to the arid conditions, background conductivity and chloride levels were characterized for surface flow and ground water flow conditions. Natural chloride levels were found to be less than current water quality criteria for Wyoming. Although the report was prepared for USEPA Region 8 and OST, Office of Water, the report will be of interest to the WDEQ, Sweetwater County Conservation District, and the regulated community. No formal metadata standard was used. Pedigree.xlsx contains: 1. NOTES: Description of work and other worksheets. 2. Pedigree_Summary: Source files used to create figures and tables. 3. DataFiles: Data files used in the R code for creating the figures and tables 4. R_Script: Summary of the R scripts. 5. DataDictionary: Data file titles in all data files Folders: _Datasets Data file uploaded to Environmental Dataset Gateway "A list of subfolders: _R: Clean R scripts used to generate document figures and tables _Tables_Figures: Files generated from R script and used in the Region 6 memo R Code and Data: All additional files used for this project, including original files, intermediate files, extra output files, and extra functions the ""_R"" folder stores R scripts for input and output files and an R project file.. Users can open the R project and run R scripts directly from the ""_R"" folder or the XC95 folder by installing R, RStudio, and associated R packages."
g
Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...
datasearch.gesis.org
openicpsr.org
Updated Feb 19, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2018 [Dataset]. http://doi.org/10.3886/E105403
Explore at:
Unique identifier
https://doi.org/10.3886/E105403
Dataset updated
Feb 19, 2020
Dataset provided by
da|ra (Registration agency for social science and economic data)
Authors
Kaplan, Jacob
Description
For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.Version 4 release notes:Adds data for 2018Version 3 release notes:Adds data in the following formats: Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Version 2 release notes:Adds data for 2017.Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and previous metals, guns), the value of property stolen and and the value for property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported). All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
r
Street cleaning in City of Yarra
researchdata.edu.au
Updated Oct 2, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
City of Yarra (2019). Street cleaning in City of Yarra [Dataset]. https://researchdata.edu.au/street-cleaning-city-yarra/2980765
Explore at:
Dataset updated
Oct 2, 2019
Dataset provided by
data.gov.au
Authors
City of Yarra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered

Description
Our aim is to make Yarra a clean and pleasant place for our residents to live. This data asset has information about sweeping and loose litter removal across residential roads, kerbs and public open spaces within the Yarra municipality. The street cleansing details include cleaning date and time, suburb where the cleaning was done, category of cleaning, volume of litter removed and cleaning duration.\r \r While all due care has been taken to ensure the data asset is accurate and current, Yarra City Council does not warrant that this data is definitive nor free of error and does not accept responsibility for any loss, damage, claim, expense, cost or liability whatsoever arising from reliance upon information provided herein.\r \r Feedback on the data asset - including compliments, complaints and requests for more detail - is welcome.
Data from: Data and code from: A natural polymer material as a pesticide...
s.cnmilf.com
agdatacommons.nal.usda.gov
+2more
Updated Nov 2, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Agricultural Research Service (2024). Data and code from: A natural polymer material as a pesticide adjuvant for mitigating off-target drift and protecting pollinator health [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/data-and-code-from-a-natural-polymer-material-as-a-pesticide-adjuvant-for-mitigating-off-t
Explore at:
Dataset updated
Nov 2, 2024
Dataset provided by
Agricultural Research Servicehttps://www.ars.usda.gov/
Description
This dataset contains all data and code required to clean the data, fit the models, and create the figures and tables for the laboratory experiment portion of the manuscript:Kannan, N., Q. D. Read, and W. Zhang. 2024. A natural polymer material as a pesticide adjuvant for mitigating off-target drift and protecting pollinator health. Heliyon, in press. https://doi.org/10.1016/j.heliyon.2024.e35510.In this dataset, we archive results from several laboratory and field trials testing different adjuvants (spray additives) that are intended to reduce particle drift, increase particle size, and slow down the particles from pesticide spray nozzles. We fit statistical models to the droplet size and speed distribution data and statistically compare different metrics between the adjuvants (sodium alginate, polyacrylamide [PAM], and control without any adjuvants). The following files are included:RawDataPAMsodAlgOxfLsr.xlsx: Raw data for primary analysesOrganizedDataPaperRevision20240614.xlsx: Raw data to produce density plots presented in Figs. 8 and 9raw_data_readme.md: Markdown file with description of the raw data filesR_code_supplement.R: All R code required to reproduce primary analysesR_code_supplement2.R: R code required to produce density plots presented in Figs. 8 and 9Intermediate R output files are also included so that tables and figures can be recreated without having to rerun the data preprocessing, model fitting, and posterior estimation steps:pam_cleaned.RData: Data combined into clean R data frames for analysisvelocityscaledlogdiamfit.rds: Fitted brms model object for velocitylnormfitreduced.rds: Fitted brms model object for diameter distributionemm_con_velo_diam_draws.RData: Posterior distributions of estimated marginal means for velocityemm_con_draws.RData: Posterior distributions of estimated marginal means for diameter distributionThe following software and package versions were used:R version 4.3.1CmdStan version 2.33.1R packages:brms version 2.20.5cmdstanr version 0.5.3fitdistrplus version 1.1-11tidybayes version 3.0.4emmeans version 1.8.9
Z
Dataset and R script for the analysis in the article "Food waste between...
data.niaid.nih.gov
Updated Jul 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Simone Righi (2024). Dataset and R script for the analysis in the article "Food waste between environmental education, peers, and family influence. Insights from primary school students in Northern Italy", Journal of Cleaner Production [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7101522
Explore at:
Dataset updated
Jul 16, 2024
Dataset provided by
Simone Piras
Claudia Giordano
Marco Setti
Federico Banchelli
Simone Righi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We hereby publish the dataset (with metadata) and the R script (R Core team, 2018) used for implementing the analysis presented in the paper "Food waste between environmental education, peers, and family influence. Insights from primary school students in Northern Italy", Journal of Cleaner Production (Piras et al., 2023). The dataset is provided in csv format with semicolons as separators and "NA" for missing data. The dataset includes all the variables used in at least one of the models presented in the paper, either in the main text or in the Supplementary Material. Other variables gathered by means of the questionnaires included as Supplementary Material of the paper have been removed. The dataset includes inputted values for missing data on independent variables. These were inputted using two approaches: last observation carried forward (LOCF) - preferred when possible - and last observation carried backward (LOCB). The metadata are presented as a PDF file.
d
Replication Data for: The Political Violence Cycle
search.dataone.org
dataverse.harvard.edu
+1more
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Little, Andrew (2023). Replication Data for: The Political Violence Cycle [Dataset]. https://search.dataone.org/view/sha256%3Aa122ebe7657e656503663c226adcd8f33fcbae3c11dfe12f3c4edcf0f4cf0f46
Explore at:
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Little, Andrew
Description
Replication files and code for "The Political Violence Cycle", by S.P. Harish and Andrew Little, APSR. * Input files id & q-wide.dta scad.dta scad_lar.dta scad_nelda_basic.dta scad_nelda_basic_year.dta scad_nelda_basic_yearmonth.dta * Output files used to make graphs 439year.csv 452year.csv 475year.csv 501year.csv dist2elec30antigovt2.csv dist2elec30contest.csv dist2elec30NOcontest.csv dist2elec45progovt.csv dist2elec180contest.csv dist2elec180NOcontest.csv dist2elec365contest.csv dist2elec365NOcontest.csv * Code to clean data, create files to input to R for final graphs scad_prep.do nelda_prep.do scad_nelda_merge.do gengraphdata.do timing_replication.do (this runs the other do files in the correct order) * Code to produce final graphs code_for_figures.R (all but figure 5) simulated_comparative_statics_and_fig5.R (figure 5 and graphs in supplemental information)
r
Clean Energy Regulator FOI Disclosure Log
researchdata.edu.au
data.gov.au
+1more
Updated Dec 12, 2013
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Department of Finance (2013). Clean Energy Regulator FOI Disclosure Log [Dataset]. https://researchdata.edu.au/clean-energy-regulator-disclosure-log/2984875
Explore at:
Dataset updated
Dec 12, 2013
Dataset provided by
data.gov.au
Authors
Department of Finance
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
This dataset does not contain any resources hosted on data.gov.au. It provides a link to the location of the Clean Energy Regulator Freedom of Information (FOI) disclosure log to aide in information and data discovery. You can find the FOI Disclosure log here and the Agency's Information Publication Scheme here.\r \r The data.gov.au team is not responsible for the contents of the above linked pages.
n
Scripts for cleaning and analysis of data from SOFC experiment on...
4tu.edu.hpc.n-helix.com
zip
Updated Aug 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Berend van Veldhuizen (2024). Scripts for cleaning and analysis of data from SOFC experiment on inclination test-bench. [Dataset]. http://doi.org/10.4121/ed0a0cff-7af9-4d3a-baf7-aab5efe39bd1.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.4121/ed0a0cff-7af9-4d3a-baf7-aab5efe39bd1.v1
Dataset updated
Aug 27, 2024
Dataset provided by
4TU.ResearchData
Authors
Berend van Veldhuizen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
2023
Dataset funded by
European Commission
Description
This data set contains the scripts used for importing, trimming, cleaning, analysing, and plotting a large dataset of inclination experiments with an SOFC module. The measurement data is confidential, so it could not be published alongside the scripts. One row of dummy input data is published to illustrate the structure of the analysed data. The analysis is used for the journal paper "Experimental Evaluation of a Solid Oxide Fuel Cell System Exposed to Inclinations and Accelerations by Ship Motions".
The scripts contain:
- A script that reads the data, removes unusable data and transforms into analysable dataframes (Clean and trim.R)
- Two files to make a wide variety of plots (Plotting.R and Specificplots.R)
- A file data does a Gaussian Progress regression to estimate the degradation rate (Degradation estimation.R)
Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...
search.datacite.org
openicpsr.org
Updated 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jacob Kaplan (2019). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1991-2017 [Dataset]. http://doi.org/10.3886/e103500v5
Explore at:
Unique identifier
https://doi.org/10.3886/e103500v5
Dataset updated
2019
Dataset provided by
Inter-university Consortium for Political and Social Researchhttps://www.icpsr.umich.edu/web/pages/
DataCitehttps://www.datacite.org/
Authors
Jacob Kaplan
Description
For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.

Version 5 release notes:
Adds data in the following formats: SPSS, SAS, and Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Adds data for 1991.Fixes bug where bias motivation "anti-lesbian, gay, bisexual, or transgender, mixed group (lgbt)" was labeled "anti-homosexual (gay and lesbian)" prior to 2013 causing there to be two columns and zero values for years with the wrong label.All data is now directly from the FBI, not NACJD. The data initially comes as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here. https://github.com/jacobkap/crime_data. Version 4 release notes:
Adds data for 2017.Adds rows that submitted a zero-report (i.e. that agency reported no hate crimes in the year). This is for all years 1992-2017. Made changes to categorical variables (e.g. bias motivation columns) to make categories consistent over time. Different years had slightly different names (e.g. 'anti-am indian' and 'anti-american indian') which I made consistent.
Made the 'population' column which is the total population in that agency.

Version 3 release notes:
Adds data for 2016.Order rows by year (descending) and ORI.Version 2 release notes:
Fix bug where Philadelphia Police Department had incorrect FIPS county code. The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains information about hate crimes reported in the United States. Please note that the files are quite large and may take some time to open.

Each row indicates a hate crime incident for an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9 character Originating Identifier code), and incident number columns together. Each column is a variable related to that incident or to the reporting agency.
Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. It also includes the total number of victims, total number of offenders, and race of offenders (as a group). Finally, it has a number of columns indicating if the victim for each offense was a certain type of victim or not (e.g. individual victim, business victim religious victim, etc.).

The only changes I made to the data are the following. Minor changes to column names to make all column names 32 characters or fewer (so it can be saved in a Stata format), changed the name of some UCR offense codes (e.g. from "agg asslt" to "aggravated assault"), made all character values lower case, reordered columns. I also added state, county, and place FIPS code from the LEAIC (crosswalk) and generated incident month, weekday, and month-day variables from the incident date variable included in the original data.
o
Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...
openicpsr.org
Updated May 18, 2018
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1991-2019 [Dataset]. http://doi.org/10.3886/E103500V7
Explore at:
Unique identifier
https://doi.org/10.3886/E103500V7
Dataset updated
May 18, 2018
Dataset provided by
University of Pennsylvania
Authors
Jacob Kaplan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
1991 - 2019
Area covered
United States
Description
!!!WARNING~~~This dataset has a large number of flaws and is unable to properly answer many questions that people generally use it to answer, such as whether national hate crimes are changing (or at least they use the data so improperly that they get the wrong answer). A large number of people using this data (academics, advocates, reporting, US Congress) do so inappropriately and get the wrong answer to their questions as a result. Indeed, many published papers using this data should be retracted. Before using this data I highly recommend that you thoroughly read my book on UCR data, particularly the chapter on hate crimes (https://ucrbook.com/hate-crimes.html) as well as the FBI's own manual on this data. The questions you could potentially answer well are relatively narrow and generally exclude any causal relationships. ~~~WARNING!!!Version 8 release notes:Adds 2019 dataVersion 7 release notes:Changes release notes description, does not change data.Version 6 release notes:Adds 2018 dataVersion 5 release notes:Adds data in the following formats: SPSS, SAS, and Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Adds data for 1991.Fixes bug where bias motivation "anti-lesbian, gay, bisexual, or transgender, mixed group (lgbt)" was labeled "anti-homosexual (gay and lesbian)" prior to 2013 causing there to be two columns and zero values for years with the wrong label.All data is now directly from the FBI, not NACJD. The data initially comes as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. Version 4 release notes: Adds data for 2017.Adds rows that submitted a zero-report (i.e. that agency reported no hate crimes in the year). This is for all years 1992-2017. Made changes to categorical variables (e.g. bias motivation columns) to make categories consistent over time. Different years had slightly different names (e.g. 'anti-am indian' and 'anti-american indian') which I made consistent. Made the 'population' column which is the total population in that agency. Version 3 release notes: Adds data for 2016.Order rows by year (descending) and ORI.Version 2 release notes: Fix bug where Philadelphia Police Department had incorrect FIPS county code. The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains information about hate crimes reported in the United States. Please note that the files are quite large and may take some time to open.Each row indicates a hate crime incident for an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9 character Originating Identifier code), and incident number columns together. Each column is a variable related to that incident or to the reporting agency. Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. It also includes the total number of victims, total number of offenders, and race of offenders (as a group). Finally, it has a number of columns indicating if the victim for each offense was a certain type of victim or not (e.g. individual victim, business victim religious victim, etc.). The only changes I made to the data are the following. Minor changes to column names to make all column names 32 characters or fewer (so it can be saved in a Stata format), made all character values lower case, reordered columns. I also generated incident month, weekday, and month-day variables from the incident date variable included in the original data.
4
Research compendium for FAIR Sharing is Caring poster
data.4tu.nl
zip
Updated Sep 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bjørn Peare Bartholdy; Amanda Henry; Femke H. Reidsma (2023). Research compendium for FAIR Sharing is Caring poster [Dataset]. http://doi.org/10.4121/6402242a-bf1e-4d7b-8bf1-0955522bca2e.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.4121/6402242a-bf1e-4d7b-8bf1-0955522bca2e.v1
Dataset updated
Sep 20, 2023
Dataset provided by
4TU.ResearchData
Authors
Bjørn Peare Bartholdy; Amanda Henry; Femke H. Reidsma
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
# FAIR Sharing is Caring

Poster presented at ESHE 2023.

Contents:

- Analysis
+ data retrieval: `data-raw/DATASET.R`
+ data cleaning: `scripts/data-cleaning.R`
+ processed survey data: `data/paleoanth-clean.csv`
+ survey questions: `data/survey-questions.csv`

- Report
+ rendered: `poster-analysis.html` or https://bbartholdy.github.io/eshe2023-data-sharing/poster-analysis.html
+ source code: `poster-analysis.qmd`
+ references cited: `references.bib`
Oil Cleaner Import Data of Triple R Co Limited Exporter to USA
seair.co.in
Updated Feb 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seair Exim (2024). Oil Cleaner Import Data of Triple R Co Limited Exporter to USA [Dataset]. https://www.seair.co.in
Explore at:
.bin, .xml, .csv, .xlsAvailable download formats
Dataset updated
Feb 22, 2024
Dataset provided by
Seair Exim Solutions
Authors
Seair Exim
Area covered
United States
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
H
Replication Data for: "False Front: The Failed Promise of Presidential Power...
dataverse.harvard.edu
Updated Nov 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kenneth Lowande (2024). Replication Data for: "False Front: The Failed Promise of Presidential Power in a Polarized Age" [Dataset]. http://doi.org/10.7910/DVN/JMBGLE
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/JMBGLE
Dataset updated
Nov 18, 2024
Dataset provided by
Harvard Dataverse
Authors
Kenneth Lowande
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
-This repository reproduces figures and results reported in "False Front" -Results discussed in-text are reproduced in the figure file for the nearest figure -The replication R files can either be run on their own, or all ran at once via make.R -Most data files are summarized versions of the underlying dataset. A clean version of the executive action dataset is also located in this repository.
Monitoring COVID-19 Impact on Refugees in Ethiopia: High-Frequency Phone...
microdata.unhcr.org
datacatalog.ihsn.org
+2more
Updated Jul 5, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
World Bank-UNHCR Joint Data Center on Forced Displacement (JDC) (2022). Monitoring COVID-19 Impact on Refugees in Ethiopia: High-Frequency Phone Survey of Refugees 2020 - Ethiopia [Dataset]. https://microdata.unhcr.org/index.php/catalog/704
Explore at:
Dataset updated
Jul 5, 2022
Dataset provided by
United Nations High Commissioner for Refugeeshttp://www.unhcr.org/
World Bankhttp://worldbank.org/
Authors
World Bank-UNHCR Joint Data Center on Forced Displacement (JDC)
Time period covered
2020
Area covered
Ethiopia
Description
Abstract

The high-frequency phone survey of refugees monitors the economic and social impact of and responses to the COVID-19 pandemic on refugees and nationals, by calling a sample of households every four weeks. The main objective is to inform timely and adequate policy and program responses. Since the outbreak of the COVID-19 pandemic in Ethiopia, two rounds of data collection of refugees were completed between September and November 2020. The first round of the joint national and refugee HFPS was implemented between the 24 September and 17 October 2020 and the second round between 20 October and 20 November 2020.

Analysis unit

Household

Kind of data

Sample survey data [ssd]

Sampling procedure

The sample was drawn using a simple random sample without replacement. Expecting a high non-response rate based on experience from the HFPS-HH, we drew a stratified sample of 3,300 refugee households for the first round. More details on sampling methodology are provided in the Survey Methodology Document available for download as Related Materials.

Mode of data collection

Computer Assisted Telephone Interview [cati]

Research instrument

The Ethiopia COVID-19 High Frequency Phone Survey of Refugee questionnaire consists of the following sections:

Interview Information

Household Roster

Camp Information

Knowledge Regarding the Spread of COVID-19

Behaviour and Social Distancing - Access to Basic Services

Employment

Income Loss

Coping/Shocks

Social Relations

Food Security

Aid and Support/ Social Safety Nets.

A more detailed description of the questionnaire is provided in Table 1 of the Survey Methodology Document that is provided as Related Materials. Round 1 and 2 questionnaires available for download.

Cleaning operations

DATA CLEANING At the end of data collection, the raw dataset was cleaned by the Research team. This included formatting, and correcting results based on monitoring issues, enumerator feedback and survey changes. Data cleaning carried out is detailed below.

Variable naming and labeling: • Variable names were changed to reflect the lowercase question name in the paper survey copy, and a word or two related to the question. • Variables were labeled with longer descriptions of their contents and the full question text was stored in Notes for each variable. • “Other, specify” variables were named similarly to their related question, with “_other” appended to the name. • Value labels were assigned where relevant, with options shown in English for all variables, unless preloaded from the roster in Amharic.

Variable formatting: • Variables were formatted as their object type (string, integer, decimal, time, date, or datetime). • Multi-select variables were saved both in space-separated single-variables and as multiple binary variables showing the yes/no value of each possible response. • Time and date variables were stored as POSIX timestamp values and formatted to show Gregorian dates. • Location information was left in separate ID and Name variables, following the format of the incoming roster. IDs were formatted to include only the variable level digits, and not the higher-level prefixes (2-3 digits only.)
• Only consented surveys were kept in the dataset, and all personal information and internal survey variables were dropped from the clean dataset. • Roster data is separated from the main data set and kept in long-form but can be merged on the key variable (key can also be used to merge with the raw data). • The variables were arranged in the same order as the paper instrument, with observations arranged according to their submission time.

Backcheck data review: Results of the backcheck survey are compared against the originally captured survey results using the bcstats command in Stata. This function delivers a comparison of variables and identifies any discrepancies. Any discrepancies identified are then examined individually to determine if they are within reason.

Data appraisal

The following data quality checks were completed: • Daily SurveyCTO monitoring: This included outlier checks, skipped questions, a review of “Other, specify”, other text responses, and enumerator comments. Enumerator comments were used to suggest new response options or to highlight situations where existing options should be used instead. Monitoring also included a review of variable relationship logic checks and checks of the logic of answers. Finally, outliers in phone variables such as survey duration or the percentage of time audio was at a conversational level were monitored. A survey duration of close to 15 minutes and a conversation-level audio percentage of around 40% was considered normal. • Dashboard review: This included monitoring individual enumerator performance, such as the number of calls logged, duration of calls, percentage of calls responded to and percentage of non-consents. Non-consent reason rates and attempts per household were monitored as well. Duration analysis using R was used to monitor each module's duration and estimate the time required for subsequent rounds. The dashboard was also used to track overall survey completion and preview the results of key questions. • Daily Data Team reporting: The Field Supervisors and the Data Manager reported daily feedback on call progress, enumerator feedback on the survey, and any suggestions to improve the instrument, such as adding options to multiple choice questions or adjusting translations. • Audio audits: Audio recordings were captured during the consent portion of the interview for all completed interviews, for the enumerators' side of the conversation only. The recordings were reviewed for any surveys flagged by enumerators as having data quality concerns and for an additional random sample of 2% of respondents. A range of lengths were selected to observe edge cases. Most consent readings took around one minute, with some longer recordings due to questions on the survey or holding for the respondent. All reviewed audio recordings were completed satisfactorily. • Back-check survey: Field Supervisors made back-check calls to a random sample of 5% of the households that completed a survey in Round 1. Field Supervisors called these households and administered a short survey, including (i) identifying the same respondent; (ii) determining the respondent's position within the household; (iii) confirming that a member of the the data collection team had completed the interview; and (iv) a few questions from the original survey.
Data from: Designing data science workshops for data-intensive environmental...
data.niaid.nih.gov
zenodo.org
+1more
zip
Updated Dec 8, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Allison Theobold; Stacey Hancock; Sara Mannheimer (2020). Designing data science workshops for data-intensive environmental science research [Dataset]. http://doi.org/10.5061/dryad.7wm37pvp7
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.7wm37pvp7
Dataset updated
Dec 8, 2020
Dataset provided by
Montana State University
California State Polytechnic University
Authors
Allison Theobold; Stacey Hancock; Sara Mannheimer
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum,'' the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.

Methods Surveys from Carpentries style workshops the results of which are presented in the accompanying manuscript.

Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.

The surveys administered for the fall 2018, spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files. The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw. The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey. The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively. The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean. The summaries and visualizations presented in the manuscript are included in the analysis annotated RMarkdown file.