100+ datasets found

Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
Data cleaning EVI2
figshare.com
txt
Updated May 13, 2019
Cite
Geraldine Klarenberg (2019). Data cleaning EVI2 [Dataset]. http://doi.org/10.6084/m9.figshare.5327527.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.5327527.v1
Dataset updated
May 13, 2019
Dataset provided by
Figshare (http://figshare.com/)
Authors
Geraldine Klarenberg
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Scripts to clean EVI2 data obtained from the VIP lab (University of Arizona) website (https://vip.arizona.edu/about.php and https://vip.arizona.edu/viplab_data_explorer.php). Data obtained in 2012.
- outlier detection and removal/replacement
- alignment of 2 periods
The manuscript detailing the methods and resulting data sets has been accepted for publication in Nature Scientific Data (05/11/2019).
Instructions: use the R Markdown html file for instructions!
Code last manipulated and tested in R 3.4.3 ("Kite-Eating Tree")
Writing Clean Code in R Workshop
qubeshub.org
Updated Oct 15, 2019
Cite
Max Joseph; Leah Wasser (2019). Writing Clean Code in R Workshop [Dataset]. https://qubeshub.org/publications/1442
Dataset updated
Oct 15, 2019
Dataset provided by
QUBES
Authors
Max Joseph; Leah Wasser
Description
When working with data, you often spend most of your time cleaning it. Learn how to write more efficient code using the tidyverse in R.
R/r custom clean llc Import Company US
seair.co.in
Updated Jan 11, 2018
Cite
Seair Exim (2018). R/r custom clean llc Import Company US [Dataset]. https://www.seair.co.in
Available download formats: .bin, .xml, .csv, .xls
Dataset updated
Jan 11, 2018
Dataset provided by
Seair Exim Solutions
Authors
Seair Exim
Area covered
United States
Description
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
Cleaned NHANES 1988-2018
figshare.com
txt
Updated Feb 18, 2025
Cite
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.21743372.v9
Dataset updated
Feb 18, 2025
Dataset provided by
figshare
Authors
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The National Health and Nutrition Examination Survey (NHANES) provides data that have considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey:
- demographics (281 variables),
- dietary consumption (324 variables),
- physiological functions (1,040 variables),
- occupation (61 variables),
- questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),
- medications (29 variables),
- mortality information linked from the National Death Index (15 variables),
- survey weights (857 variables),
- environmental exposure biomarker measurements (598 variables), and
- chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).
csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets comprise 20 .csv formatted files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as the following: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
- "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
- "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
- "dictionary_drug_codes.csv" contains the dictionary for descriptors on the drug codes.
- "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.
R Data Record: For researchers who want to conduct their analysis in the R programming language, only cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file.
- "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.
- "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.
Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.
- "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
- "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
- "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.
- "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
R Code of Simulations
catalog.data.gov
cloud.csiss.gmu.edu
Updated Nov 12, 2020
Cite
U.S. EPA Office of Research and Development (ORD) (2020). R Code of Simulations [Dataset]. https://catalog.data.gov/dataset/r-code-of-simulations
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
The sims zip file contains R code and accompanying files needed to run the R code. Overall this code demonstrates the R code used in the study is fully functional, documented, and reproducible and that this code could reproduce the simulation results from the study with sufficient computing time. The code as presented is for a single simulated dataset and will produce estimates and confidence intervals produced by all the methods used within the study when run on that one dataset. This dataset is associated with the following publication: Nethery, R., F. Mealli, J. Sacks, and F. Dominici. Evaluation of the Health Impacts of the 1990 Clean Air Act Amendments Using Causal Inference and Machine Learning. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION. Taylor & Francis Group, London, UK, 1-12, (2020).
Replication Data for: realdata
search.dataone.org
Updated Nov 8, 2023
Cite
Xu, Ningning (2023). Replication Data for: realdata [Dataset]. http://doi.org/10.7910/DVN/AFZZVP
Unique identifier
https://doi.org/10.7910/DVN/AFZZVP
Dataset updated
Nov 8, 2023
Dataset provided by
Harvard Dataverse
Authors
Xu, Ningning
Description
(1) dataandpathway_eisner.R, dataandpathway_bordbar.R, dataandpathway_taware.R and dataandpathway_almutawa.R: functions and code to clean the realdata sets and obtain the annotation databases, which are saved as .RData files in subfolders Eisner, Bordbar, Taware and Al-Mutawa respectively.
(2) FWER_excess.R: functions to show the inflation of FWER when integrating multiple annotation databases and to generate Table 1.
(3) data_info.R: code to obtain Table 2 and Table 3.
(4) rejections_perdataset.R and triangulartable.R: functions to generate Table 4. The running time of rejections_perdataset.R is around 7 hours, so we saved the corresponding results as res_eisner.RData, res_bordbar.RData, res_taware.RData and res_almutawa.RData in subfolders Eisner, Bordbar, Taware and Al-Mutawa respectively.
(5) pathwaysizerank.R: code for generating Figure 4 based on res_eisner.RData from (h).
(6) iterationandtime_plot.R: code for generating Figure 5 based on "Al-Mutawa" data. The code is very time-consuming (nearly 5 days), so we saved the corresponding results and plotted them in the main manuscript with pgfplot.
A dataset for temporal analysis of files related to the JFK case
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Cite
Luczak-Roesch, Markus (2020). A dataset for temporal analysis of files related to the JFK case [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1042153
Dataset updated
Jan 24, 2020
Dataset authored and provided by
Luczak-Roesch, Markus
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset contains the content of the subset of all files with a correct publication date from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release). This content was extracted from the source PDF files using the R OCR libraries tesseract and pdftools.

The code to derive the dataset is given as follows:

# BEGIN R DATA PROCESSING SCRIPT

library(tesseract)
library(pdftools)

pdfs <- list.files("[path to your output directory containing all PDF files]")

# The meta file contains all metadata for the PDF files (e.g. publication date).
meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv", header = TRUE, sep = ",")

meta$Doc.Date <- as.character(meta$Doc.Date)

# Drop records whose date is empty or has the year "0000".
meta.clean <- meta[-which(meta$Doc.Date == "" | grepl("/0000", meta$Doc.Date)), ]

for (i in 1:nrow(meta.clean)) {
  # Replace every "00" in the date string with "01".
  meta.clean$Doc.Date[i] <- gsub("00", "01", meta.clean$Doc.Date[i])

  # Dates shorter than 10 characters are parsed as day/month/two-digit year
  # and rewritten as month/day/four-digit year.
  if (nchar(meta.clean$Doc.Date[i]) < 10) {
    meta.clean$Doc.Date[i] <- format(strptime(meta.clean$Doc.Date[i], format = "%d/%m/%y"), "%m/%d/%Y")
  }
}

meta.clean$Doc.Date <- strptime(meta.clean$Doc.Date, format = "%m/%d/%Y")

# Sort the records by document date.
meta.clean <- meta.clean[order(meta.clean$Doc.Date), ]

docs <- data.frame(content = character(0), dpub = character(0), stringsAsFactors = FALSE)

for (i in 1:nrow(meta.clean)) {
  # for (i in 1:3) {

  # One temporary image file name per PDF page.
  pdf_prop <- pdftools::pdf_info(paste0("[path to your output directory]/", tolower(meta.clean$File.Name[i])))
  tmp_files <- c()
  for (k in 1:pdf_prop$pages) {
    tmp_files <- c(tmp_files, paste0("/home/STAFF/luczakma/RProjects/JFK/data/tmp/", k))
  }

  # Render every page of the PDF as a TIFF image.
  img_file <- pdftools::pdf_convert(paste0("[path to your output directory]/", tolower(meta.clean$File.Name[i])), format = "tiff", pages = NULL, dpi = 700, filenames = tmp_files)

  # OCR each page image and append the text.
  txt <- ""
  for (j in 1:length(img_file)) {
    extract <- ocr(img_file[j], engine = tesseract("eng"))
    # unlink(img_file)
    txt <- paste(txt, extract, collapse = " ")
  }

  # Replace punctuation with spaces, collapse whitespace, lower-case the text,
  # and store it together with the publication date.
  docs <- rbind(docs, data.frame(content = iconv(tolower(gsub("\\s+", " ", gsub("[[:punct:]]|[ ]", " ", txt))), to = "UTF-8"), dpub = format(meta.clean$Doc.Date[i], "%Y/%m/%d"), stringsAsFactors = FALSE), stringsAsFactors = FALSE)
}

write.table(docs, "[path to your output directory]/documents.csv", row.names = FALSE)

# END R DATA PROCESSING SCRIPT
Data from: SBIR - STTR Data and Code for Collecting Wrangling and Using It
dataverse.harvard.edu
Updated Nov 5, 2018
Cite
Grant Allard (2018). SBIR - STTR Data and Code for Collecting Wrangling and Using It [Dataset]. http://doi.org/10.7910/DVN/CKTAZX
Unique identifier
https://doi.org/10.7910/DVN/CKTAZX
Dataset updated
Nov 5, 2018
Dataset provided by
Harvard Dataverse
Authors
Grant Allard
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Data set consisting of data joined for analyzing the SBIR/STTR program. Data consists of individual awards and agency-level observations. The R and Python code required for pulling, cleaning, and creating useful data sets has been included.
- Allard_Get and Clean Data.R: This file provides the code for getting, cleaning, and joining the numerous data sets that this project combined. This code is written in the R language and can be used in any R environment running R 3.5.1 or higher. If the other files in this Dataverse are downloaded to the working directory, then this R code will be able to replicate the original study without needing the user to update any file paths.
- Allard SBIR STTR WebScraper.py: This is the code I deployed to multiple Amazon EC2 instances to scrape data on each individual award in my data set, including the contact info and DUNS data.
- Allard_Analysis_APPAM SBIR project: Forthcoming
- Allard_Spatial Analysis: Forthcoming
- Awards_SBIR_df.Rdata: This unique data set consists of 89,330 observations spanning the years 1983 - 2018 and accounting for all eleven SBIR/STTR agencies. This data set consists of data collected from the Small Business Administration's Awards API and also unique data collected through web scraping by the author.
- Budget_SBIR_df.Rdata: 246 observations for 20 agencies across 25 years of their budget-performance in the SBIR/STTR program. Data was collected from the Small Business Administration using the Annual Reports Dashboard, the Awards API, and an author-designed web crawler of the websites of awards.
- Solicit_SBIR-df.Rdata: This data consists of observations of solicitations published by agencies for the SBIR program. This data was collected from the SBA Solicitations API.
Primary Sources
- Small Business Administration. "Annual Reports Dashboard," 2018. https://www.sbir.gov/awards/annual-reports.
- Small Business Administration. "SBIR Awards Data," 2018. https://www.sbir.gov/api.
- Small Business Administration. "SBIR Solicit Data," 2018. https://www.sbir.gov/api.
The fractured lab notebook: undergraduate and ecological data management...
search.dataone.org
Updated Nov 14, 2013
Cite
National Center for Ecological Analysis and Synthesis; Carly Strasser (2013). The fractured lab notebook: undergraduate and ecological data management training in the United States [Dataset]. https://search.dataone.org/view/knb.300.9
Dataset updated
Nov 14, 2013
Dataset provided by
Knowledge Network for Biocomplexity
Authors
National Center for Ecological Analysis and Synthesis; Carly Strasser
Time period covered
Mar 29, 2011 - May 25, 2011
Variables measured
Answer, Coding, EndDate, Question, R script, StartDate, First Name, Param name, Description, RespondentID, and 157 more
Description
Data presented here are those collected from a survey of Ecology professors at 48 undergraduate institutions to assess the current state of data management education. The following files have been uploaded:

Scripts (2):
1. DataCleaning_20120105.R is an R script for cleaning up data prior to analysis. This script removes spaces, substitutes text for codes, removes duplicate schools, and converts questions and answers from the survey into simpler parameter names, without any numbers, spaces, or symbols. This script is heavily annotated to assist the user of the file in understanding what is being done to the data files. The script produces the file cleandata_[date].Rdata, which is called in the file DataTrimming_20120105.R.
2. DataTrimming_20120105.R is an R script for trimming extraneous variables not used in final analyses. Some variables are combined as needed and NAs (no answers) are removed. The file is heavily annotated. It produces trimdata_[date].Rdata, which was imported into Excel for summary statistics.

Data files (3):
3. AdvancedSpreadsheet_20110526.csv is the output file from the SurveyMonkey online survey tool used for this project. It is a .csv sheet with the complete set of survey data, although some data (e.g., open-ended responses, institution names) are removed to prevent schools and/or instructors from being identifiable. This file is read into DataCleaning_20120105.R for cleaning and editing.
4. VariableRenaming_20110711.csv is called into the DataCleaning_20120105.R script to convert the questions and answers from the survey into simple parameter names, without any numbers, spaces, or symbols.
5. ParamTable.csv is a list of the parameter names used for analysis and the value codes. It can be used to understand outputs from the scripts above (cleandata_[date].Rdata and trimdata_[date].Rdata).
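Two of the cleaning steps described for DataCleaning_20120105.R (removing duplicate schools, and turning survey questions into parameter names without numbers, spaces, or symbols) can be illustrated with a short sketch. The original script is written in R and takes its names from the lookup table VariableRenaming_20110711.csv; the Python below is only a stand-in that derives names mechanically. The function names, the "school" column, and the sample question are hypothetical and do not come from the dataset.

```python
import re


def simplify_name(question: str) -> str:
    """Hypothetical helper: drop digits, spaces, and symbols, then join the words in CamelCase."""
    # Split on every run of characters that is not an ASCII letter.
    words = [w for w in re.split(r"[^A-Za-z]+", question) if w]
    return "".join(w[0].upper() + w[1:].lower() for w in words)


def drop_duplicate_schools(rows: list[dict]) -> list[dict]:
    """Keep the first row for each school, ignoring case and surrounding spaces.

    The "school" key is an assumed column name, not one taken from the survey file.
    """
    seen = set()
    unique = []
    for row in rows:
        key = row["school"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


if __name__ == "__main__":
    print(simplify_name("Q12. Do you teach data management?"))
    sample = [{"school": "Example College"}, {"school": " example college "}]
    print(len(drop_duplicate_schools(sample)))
```

A lookup table, as used in the original workflow, is the safer choice when names must stay stable across survey versions; the mechanical rule here only shows what "no numbers, spaces, or symbols" means in practice.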
Data from: Data and code from: A natural polymer material as a pesticide...
catalog.data.gov
s.cnmilf.com
Updated Apr 21, 2025
Cite
Agricultural Research Service (2025). Data and code from: A natural polymer material as a pesticide adjuvant for mitigating off-target drift and protecting pollinator health [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-a-natural-polymer-material-as-a-pesticide-adjuvant-for-mitigating-off-t
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service
Description
This dataset contains all data and code required to clean the data, fit the models, and create the figures and tables for the laboratory experiment portion of the manuscript:
Kannan, N., Q. D. Read, and W. Zhang. 2024. A natural polymer material as a pesticide adjuvant for mitigating off-target drift and protecting pollinator health. Heliyon, in press. https://doi.org/10.1016/j.heliyon.2024.e35510.
In this dataset, we archive results from several laboratory and field trials testing different adjuvants (spray additives) that are intended to reduce particle drift, increase particle size, and slow down the particles from pesticide spray nozzles. We fit statistical models to the droplet size and speed distribution data and statistically compare different metrics between the adjuvants (sodium alginate, polyacrylamide [PAM], and control without any adjuvants).
The following files are included:
- RawDataPAMsodAlgOxfLsr.xlsx: Raw data for primary analyses
- OrganizedDataPaperRevision20240614.xlsx: Raw data to produce density plots presented in Figs. 8 and 9
- raw_data_readme.md: Markdown file with description of the raw data files
- R_code_supplement.R: All R code required to reproduce primary analyses
- R_code_supplement2.R: R code required to produce density plots presented in Figs. 8 and 9
Intermediate R output files are also included so that tables and figures can be recreated without having to rerun the data preprocessing, model fitting, and posterior estimation steps:
- pam_cleaned.RData: Data combined into clean R data frames for analysis
- velocityscaledlogdiamfit.rds: Fitted brms model object for velocity
- lnormfitreduced.rds: Fitted brms model object for diameter distribution
- emm_con_velo_diam_draws.RData: Posterior distributions of estimated marginal means for velocity
- emm_con_draws.RData: Posterior distributions of estimated marginal means for diameter distribution
The following software and package versions were used:
- R version 4.3.1
- CmdStan version 2.33.1
- R packages:
  - brms version 2.20.5
  - cmdstanr version 0.5.3
  - fitdistrplus version 1.1-11
  - tidybayes version 3.0.4
  - emmeans version 1.8.9
USGS-CMG time-series data: GLOBEC_GSC - 490 - 4901-a
geoport.usgs.esipfed.org
Updated Apr 11, 2017
Cite
Rich Signell (2017). USGS-CMG time-series data: GLOBEC_GSC - 490 - 4901-a [Dataset]. https://geoport.usgs.esipfed.org/erddap/info/4901-a/index.html
Dataset updated
Apr 11, 2017
Dataset provided by
Ellyn Montgomery
Authors
Rich Signell
Time period covered
Jan 15, 1997 - Aug 17, 1997
Variables measured
crs, east, temp, time, north, rotor, vdir_1, vspd_1, bearing, altitude, and 4 more
Description
USGS-CMG time-series data from the GLOBEC Great South Channel Circulation Experiment project, mooring 490 and package 4901-a. A moored array program to investigate the recirculation of water and plankton around Georges Bank.
_NCProperties=version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.17
cdm_data_type=TimeSeries
cdm_timeseries_variables=latitude, longitude, altitude, feature_type_instance
contributor_name=R. Schlitz
contributor_role=principalInvestigator
Conventions=CF-1.6,ACDD-1.3, COARDS
COORD_SYSTEM=GEOGRAPHICAL
CREATION_DATE=28-May-2008 14:23:41
DATA_ORIGIN=USGS
DATA_TYPE=TIME
date_metadata_modified=2017-04-11T22:03:00Z
DESCRIPT=VACM-C, GREAT SOUTH CHANNEL SITE 7, CLEAN DATA: NOT SCRUBBED
Easternmost_Easting=-68.28616
featureType=TimeSeries
geospatial_bounds=POINT(-68.28616333007812 40.5168342590332)
geospatial_bounds_crs=EPSG:4326
geospatial_lat_max=40.51683
geospatial_lat_min=40.51683
geospatial_lat_resolution=0
geospatial_lat_units=degrees_north
geospatial_lon_max=-68.28616
geospatial_lon_min=-68.28616
geospatial_lon_resolution=0
geospatial_lon_units=degrees_east
geospatial_vertical_max=-5.0
geospatial_vertical_min=-5.0
geospatial_vertical_positive=up
geospatial_vertical_resolution=0
geospatial_vertical_units=m
grid_mapping_epsg_code=EPSG:4326
grid_mapping_inverse_flattening=298.257223563
grid_mapping_long_name=http://www.opengis.net/def/crs/EPSG/0/4326
grid_mapping_name=latitude_longitude
grid_mapping_semi_major_axis=6378137.0
history=Fri Nov 1 20:17:32 2019: ncatted -a project,global,a,c,, CMG_Portal GLOBEC_GSC/4901-a.nc corrected sign of lon using fix_poslon.m: 2017-04-11T22:03:00Z - pyaxiom - File created using pyaxiom
id=4901-a
infoUrl=https://stellwagen.er.usgs.gov/
institution=USGS Coastal and Marine Geology Program
keywords_vocabulary=GCMD Science Keywords
latitude=40.516834
longitude=-68.28616
magnetic_variation=-17.0
MOORING=490
naming_authority=gov.usgs.cmgp
ncei_template_version=NCEI_NetCDF_TimeSeries_Orthogonal_Template_v2.0
NCO=netCDF Operators version 4.8.1 (Homepage = http://nco.sf.net, Code = https://github.com/nco/nco)
Northernmost_Northing=40.51683
original_filename=4901-a.nc
original_folder=GLOBEC_GSC
project=U.S. Geological Survey Oceanographic Time-Series Data, CMG_Portal
project_summary=A moored array program to investigate the recirculation of water and plankton around Georges Bank.
project_title=GLOBEC Great South Channel Circulation Experiment
sampling_interval=450
source=USGS
sourceUrl=(local files)
Southernmost_Northing=40.51683
standard_name_vocabulary=CF Standard Name Table v29
start_time=97- I -15 19.33.45
stop_time=97-VIII-17 09.56.15
subsetVariables=latitude, longitude, altitude, feature_type_instance
time_coverage_duration=PT18454950S
time_coverage_end=1997-08-17T09:56:15Z
time_coverage_start=1997-01-15T19:33:45Z
WATER_DEPTH=101
water_depth=101.0
Westernmost_Easting=-68.28616
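As a quick consistency check on the attributes above, the ISO 8601 duration in time_coverage_duration can be recomputed from time_coverage_start and time_coverage_end. The sketch below is not part of the dataset; it assumes that sampling_interval is given in seconds, which the metadata does not state explicitly.

```python
from datetime import datetime

# Timestamps copied from the time_coverage_start / time_coverage_end attributes.
FORMAT = "%Y-%m-%dT%H:%M:%SZ"
start = datetime.strptime("1997-01-15T19:33:45Z", FORMAT)
end = datetime.strptime("1997-08-17T09:56:15Z", FORMAT)

# Elapsed time in whole seconds, written as an ISO 8601 duration.
seconds = int((end - start).total_seconds())
duration = f"PT{seconds}S"

# Assumption: sampling_interval=450 is expressed in seconds.
intervals, remainder = divmod(seconds, 450)

print(duration)              # PT18454950S, matching time_coverage_duration
print(intervals, remainder)  # 41011 0
```

Under that assumption the coverage period spans a whole number of sampling intervals, which is what one would expect from a regularly sampled mooring record.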
Scripts for cleaning and analysis of data from SOFC experiment on...
data.4tu.nl
zip
Updated Aug 27, 2024
Cite
Berend van Veldhuizen (2024). Scripts for cleaning and analysis of data from SOFC experiment on inclination test-bench. [Dataset]. http://doi.org/10.4121/ed0a0cff-7af9-4d3a-baf7-aab5efe39bd1.v1
Available download formats: zip
Unique identifier
https://doi.org/10.4121/ed0a0cff-7af9-4d3a-baf7-aab5efe39bd1.v1
Dataset updated
Aug 27, 2024
Dataset provided by
4TU.ResearchData
Authors
Berend van Veldhuizen
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Time period covered
2023
Dataset funded by
European Commission
Description
This data set contains the scripts used for importing, trimming, cleaning, analysing, and plotting a large dataset of inclination experiments with an SOFC module. The measurement data is confidential, so it could not be published alongside the scripts. One row of dummy input data is published to illustrate the structure of the analysed data. The analysis is used for the journal paper "Experimental Evaluation of a Solid Oxide Fuel Cell System Exposed to Inclinations and Accelerations by Ship Motions".
The scripts contain:
- A script that reads the data, removes unusable data and transforms into analysable dataframes (Clean and trim.R)
- Two files to make a wide variety of plots (Plotting.R and Specificplots.R)
- A file that runs a Gaussian process regression to estimate the degradation rate (Degradation estimation.R)
ANHUI CLEAN ENERGY CO.,LTDADDRESS NO318 GUANYIN R.|Full export Customs Data...
tradeindata.com
Updated Mar 2, 2022
Cite
tradeindata (2022). ANHUI CLEAN ENERGY CO.,LTDADDRESS NO318 GUANYIN R.|Full export Customs Data Records|tradeindata [Dataset]. https://www.tradeindata.com/supplier_detail/?id=d68b2daa8f863333226e966775c979c9
Dataset updated
Mar 2, 2022
Dataset authored and provided by
tradeindata
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Customs records are available for ANHUI CLEAN ENERGY CO.,LTDADDRESS NO318 GUANYIN R. Learn about its importers, supply capabilities, and the countries to which it supplies goods.
Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...
datasearch.gesis.org
openicpsr.org
Updated Feb 19, 2020
Cite
Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017 [Dataset]. http://doi.org/10.3886/E105403V3
Unique identifier
https://doi.org/10.3886/E105403V3
Dataset updated
Feb 19, 2020
Dataset provided by
da|ra (Registration agency for social science and economic data)
Authors
Kaplan, Jacob
Description
For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.
Version 3 release notes:
- Adds data in the following formats: Excel.
- Changes project name to avoid confusing this data for the ones done by NACJD.
Version 2 release notes:
- Adds data for 2017.
- Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.
Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and precious metals, guns), the value of property stolen and the value of property recovered are provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported).
All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.
There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."
For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.
To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
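The rule for suspicious values described above can be sketched in a few lines. The author did this work in R (see the linked repository); the Python below is only an illustration of the stated rule, with `None` standing in for NA. The function name and the sample column names are hypothetical and are not taken from the data set.

```python
# Values listed in the description as common data entry errors:
# 1000..9000, 10000..90000, 100000, and 99942.
SUSPECT_VALUES = (
    {k * 1000 for k in range(1, 10)}
    | {k * 10000 for k in range(1, 10)}
    | {100000, 99942}
)

# Only columns whose names begin with these prefixes are cleaned;
# columns starting with "value" are deliberately left untouched.
CLEANED_PREFIXES = ("offenses", "auto")


def blank_suspect_values(row: dict) -> dict:
    """Return a copy of `row` with suspect values in the cleaned columns set to None."""
    cleaned = {}
    for column, value in row.items():
        if column.startswith(CLEANED_PREFIXES) and value in SUSPECT_VALUES:
            cleaned[column] = None
        else:
            cleaned[column] = value
    return cleaned


if __name__ == "__main__":
    # Hypothetical column names, used only to exercise the rule.
    example = {"offenses_robbery": 5000, "auto_stolen": 12, "value_jewelry": 5000}
    print(blank_suspect_values(example))
```

Because the match is on exact values, a legitimate count that happens to equal one of the listed numbers is also blanked, which is consistent with the description's caveat that the step reduces but does not eliminate errors.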
Dataset: Screening Causal Assessment of Brook Trout Occurrence and Road...
s.cnmilf.com
catalog.data.gov
Updated Apr 25, 2025
Cite
U.S. EPA Office of Research and Development (ORD) (2025). Dataset: Screening Causal Assessment of Brook Trout Occurrence and Road Runoff 20250218 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/dataset-screening-causal-assessment-of-brook-trout-occurrence-and-road-runoff-20250218
Dataset updated
Apr 25, 2025
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
Pedigree of all data and processing included in the manuscript. Open the zip file, then access the pedigree folder for the file describing all other folders, links, and the data dictionary.
Items:
- NOTES: Description of work and other worksheets.
- Pedigree: Summary source files used to create figures and tables.
- DataFiles: Data files used in the R code for creating the figures and tables.
- DataDictionary: Data file titles in all data files
- Data: Data file uploaded to Science Hub
- Output: Files generated from R scripts
- Plot: Plots generated from R scripts and other software
- R_Scripts: Clean R scripts used to analyze the data, generate figures and tables
- Result: Tables generated from R scripts
Alaska Geochemical Database Version 3.0 (AGDB3) including best value data...
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). Alaska Geochemical Database Version 3.0 (AGDB3) including best value data compilations for rock, sediment, soil, mineral, and concentrate sample media [Dataset]. https://catalog.data.gov/dataset/alaska-geochemical-database-version-3-0-agdb3-including-best-value-data-compilations-for-r
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
Alaska
Description
The Alaska Geochemical Database Version 3.0 (AGDB3) contains new geochemical data compilations in which each geologic material sample has one best value determination for each analyzed species, greatly improving speed and efficiency of use. Like the Alaska Geochemical Database Version 2.0 before it, the AGDB3 was created and designed to compile and integrate geochemical data from Alaska to facilitate geologic mapping, petrologic studies, mineral resource assessments, definition of geochemical baseline values and statistics, element concentrations and associations, environmental impact assessments, and studies in public health associated with geology. This relational database, created from databases and published datasets of the U.S. Geological Survey (USGS), Atomic Energy Commission National Uranium Resource Evaluation (NURE), Alaska Division of Geological & Geophysical Surveys (DGGS), U.S. Bureau of Mines, and U.S. Bureau of Land Management serves as a data archive in support of Alaskan geologic and geochemical projects and contains data tables in several different formats describing historical and new quantitative and qualitative geochemical analyses. The analytical results were determined by 112 laboratory and field analytical methods on 396,343 rock, sediment, soil, mineral, heavy-mineral concentrate, and oxalic acid leachate samples. Most samples were collected by personnel of these agencies and analyzed in agency laboratories or, under contracts, in commercial analytical laboratories. These data represent analyses of samples collected as part of various agency programs and projects from 1938 through 2017. In addition, mineralogical data from 18,138 nonmagnetic heavy-mineral concentrate samples are included in this database. 
The AGDB3 includes historical geochemical data archived in the USGS National Geochemical Database (NGDB) and NURE National Uranium Resource Evaluation-Hydrogeochemical and Stream Sediment Reconnaissance databases, and in the DGGS Geochemistry database. Retrievals from these databases were used to generate most of the AGDB data set. These data were checked for accuracy regarding sample location, sample media type, and analytical methods used. In other words, the data of the AGDB3 supersede data in the AGDB and the AGDB2, but the background about the data in these two earlier versions is needed by users of the current AGDB3 to understand what has been done to amend, clean up, correct, and format these data. Corrections were entered, resulting in a significantly improved Alaska geochemical dataset, the AGDB3. Data that were not previously in these databases because the data predate the earliest agency geochemical databases, or were once excluded for programmatic reasons, are included here in the AGDB3 and will be added to the NGDB and Alaska Geochemistry. The AGDB3 data provided here are the most accurate and complete to date and should be useful for a wide variety of geochemical studies. The AGDB3 data provided in the online version of the database may be updated or changed periodically.
Data from: Designing data science workshops for data-intensive environmental...
data.niaid.nih.gov
zenodo.org
zip
Updated Dec 8, 2020
Cite
Allison Theobold; Stacey Hancock; Sara Mannheimer (2020). Designing data science workshops for data-intensive environmental science research [Dataset]. http://doi.org/10.5061/dryad.7wm37pvp7
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.7wm37pvp7
Dataset updated
Dec 8, 2020
Dataset provided by
California State Polytechnic University
Montana State University
Authors
Allison Theobold; Stacey Hancock; Sara Mannheimer
License
https://spdx.org/licenses/CC0-1.0.html
Description
Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum," the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.

Methods: Surveys from Carpentries-style workshops, the results of which are presented in the accompanying manuscript.

Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.

The surveys administered for the fall 2018, spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files. The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw. The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey. The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively. The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean. The summaries and visualizations presented in the manuscript are included in the analysis annotated RMarkdown file.
OpenML R Bot Benchmark Data (final subset)
figshare.com
application/gzip
Updated May 18, 2018
Cite
Daniel Kühn; Philipp Probst; Janek Thomas; Bernd Bischl (2018). OpenML R Bot Benchmark Data (final subset) [Dataset]. http://doi.org/10.6084/m9.figshare.5882230.v2
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.5882230.v2
Dataset updated
May 18, 2018
Dataset provided by
Figshare (http://figshare.com/)
Authors
Daniel Kühn; Philipp Probst; Janek Thomas; Bernd Bischl
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a clean subset of the data that was created by the OpenML R Bot that executed benchmark experiments on binary classification tasks of the OpenML100 benchmarking suite with six R algorithms: glmnet, rpart, kknn, svm, ranger and xgboost. The hyperparameters of these algorithms were drawn randomly. In total it contains more than 2.6 million benchmark experiments and can be used by other researchers. The subset was created by taking 500000 results of each learner (except for kknn, for which only 1140 results are available). The csv-file for each learner is a table that for each benchmark experiment has a row that contains: OpenML-Data ID, hyperparameter values, performance measures (AUC, accuracy, brier score), runtime, scimark (runtime reference of the machine), and some meta features of the dataset. OpenMLRandomBotResults.RData (format for R) contains all data in separate tables for the results, the hyperparameters, the meta features, the runtime, the scimark results and reference results.
Trace Metal clean rosette bottle hydrographic and nutrient data from R/V...
search.dataone.org
bco-dmo.org
Updated Dec 5, 2021
Cite
Ken Johnson; Dr Zanna Chase (2021). Trace Metal clean rosette bottle hydrographic and nutrient data from R/V Roger Revelle cruise DRFT08RR from the Southern Ocean, south of New Zealand in 2002 (SOFeX project) [Dataset]. https://search.dataone.org/view/sha256:b59db2cbdbba6c35b0f7b99c2d2c3f54455d566000d044ccb78d2700fece2a5d
Dataset updated
Dec 5, 2021
Dataset provided by
Biological and Chemical Oceanography Data Management Office (BCO-DMO)
Authors
Ken Johnson; Dr Zanna Chase
Area covered
Southern Ocean
Description
Trace Metal Clean Rosette bottle hydrographic and nutrient data