100+ datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. Data cleaning EVI2

    • figshare.com
    txt
    Updated May 13, 2019
    Cite
    Geraldine Klarenberg (2019). Data cleaning EVI2 [Dataset]. http://doi.org/10.6084/m9.figshare.5327527.v1
    Available download formats: txt
    Dataset updated
    May 13, 2019
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Geraldine Klarenberg
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Scripts to clean EVI2 data obtained from the VIP lab (University of Arizona) website (https://vip.arizona.edu/about.php and https://vip.arizona.edu/viplab_data_explorer.php). Data obtained in 2012.

    - outlier detection and removal/replacement
    - alignment of 2 periods

    The manuscript detailing the methods and resulting data sets has been accepted for publication in Nature Scientific Data (05/11/2019).

    Instructions: use the R Markdown html file for instructions!

    Code last manipulated and tested in R 3.4.3 ("Kite-Eating Tree").

  3. Writing Clean Code in R Workshop

    • qubeshub.org
    Updated Oct 15, 2019
    Cite
    Max Joseph; Leah Wasser (2019). Writing Clean Code in R Workshop [Dataset]. https://qubeshub.org/publications/1442
    Dataset updated
    Oct 15, 2019
    Dataset provided by
    QUBES
    Authors
    Max Joseph; Leah Wasser
    Description

    When working with data, you often spend most of your time cleaning it. Learn how to write more efficient code using the tidyverse in R.

  4. R/r custom clean llc Import Company US

    • seair.co.in
    Updated Jan 11, 2018
    Cite
    Seair Exim (2018). R/r custom clean llc Import Company US [Dataset]. https://www.seair.co.in
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    Jan 11, 2018
    Dataset provided by
    Seair Exim Solutions
    Authors
    Seair Exim
    Area covered
    United States
    Description

    Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.

  5. Cleaned NHANES 1988-2018

    • figshare.com
    txt
    Updated Feb 18, 2025
    Cite
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
    Available download formats: txt
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    figshare
    Authors
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey:

    - demographics (281 variables),
    - dietary consumption (324 variables),
    - physiological functions (1,040 variables),
    - occupation (61 variables),
    - questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),
    - medications (29 variables),
    - mortality information linked from the National Death Index (15 variables),
    - survey weights (857 variables),
    - environmental exposure biomarker measurements (598 variables), and
    - chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

    csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file.

    - The curated NHANES datasets comprise 20 .csv files, two for each module: one uncleaned version and one cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
    - "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
    - "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
    - "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes.
    - "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

    R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded, as a .zip file that includes an .RData file and an .R file.

    - "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.
    - "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.

    Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R Markdown files (.Rmd). We recommend going through the tutorials in order.

    - "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
    - "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
    - "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.
    - "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.

  6. R Code of Simulations

    • catalog.data.gov
    • cloud.csiss.gmu.edu
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). R Code of Simulations [Dataset]. https://catalog.data.gov/dataset/r-code-of-simulations
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The sims zip file contains R code and the accompanying files needed to run it. Overall, this code demonstrates that the R code used in the study is fully functional, documented, and reproducible, and that it could reproduce the simulation results from the study given sufficient computing time. The code as presented is for a single simulated dataset and, when run on that one dataset, will produce the estimates and confidence intervals from all the methods used in the study. This dataset is associated with the following publication: Nethery, R., F. Mealli, J. Sacks, and F. Dominici. Evaluation of the Health Impacts of the 1990 Clean Air Act Amendments Using Causal Inference and Machine Learning. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION. Taylor & Francis Group, London, UK, 1-12, (2020).

  7. Replication Data for: realdata

    • search.dataone.org
    Updated Nov 8, 2023
    Cite
    Xu, Ningning (2023). Replication Data for: realdata [Dataset]. http://doi.org/10.7910/DVN/AFZZVP
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Xu, Ningning
    Description

    (1) dataandpathway_eisner.R, dataandpathway_bordbar.R, dataandpathway_taware.R and dataandpathway_almutawa.R: functions and code to clean the real data sets and obtain the annotation databases, which are saved as .RData files in subfolders Eisner, Bordbar, Taware and Al-Mutawa respectively.
    (2) FWER_excess.R: functions to show the inflation of FWER when integrating multiple annotation databases and to generate Table 1.
    (3) data_info.R: code to obtain Table 2 and Table 3.
    (4) rejections_perdataset.R and triangulartable.R: functions to generate Table 4. The running time of rejections_perdataset.R is around 7 hours, so we saved the corresponding results as res_eisner.RData, res_bordbar.RData, res_taware.RData and res_almutawa.RData in subfolders Eisner, Bordbar, Taware and Al-Mutawa respectively.
    (5) pathwaysizerank.R: code for generating Figure 4 based on res_eisner.RData from (h).
    (6) iterationandtime_plot.R: code for generating Figure 5 based on "Al-Mutawa" data. The code is very time-consuming (nearly 5 days), so we saved the corresponding results and plotted them in the main manuscript with pgfplot.

  8. A dataset for temporal analysis of files related to the JFK case

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Luczak-Roesch, Markus (2020). A dataset for temporal analysis of files related to the JFK case [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1042153
    Dataset updated
    Jan 24, 2020
    Dataset authored and provided by
    Luczak-Roesch, Markus
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains the content of the subset of all files with a correct publication date from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release). This content was extracted from the source PDF files using the R OCR libraries tesseract and pdftools.

    The code to derive the dataset is given as follows:

    BEGIN R DATA PROCESSING SCRIPT

    library(tesseract)
    library(pdftools)

    pdfs <- list.files("[path to your output directory containing all PDF files]")

    # the meta file containing all metadata for the PDF files (e.g. publication date)
    meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv", header = TRUE, sep = ",")

    meta$Doc.Date <- as.character(meta$Doc.Date)

    # drop records with an empty date or a "0000" year; the %in% TRUE guard
    # keeps all rows when nothing matches (meta[-which(...), ] would drop them all)
    drop <- meta$Doc.Date == "" | grepl("/0000", meta$Doc.Date)
    meta.clean <- meta[!(drop %in% TRUE), ]

    for (i in seq_len(nrow(meta.clean))) {
      # replace every "00" with "01" (presumably placeholder days/months)
      meta.clean$Doc.Date[i] <- gsub("00", "01", meta.clean$Doc.Date[i])

      # expand short dates with two-digit years to the full %m/%d/%Y form
      if (nchar(meta.clean$Doc.Date[i]) < 10) {
        meta.clean$Doc.Date[i] <- format(strptime(meta.clean$Doc.Date[i], format = "%d/%m/%y"), "%m/%d/%Y")
      }
    }

    meta.clean$Doc.Date <- strptime(meta.clean$Doc.Date, format = "%m/%d/%Y")

    # sort documents chronologically
    meta.clean <- meta.clean[order(meta.clean$Doc.Date), ]

    docs <- data.frame(content = character(0), dpub = character(0), stringsAsFactors = FALSE)

    for (i in seq_len(nrow(meta.clean))) {
      # for (i in 1:3) {   # apparently a commented-out test loop over the first three files

      pdf_path <- paste0("[path to your output directory]/", tolower(meta.clean$File.Name[i]))
      pdf_prop <- pdftools::pdf_info(pdf_path)

      # one temporary image file name per page
      tmp_files <- paste0("/home/STAFF/luczakma/RProjects/JFK/data/tmp/", seq_len(pdf_prop$pages))

      img_file <- pdftools::pdf_convert(pdf_path, format = "tiff", pages = NULL, dpi = 700, filenames = tmp_files)

      # OCR each page image and concatenate the extracted text
      txt <- ""
      for (j in seq_along(img_file)) {
        extract <- ocr(img_file[j], engine = tesseract("eng"))
        # unlink(img_file)
        txt <- paste(txt, extract, collapse = " ")
      }

      # strip punctuation, collapse whitespace, lower-case, and append with the publication date
      docs <- rbind(docs,
                    data.frame(content = iconv(tolower(gsub("\\s+", " ", gsub("[[:punct:]]|[ ]", " ", txt))), to = "UTF-8"),
                               dpub = format(meta.clean$Doc.Date[i], "%Y/%m/%d"),
                               stringsAsFactors = FALSE),
                    stringsAsFactors = FALSE)
    }

    write.table(docs, "[path to your output directory]/documents.csv", row.names = FALSE)

    END R DATA PROCESSING SCRIPT

  9. Data from: SBIR - STTR Data and Code for Collecting Wrangling and Using It

    • dataverse.harvard.edu
    Updated Nov 5, 2018
    Cite
    Grant Allard (2018). SBIR - STTR Data and Code for Collecting Wrangling and Using It [Dataset]. http://doi.org/10.7910/DVN/CKTAZX
    Dataset updated
    Nov 5, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Grant Allard
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Data set consisting of data joined for analyzing the SBIR/STTR program. Data consists of individual awards and agency-level observations. The R and Python code required for pulling, cleaning, and creating useful data sets has been included.

    - Allard_Get and Clean Data.R: This file provides the code for getting, cleaning, and joining the numerous data sets that this project combined. This code is written in the R language and can be used in any R environment running R 3.5.1 or higher. If the other files in this Dataverse are downloaded to the working directory, then this R code will be able to replicate the original study without needing the user to update any file paths.
    - Allard SBIR STTR WebScraper.py: This is the code I deployed to multiple Amazon EC2 instances to scrape data on each individual award in my data set, including the contact info and DUNS data.
    - Allard_Analysis_APPAM SBIR project: Forthcoming
    - Allard_Spatial Analysis: Forthcoming
    - Awards_SBIR_df.Rdata: This unique data set consists of 89,330 observations spanning the years 1983 - 2018 and accounting for all eleven SBIR/STTR agencies. This data set consists of data collected from the Small Business Administration's Awards API and also unique data collected through web scraping by the author.
    - Budget_SBIR_df.Rdata: 246 observations for 20 agencies across 25 years of their budget-performance in the SBIR/STTR program. Data was collected from the Small Business Administration using the Annual Reports Dashboard, the Awards API, and an author-designed web crawler of the websites of awards.
    - Solicit_SBIR-df.Rdata: This data consists of observations of solicitations published by agencies for the SBIR program. This data was collected from the SBA Solicitations API.

    Primary Sources

    - Small Business Administration. "Annual Reports Dashboard," 2018. https://www.sbir.gov/awards/annual-reports.
    - Small Business Administration. "SBIR Awards Data," 2018. https://www.sbir.gov/api.
    - Small Business Administration. "SBIR Solicit Data," 2018. https://www.sbir.gov/api.

  10. The fractured lab notebook: undergraduate and ecological data management...

    • search.dataone.org
    Updated Nov 14, 2013
    Cite
    National Center for Ecological Analysis and Synthesis; Carly Strasser (2013). The fractured lab notebook: undergraduate and ecological data management training in the United States [Dataset]. https://search.dataone.org/view/knb.300.9
    Dataset updated
    Nov 14, 2013
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    National Center for Ecological Analysis and Synthesis; Carly Strasser
    Time period covered
    Mar 29, 2011 - May 25, 2011
    Variables measured
    Answer, Coding, EndDate, Question, R script, StartDate, First Name, Param name, Description, RespondentID, and 157 more
    Description

    Data presented here are those collected from a survey of Ecology professors at 48 undergraduate institutions to assess the current state of data management education. The following files have been uploaded:

    Scripts (2):

    1. DataCleaning_20120105.R is an R script for cleaning up data prior to analysis. This script removes spaces, substitutes text for codes, removes duplicate schools, and converts questions and answers from the survey into simpler parameter names, without any numbers, spaces, or symbols. This script is heavily annotated to assist the user of the file in understanding what is being done to the data files. The script produces the file cleandata_[date].Rdata, which is called in the file DataTrimming_20120105.R.
    2. DataTrimming_20120105.R is an R script for trimming extraneous variables not used in final analyses. Some variables are combined as needed and NAs (no answers) are removed. The file is heavily annotated. It produces trimdata_[date].Rdata, which was imported into Excel for summary statistics.

    Data files (3):

    3. AdvancedSpreadsheet_20110526.csv is the output file from the SurveyMonkey online survey tool used for this project. It is a .csv sheet with the complete set of survey data, although some data (e.g., open-ended responses, institution names) are removed to prevent schools and/or instructors from being identifiable. This file is read into DataCleaning_20120105.R for cleaning and editing.
    4. VariableRenaming_20110711.csv is called into the DataCleaning_20120105.R script to convert the questions and answers from the survey into simple parameter names, without any numbers, spaces, or symbols.
    5. ParamTable.csv is a list of the parameter names used for analysis and the value codes. It can be used to understand outputs from the scripts above (cleandata_[date].Rdata and trimdata_[date].Rdata).

  11. Data from: Data and code from: A natural polymer material as a pesticide...

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: A natural polymer material as a pesticide adjuvant for mitigating off-target drift and protecting pollinator health [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-a-natural-polymer-material-as-a-pesticide-adjuvant-for-mitigating-off-t
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains all data and code required to clean the data, fit the models, and create the figures and tables for the laboratory experiment portion of the manuscript:

    Kannan, N., Q. D. Read, and W. Zhang. 2024. A natural polymer material as a pesticide adjuvant for mitigating off-target drift and protecting pollinator health. Heliyon, in press. https://doi.org/10.1016/j.heliyon.2024.e35510.

    In this dataset, we archive results from several laboratory and field trials testing different adjuvants (spray additives) that are intended to reduce particle drift, increase particle size, and slow down the particles from pesticide spray nozzles. We fit statistical models to the droplet size and speed distribution data and statistically compare different metrics between the adjuvants (sodium alginate, polyacrylamide [PAM], and control without any adjuvants).

    The following files are included:

    - RawDataPAMsodAlgOxfLsr.xlsx: Raw data for primary analyses
    - OrganizedDataPaperRevision20240614.xlsx: Raw data to produce density plots presented in Figs. 8 and 9
    - raw_data_readme.md: Markdown file with description of the raw data files
    - R_code_supplement.R: All R code required to reproduce primary analyses
    - R_code_supplement2.R: R code required to produce density plots presented in Figs. 8 and 9

    Intermediate R output files are also included so that tables and figures can be recreated without having to rerun the data preprocessing, model fitting, and posterior estimation steps:

    - pam_cleaned.RData: Data combined into clean R data frames for analysis
    - velocityscaledlogdiamfit.rds: Fitted brms model object for velocity
    - lnormfitreduced.rds: Fitted brms model object for diameter distribution
    - emm_con_velo_diam_draws.RData: Posterior distributions of estimated marginal means for velocity
    - emm_con_draws.RData: Posterior distributions of estimated marginal means for diameter distribution

    The following software and package versions were used:

    - R version 4.3.1
    - CmdStan version 2.33.1
    - R packages:
      - brms version 2.20.5
      - cmdstanr version 0.5.3
      - fitdistrplus version 1.1-11
      - tidybayes version 3.0.4
      - emmeans version 1.8.9

  12. USGS-CMG time-series data: GLOBEC_GSC - 490 - 4901-a

    • geoport.usgs.esipfed.org
    Updated Apr 11, 2017
    Cite
    Rich Signell (2017). USGS-CMG time-series data: GLOBEC_GSC - 490 - 4901-a [Dataset]. https://geoport.usgs.esipfed.org/erddap/info/4901-a/index.html
    Dataset updated
    Apr 11, 2017
    Dataset provided by
    Ellyn Montgomery
    Authors
    Rich Signell
    Time period covered
    Jan 15, 1997 - Aug 17, 1997
    Variables measured
    crs, east, temp, time, north, rotor, vdir_1, vspd_1, bearing, altitude, and 4 more
    Description

    USGS-CMG time-series data from the GLOBEC Great South Channel Circulation Experiment project, mooring 490 and package 4901-a. A moored array program to investigate the recirculation of water and plankton around Georges Bank. _NCProperties=version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.17 cdm_data_type=TimeSeries cdm_timeseries_variables=latitude, longitude, altitude, feature_type_instance contributor_name=R. Schlitz contributor_role=principalInvestigator Conventions=CF-1.6,ACDD-1.3, COARDS COORD_SYSTEM=GEOGRAPHICAL CREATION_DATE=28-May-2008 14:23:41 DATA_ORIGIN=USGS DATA_TYPE=TIME date_metadata_modified=2017-04-11T22:03:00Z DESCRIPT=VACM-C, GREAT SOUTH CHANNEL SITE 7, CLEAN DATA: NOT SCRUBBED Easternmost_Easting=-68.28616 featureType=TimeSeries geospatial_bounds=POINT(-68.28616333007812 40.5168342590332) geospatial_bounds_crs=EPSG:4326 geospatial_lat_max=40.51683 geospatial_lat_min=40.51683 geospatial_lat_resolution=0 geospatial_lat_units=degrees_north geospatial_lon_max=-68.28616 geospatial_lon_min=-68.28616 geospatial_lon_resolution=0 geospatial_lon_units=degrees_east geospatial_vertical_max=-5.0 geospatial_vertical_min=-5.0 geospatial_vertical_positive=up geospatial_vertical_resolution=0 geospatial_vertical_units=m grid_mapping_epsg_code=EPSG:4326 grid_mapping_inverse_flattening=298.257223563 grid_mapping_long_name=http://www.opengis.net/def/crs/EPSG/0/4326 grid_mapping_name=latitude_longitude grid_mapping_semi_major_axis=6378137.0 history=Fri Nov 1 20:17:32 2019: ncatted -a project,global,a,c,, CMG_Portal GLOBEC_GSC/4901-a.nc corrected sign of lon using fix_poslon.m: 2017-04-11T22:03:00Z - pyaxiom - File created using pyaxiom id=4901-a infoUrl=https://stellwagen.er.usgs.gov/ institution=USGS Coastal and Marine Geology Program keywords_vocabulary=GCMD Science Keywords latitude=40.516834 longitude=-68.28616 magnetic_variation=-17.0 MOORING=490 naming_authority=gov.usgs.cmgp ncei_template_version=NCEI_NetCDF_TimeSeries_Orthogonal_Template_v2.0 
NCO=netCDF Operators version 4.8.1 (Homepage = http://nco.sf.net, Code = https://github.com/nco/nco) Northernmost_Northing=40.51683 original_filename=4901-a.nc original_folder=GLOBEC_GSC project=U.S. Geological Survey Oceanographic Time-Series Data, CMG_Portal project_summary=A moored array program to investigate the recirculation of water and plankton around Georges Bank. project_title=GLOBEC Great South Channel Circulation Experiment sampling_interval=450 source=USGS sourceUrl=(local files) Southernmost_Northing=40.51683 standard_name_vocabulary=CF Standard Name Table v29 start_time=97- I -15 19.33.45 stop_time=97-VIII-17 09.56.15 subsetVariables=latitude, longitude, altitude, feature_type_instance time_coverage_duration=PT18454950S time_coverage_end=1997-08-17T09:56:15Z time_coverage_start=1997-01-15T19:33:45Z WATER_DEPTH=101 water_depth=101.0 Westernmost_Easting=-68.28616

  13. Scripts for cleaning and analysis of data from SOFC experiment on...

    • data.4tu.nl
    zip
    Updated Aug 27, 2024
    Cite
    Berend van Veldhuizen (2024). Scripts for cleaning and analysis of data from SOFC experiment on inclination test-bench. [Dataset]. http://doi.org/10.4121/ed0a0cff-7af9-4d3a-baf7-aab5efe39bd1.v1
    Available download formats: zip
    Dataset updated
    Aug 27, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Berend van Veldhuizen
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    2023
    Dataset funded by
    European Commission
    Description

    This data set contains the scripts used for importing, trimming, cleaning, analysing, and plotting a large dataset of inclination experiments with an SOFC module. The measurement data is confidential, so it could not be published alongside the scripts. One row of dummy input data is published to illustrate the structure of the analysed data. The analysis is used for the journal paper "Experimental Evaluation of a Solid Oxide Fuel Cell System Exposed to Inclinations and Accelerations by Ship Motions".

    The scripts contain:

    - A script that reads the data, removes unusable data and transforms into analysable dataframes (Clean and trim.R)

    - Two files to make a wide variety of plots (Plotting.R and Specificplots.R)

    - A file that does a Gaussian process regression to estimate the degradation rate (Degradation estimation.R)

  14. ANHUI CLEAN ENERGY CO.,LTDADDRESS NO318 GUANYIN R.|Full export Customs Data...

    • tradeindata.com
    Updated Mar 2, 2022
    Cite
    tradeindata (2022). ANHUI CLEAN ENERGY CO.,LTDADDRESS NO318 GUANYIN R.|Full export Customs Data Records|tradeindata [Dataset]. https://www.tradeindata.com/supplier_detail/?id=d68b2daa8f863333226e966775c979c9
    Dataset updated
    Mar 2, 2022
    Dataset authored and provided by
    tradeindata
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Customs records are available for ANHUI CLEAN ENERGY CO.,LTDADDRESS NO318 GUANYIN R.. Learn about its importers, supply capabilities, and the countries to which it supplies goods.

  15. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

    • datasearch.gesis.org
    • openicpsr.org
    Updated Feb 19, 2020
    Cite
    Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017 [Dataset]. http://doi.org/10.3886/E105403V3
    Dataset updated
    Feb 19, 2020
    Dataset provided by
    da|ra (Registration agency for social science and economic data)
    Authors
    Kaplan, Jacob
    Description

    For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.

    Version 3 release notes:

    - Adds data in the following formats: Excel.
    - Changes project name to avoid confusing this data for the ones done by NACJD.

    Version 2 release notes:

    - Adds data for 2017.
    - Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.

    Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and precious metals, guns), the value of property stolen and the value of property recovered are provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported).

    All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.

    There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."

    For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.

  16. Alaska Geochemical Database Version 3.0 (AGDB3) including best value data...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Alaska Geochemical Database Version 3.0 (AGDB3) including best value data compilations for rock, sediment, soil, mineral, and concentrate sample media [Dataset]. https://catalog.data.gov/dataset/alaska-geochemical-database-version-3-0-agdb3-including-best-value-data-compilations-for-r
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Alaska
    Description

    The Alaska Geochemical Database Version 3.0 (AGDB3) contains new geochemical data compilations in which each geologic material sample has one best value determination for each analyzed species, greatly improving speed and efficiency of use. Like the Alaska Geochemical Database Version 2.0 before it, the AGDB3 was created and designed to compile and integrate geochemical data from Alaska to facilitate geologic mapping, petrologic studies, mineral resource assessments, definition of geochemical baseline values and statistics, element concentrations and associations, environmental impact assessments, and studies in public health associated with geology. This relational database, created from databases and published datasets of the U.S. Geological Survey (USGS), Atomic Energy Commission National Uranium Resource Evaluation (NURE), Alaska Division of Geological & Geophysical Surveys (DGGS), U.S. Bureau of Mines, and U.S. Bureau of Land Management serves as a data archive in support of Alaskan geologic and geochemical projects and contains data tables in several different formats describing historical and new quantitative and qualitative geochemical analyses. The analytical results were determined by 112 laboratory and field analytical methods on 396,343 rock, sediment, soil, mineral, heavy-mineral concentrate, and oxalic acid leachate samples. Most samples were collected by personnel of these agencies and analyzed in agency laboratories or, under contracts, in commercial analytical laboratories. These data represent analyses of samples collected as part of various agency programs and projects from 1938 through 2017. In addition, mineralogical data from 18,138 nonmagnetic heavy-mineral concentrate samples are included in this database. 
The AGDB3 includes historical geochemical data archived in the USGS National Geochemical Database (NGDB), the NURE Hydrogeochemical and Stream Sediment Reconnaissance databases, and the DGGS Geochemistry database. Retrievals from these databases were used to generate most of the AGDB data set. These data were checked for accuracy regarding sample location, sample media type, and analytical methods used. Corrections were entered, resulting in a significantly improved Alaska geochemical dataset, the AGDB3. In other words, the data of the AGDB3 supersede the data in the AGDB and the AGDB2, but users of the current AGDB3 still need the background about the data in these two earlier versions to understand what has been done to amend, clean up, correct, and format this data. Data that were not previously in these databases, because the data predate the earliest agency geochemical databases or were once excluded for programmatic reasons, are included here in the AGDB3 and will be added to the NGDB and Alaska Geochemistry. The AGDB3 data provided here are the most accurate and complete to date and should be useful for a wide variety of geochemical studies. The AGDB3 data provided in the online version of the database may be updated or changed periodically.

  17. Dataset: Screening Causal Assessment of Brook Trout Occurrence and Road...

    • s.cnmilf.com
    • catalog.data.gov
    Updated Apr 25, 2025
    Cite
    U.S. EPA Office of Research and Development (ORD) (2025). Dataset: Screening Causal Assessment of Brook Trout Occurrence and Road Runoff 20250218 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/dataset-screening-causal-assessment-of-brook-trout-occurrence-and-road-runoff-20250218
    Dataset updated
    Apr 25, 2025
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Pedigree of all data and processing included in the manuscript. Open zip file, then access pedigree folder for file describing all other folders, links, and data dictionary.

    Items:
    NOTES: Description of work and other worksheets.
    Pedigree: Summary source files used to create figures and tables.
    DataFiles: Data files used in the R code for creating the figures and tables.
    DataDictionary: Data file titles in all data files.
    Data: Data file uploaded to Science Hub.
    Output: Files generated from R scripts.
    Plot: Plots generated from R scripts and other software.
    R_Scripts: Clean R scripts used to analyze the data, generate figures and tables.
    Result: Tables generated from R scripts.

  18. Data from: Designing data science workshops for data-intensive environmental...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    zip
    Updated Dec 8, 2020
    Cite
    Allison Theobold; Stacey Hancock; Sara Mannheimer (2020). Designing data science workshops for data-intensive environmental science research [Dataset]. http://doi.org/10.5061/dryad.7wm37pvp7
    Available download formats: zip
    Dataset updated
    Dec 8, 2020
    Dataset provided by
    California State Polytechnic University
    Montana State University
    Authors
    Allison Theobold; Stacey Hancock; Sara Mannheimer
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum," the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.

    Methods: Surveys from Carpentries-style workshops, the results of which are presented in the accompanying manuscript.

    Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.

    The surveys administered for the fall 2018, spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files. 
    The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw.
    
      The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey.
    
    
    The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively. 
    The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean. 
    The summaries and visualizations presented in the manuscript are included in the analysis annotated RMarkdown file.
    
  19. OpenML R Bot Benchmark Data (final subset)

    • figshare.com
    application/gzip
    Updated May 18, 2018
    Cite
    Daniel Kühn; Philipp Probst; Janek Thomas; Bernd Bischl (2018). OpenML R Bot Benchmark Data (final subset) [Dataset]. http://doi.org/10.6084/m9.figshare.5882230.v2
    Available download formats: application/gzip
    Dataset updated
    May 18, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Daniel Kühn; Philipp Probst; Janek Thomas; Bernd Bischl
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is a clean subset of the data that was created by the OpenML R Bot, which executed benchmark experiments on the binary classification tasks of the OpenML100 benchmarking suite with six R algorithms: glmnet, rpart, kknn, svm, ranger and xgboost. The hyperparameters of these algorithms were drawn randomly. In total it contains more than 2.6 million benchmark experiments and can be used by other researchers. The subset was created by taking 500000 results of each learner (except for kknn, for which only 1140 results are available). The csv-file for each learner is a table that has one row per benchmark experiment, containing: OpenML-Data ID, hyperparameter values, performance measures (AUC, accuracy, brier score), runtime, scimark (runtime reference of the machine), and some meta features of the dataset. OpenMLRandomBotResults.RData (format for R) contains all data in separate tables for the results, the hyperparameters, the meta features, the runtime, the scimark results and reference results.

  20. Trace Metal clean rosette bottle hydrographic and nutrient data from R/V...

    • search.dataone.org
    • bco-dmo.org
    • +1 more
    Updated Dec 5, 2021
    Cite
    Ken Johnson; Dr Zanna Chase (2021). Trace Metal clean rosette bottle hydrographic and nutrient data from R/V Roger Revelle cruise DRFT08RR from the Southern Ocean, south of New Zealand in 2002 (SOFeX project) [Dataset]. https://search.dataone.org/view/sha256:b59db2cbdbba6c35b0f7b99c2d2c3f54455d566000d044ccb78d2700fece2a5d
    Dataset updated
    Dec 5, 2021
    Dataset provided by
    Biological and Chemical Oceanography Data Management Office (BCO-DMO)
    Authors
    Ken Johnson; Dr Zanna Chase
    Area covered
    Southern Ocean
    Description

    Trace Metal Clean Rosette bottle hydrographic and nutrient data

Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177

Data Cleaning Sample

Explore at:
157 scholarly articles cite this dataset (View in Google Scholar)