100+ datasets found
  1. B

    Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. q

    Writing Clean Code in R Workshop

    • qubeshub.org
    Updated Oct 15, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Max Joseph; Leah Wasser (2019). Writing Clean Code in R Workshop [Dataset]. https://qubeshub.org/publications/1442
    Explore at:
    Dataset updated
    Oct 15, 2019
    Dataset provided by
    QUBES
    Authors
    Max Joseph; Leah Wasser
    Description

    When working with data, you often spend the most amount of time cleaning your data. Learn how to write more efficient code using the tidyverse in R.

  3. f

    Cleaned NHANES 1988-2018

    • figshare.com
    txt
    Updated Feb 18, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
    Explore at:
    txtAvailable download formats
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    figshare
    Authors
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data and have considerable potential to study the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables conveydemographics (281 variables),dietary consumption (324 variables),physiological functions (1,040 variables),occupation (61 variables),questionnaires (1444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),medications (29 variables),mortality information linked from the National Death Index (15 variables),survey weights (857 variables),environmental exposure biomarker measurements (598 variables), andchemical comments indicating which measurements are below or above the lower limit of detection (505 variables).csv Data Record: The curated NHANES datasets and the data dictionaries includes 23 .csv files and 1 excel file.The curated NHANES datasets involves 20 .csv formatted files, two for each module with one as the uncleaned version and the other as the cleaned version. The modules are labeled as the following: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments."dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES."dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.“dictionary_drug_codes.csv” contains the dictionary for descriptors on the drugs codes.“nhanes_inconsistencies_documentation.xlsx” is an excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.R Data Record: For researchers who want to conduct their analysis in the R programming language, only cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file which include an .RData file and an .R file.“w - nhanes_1988_2018.RData” contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.“m - nhanes_1988_2018.R” shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.“example_0 - merge_datasets_together.Rmd” demonstrates how to merge the curated NHANES datasets together.“example_1 - account_for_nhanes_design.Rmd” demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.“example_2 - calculate_summary_statistics.Rmd” demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.“example_3 - run_multiple_regressions.Rmd” demonstrates how run multiple regression models with and without adjusting for the sampling design.

  4. Data cleaning EVI2

    • figshare.com
    txt
    Updated May 13, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Geraldine Klarenberg (2019). Data cleaning EVI2 [Dataset]. http://doi.org/10.6084/m9.figshare.5327527.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 13, 2019
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Geraldine Klarenberg
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Scripts to clean EVI2 data obtained from the VIP lab (University of Arizona) website (https://vip.arizona.edu/about.php and https://vip.arizona.edu/viplab_data_explorer.php). Data obtained in 2012.- outlier detection and removal/replacement- alignment of 2 periodsThe manuscript detailing the methods and resulting data sets has been accepted for publication in Nature Scientific Data (05/11/2019).Instructions: use the R Markdown html file for instructions!Code last manipulated and tested in R 3.4.3 ("Kite-Eating Tree")

  5. Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem - the dataset [Dataset]. http://doi.org/10.5281/zenodo.1297925
    Explore at:
    text/x-python, zip, bin, application/gzipAvailable download formats
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on detalization level (see Step 2 for more details):
    - up to 2Tb of disk space (see Step 2 detalization levels)
    - at least 16Gb of RAM (64 preferable)
    - few hours to few month of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    ####Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    ####Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  6. R Code of Simulations

    • catalog.data.gov
    Updated Nov 12, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). R Code of Simulations [Dataset]. https://catalog.data.gov/dataset/r-code-of-simulations
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Description

    The sims zip file contains R code and accompanying files needed to run the R code. Overall this code demonstrates the R code used in the study is fully functional, documented, and reproducible and that this code could reproduce the simulation results from the study with sufficient computing time. The code as presented is for a single simulated dataset and will produce estimates and confidence intervals produced by all the methods used within the study when run on that one dataset. This dataset is associated with the following publication: Nethery, R., F. Mealli, J. Sacks, and F. Dominici. Evaluation of the Health Impacts of the 1990 Clean Air Act Amendments Using Causal Inference and Machine Learning. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION. Taylor & Francis Group, London, UK, 1-12, (2020).

  7. E

    USGS-CMG time-series data: GLOBEC_GSC - 490 - 4901-a

    • geoport.usgs.esipfed.org
    Updated Apr 11, 2017
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rich Signell (2017). USGS-CMG time-series data: GLOBEC_GSC - 490 - 4901-a [Dataset]. https://geoport.usgs.esipfed.org/erddap/info/4901-a/index.html
    Explore at:
    Dataset updated
    Apr 11, 2017
    Dataset provided by
    Ellyn Montgomery
    Authors
    Rich Signell
    Time period covered
    Jan 15, 1997 - Aug 17, 1997
    Area covered
    Variables measured
    crs, east, temp, time, north, rotor, vdir_1, vspd_1, bearing, altitude, and 4 more
    Description

    USGS-CMG time-series data from the GLOBEC Great South Channel Circulation Experiment project, mooring 490 and package 4901-a. A moored array program to investigate the recirculation of water and plankton around Georges Bank. _NCProperties=version=1|netcdflibversion=4.4.1.1|hdf5libversion=1.8.17 cdm_data_type=TimeSeries cdm_timeseries_variables=latitude, longitude, altitude, feature_type_instance contributor_name=R. Schlitz contributor_role=principalInvestigator Conventions=CF-1.6,ACDD-1.3, COARDS COORD_SYSTEM=GEOGRAPHICAL CREATION_DATE=28-May-2008 14:23:41 DATA_ORIGIN=USGS DATA_TYPE=TIME date_metadata_modified=2017-04-11T22:03:00Z DESCRIPT=VACM-C, GREAT SOUTH CHANNEL SITE 7, CLEAN DATA: NOT SCRUBBED Easternmost_Easting=-68.28616 featureType=TimeSeries geospatial_bounds=POINT(-68.28616333007812 40.5168342590332) geospatial_bounds_crs=EPSG:4326 geospatial_lat_max=40.51683 geospatial_lat_min=40.51683 geospatial_lat_resolution=0 geospatial_lat_units=degrees_north geospatial_lon_max=-68.28616 geospatial_lon_min=-68.28616 geospatial_lon_resolution=0 geospatial_lon_units=degrees_east geospatial_vertical_max=-5.0 geospatial_vertical_min=-5.0 geospatial_vertical_positive=up geospatial_vertical_resolution=0 geospatial_vertical_units=m grid_mapping_epsg_code=EPSG:4326 grid_mapping_inverse_flattening=298.257223563 grid_mapping_long_name=http://www.opengis.net/def/crs/EPSG/0/4326 grid_mapping_name=latitude_longitude grid_mapping_semi_major_axis=6378137.0 history=Fri Nov 1 20:17:32 2019: ncatted -a project,global,a,c,, CMG_Portal GLOBEC_GSC/4901-a.nc corrected sign of lon using fix_poslon.m: 2017-04-11T22:03:00Z - pyaxiom - File created using pyaxiom id=4901-a infoUrl=https://stellwagen.er.usgs.gov/ institution=USGS Coastal and Marine Geology Program keywords_vocabulary=GCMD Science Keywords latitude=40.516834 longitude=-68.28616 magnetic_variation=-17.0 MOORING=490 naming_authority=gov.usgs.cmgp ncei_template_version=NCEI_NetCDF_TimeSeries_Orthogonal_Template_v2.0 NCO=netCDF Operators version 4.8.1 (Homepage = http://nco.sf.net, Code = https://github.com/nco/nco) Northernmost_Northing=40.51683 original_filename=4901-a.nc original_folder=GLOBEC_GSC project=U.S. Geological Survey Oceanographic Time-Series Data, CMG_Portal project_summary=A moored array program to investigate the recirculation of water and plankton around Georges Bank. project_title=GLOBEC Great South Channel Circulation Experiment sampling_interval=450 source=USGS sourceUrl=(local files) Southernmost_Northing=40.51683 standard_name_vocabulary=CF Standard Name Table v29 start_time=97- I -15 19.33.45 stop_time=97-VIII-17 09.56.15 subsetVariables=latitude, longitude, altitude, feature_type_instance time_coverage_duration=PT18454950S time_coverage_end=1997-08-17T09:56:15Z time_coverage_start=1997-01-15T19:33:45Z WATER_DEPTH=101 water_depth=101.0 Westernmost_Easting=-68.28616

  8. Z

    A dataset for temporal analysis of files related to the JFK case

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Luczak-Roesch, Markus (2020). A dataset for temporal analysis of files related to the JFK case [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1042153
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset authored and provided by
    Luczak-Roesch, Markus
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the content of the subset of all files with a correct publication date from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release). This content was extracted from the source PDF files using the R OCR libraries tesseract and pdftools.

    The code to derive the dataset is given as follows:

    BEGIN R DATA PROCESSING SCRIPT

    library(tesseract) library(pdftools)

    pdfs <- list.files("[path to your output directory containing all PDF files]")

    meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv",header = T,sep = ',') #the meta file containing all metadata for the PDF files (e.g. publication date)

    meta$Doc.Date <- as.character(meta$Doc.Date)

    meta.clean <- meta[-which(meta$Doc.Date=="" | grepl("/0000",meta$Doc.Date)),] for(i in 1:nrow(meta.clean)){ meta.clean$Doc.Date[i] <- gsub("00","01",meta.clean$Doc.Date[i])

    if(nchar(meta.clean$Doc.Date[i])<10){ meta.clean$Doc.Date[i]<-format(strptime(meta.clean$Doc.Date[i],format = "%d/%m/%y"),"%m/%d/%Y") }

    }

    meta.clean$Doc.Date <- strptime(meta.clean$Doc.Date,format = "%m/%d/%Y")

    meta.clean <- meta.clean[order(meta.clean$Doc.Date),]

    docs <- data.frame(content=character(0),dpub=character(0),stringsAsFactors = F) for(i in 1:nrow(meta.clean)){

    for(i in 1:3){

    pdf_prop <- pdftools::pdf_info(paste0("[path to your output directory]/",tolower(meta.clean$File.Name[i]))) tmp_files <- c() for(k in 1:pdf_prop$pages){ tmp_files <- c(tmp_files,paste0("/home/STAFF/luczakma/RProjects/JFK/data/tmp/",k)) }

    img_file <- pdftools::pdf_convert(paste0("[path to your output directory]/",tolower(meta.clean$File.Name[i])), format = 'tiff', pages = NULL, dpi = 700,filenames = tmp_files)

    txt <- ""

    for(j in 1:length(img_file)){ extract <- ocr(img_file[j], engine = tesseract("eng")) #unlink(img_file) txt <- paste(txt,extract,collapse = " ") }

    docs <- rbind(docs,data.frame(content=iconv(tolower(gsub("\s+"," ",gsub("[[:punct:]]|[ ]"," ",txt))),to="UTF-8"),dpub=format(meta.clean$Doc.Date[i],"%Y/%m/%d"),stringsAsFactors = F),stringsAsFactors = F) }

    write.table(docs,"[path to your output directory]/documents.csv", row.names = F)

    END R DATA PROCESSING SCRIPT

  9. H

    Data from: SBIR - STTR Data and Code for Collecting Wrangling and Using It

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Nov 5, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Grant Allard (2018). SBIR - STTR Data and Code for Collecting Wrangling and Using It [Dataset]. http://doi.org/10.7910/DVN/CKTAZX
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 5, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Grant Allard
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data set consisting of data joined for analyzing the SBIR/STTR program. Data consists of individual awards and agency-level observations. The R and python code required for pulling, cleaning, and creating useful data sets has been included. Allard_Get and Clean Data.R This file provides the code for getting, cleaning, and joining the numerous data sets that this project combined. This code is written in the R language and can be used in any R environment running R 3.5.1 or higher. If the other files in this Dataverse are downloaded to the working directory, then this Rcode will be able to replicate the original study without needing the user to update any file paths. Allard SBIR STTR WebScraper.py This is the code I deployed to multiple Amazon EC2 instances to scrape data o each individual award in my data set, including the contact info and DUNS data. Allard_Analysis_APPAM SBIR project Forthcoming Allard_Spatial Analysis Forthcoming Awards_SBIR_df.Rdata This unique data set consists of 89,330 observations spanning the years 1983 - 2018 and accounting for all eleven SBIR/STTR agencies. This data set consists of data collected from the Small Business Administration's Awards API and also unique data collected through web scraping by the author. Budget_SBIR_df.Rdata 246 observations for 20 agencies across 25 years of their budget-performance in the SBIR/STTR program. Data was collected from the Small Business Administration using the Annual Reports Dashboard, the Awards API, and an author-designed web crawler of the websites of awards. Solicit_SBIR-df.Rdata This data consists of observations of solicitations published by agencies for the SBIR program. This data was collected from the SBA Solicitations API. Primary Sources Small Business Administration. “Annual Reports Dashboard,” 2018. https://www.sbir.gov/awards/annual-reports. Small Business Administration. “SBIR Awards Data,” 2018. https://www.sbir.gov/api. Small Business Administration. “SBIR Solicit Data,” 2018. https://www.sbir.gov/api.

  10. d

    Data from: Data and code from: A natural polymer material as a pesticide...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Agricultural Research Service (2025). Data and code from: A natural polymer material as a pesticide adjuvant for mitigating off-target drift and protecting pollinator health [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-a-natural-polymer-material-as-a-pesticide-adjuvant-for-mitigating-off-t
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains all data and code required to clean the data, fit the models, and create the figures and tables for the laboratory experiment portion of the manuscript:Kannan, N., Q. D. Read, and W. Zhang. 2024. A natural polymer material as a pesticide adjuvant for mitigating off-target drift and protecting pollinator health. Heliyon, in press. https://doi.org/10.1016/j.heliyon.2024.e35510.In this dataset, we archive results from several laboratory and field trials testing different adjuvants (spray additives) that are intended to reduce particle drift, increase particle size, and slow down the particles from pesticide spray nozzles. We fit statistical models to the droplet size and speed distribution data and statistically compare different metrics between the adjuvants (sodium alginate, polyacrylamide [PAM], and control without any adjuvants). The following files are included:RawDataPAMsodAlgOxfLsr.xlsx: Raw data for primary analysesOrganizedDataPaperRevision20240614.xlsx: Raw data to produce density plots presented in Figs. 8 and 9raw_data_readme.md: Markdown file with description of the raw data filesR_code_supplement.R: All R code required to reproduce primary analysesR_code_supplement2.R: R code required to produce density plots presented in Figs. 8 and 9Intermediate R output files are also included so that tables and figures can be recreated without having to rerun the data preprocessing, model fitting, and posterior estimation steps:pam_cleaned.RData: Data combined into clean R data frames for analysisvelocityscaledlogdiamfit.rds: Fitted brms model object for velocitylnormfitreduced.rds: Fitted brms model object for diameter distributionemm_con_velo_diam_draws.RData: Posterior distributions of estimated marginal means for velocityemm_con_draws.RData: Posterior distributions of estimated marginal means for diameter distributionThe following software and package versions were used:R version 4.3.1CmdStan version 2.33.1R packages:brms version 2.20.5cmdstanr version 0.5.3fitdistrplus version 1.1-11tidybayes version 3.0.4emmeans version 1.8.9

  11. d

    Data from: Designing data science workshops for data-intensive environmental...

    • datadryad.org
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Dec 8, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Allison Theobold; Stacey Hancock; Sara Mannheimer (2020). Designing data science workshops for data-intensive environmental science research [Dataset]. http://doi.org/10.5061/dryad.7wm37pvp7
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 8, 2020
    Dataset provided by
    Dryad
    Authors
    Allison Theobold; Stacey Hancock; Sara Mannheimer
    Time period covered
    Nov 14, 2020
    Description

    Surveys from Carpentries style workshops the results of which are presented in the accompanying manuscript.

    Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.

    The surveys administered for the fall 2018, spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files. 
    The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw.
    
      The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey.
    
    
    The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, r...
    
  12. Bitter Creek Analysis Pedigree Data

    • s.cnmilf.com
    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    • +1more
    Updated Sep 25, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Bitter Creek Analysis Pedigree Data [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/bitter-creek-analysis-pedigree-data
    Explore at:
    Dataset updated
    Sep 25, 2022
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Description

    These data sets contain raw and processed data used in for analyses, figures, and tables in the Region 8 Memo: Characterization of chloride and conductivity levels in the Bitter Creek Watershed, WY. However, these data may be used for other analyses alone or in combination with other or new data. These data were used to assess whether chloride levels are naturally high in streams in the Bitter Creek, WY watershed and how chloride concentrations expected to protect 95 percent of aquatic genera in these streams compare to Wyoming’s chloride criteria applicable to the Bitter Creek watershed. Owing to the arid conditions, background conductivity and chloride levels were characterized for surface flow and ground water flow conditions. Natural chloride levels were found to be less than current water quality criteria for Wyoming. Although the report was prepared for USEPA Region 8 and OST, Office of Water, the report will be of interest to the WDEQ, Sweetwater County Conservation District, and the regulated community. No formal metadata standard was used. Pedigree.xlsx contains: 1. NOTES: Description of work and other worksheets. 2. Pedigree_Summary: Source files used to create figures and tables. 3. DataFiles: Data files used in the R code for creating the figures and tables 4. R_Script: Summary of the R scripts. 5. DataDictionary: Data file titles in all data files Folders: _Datasets Data file uploaded to Environmental Dataset Gateway "A list of subfolders: _R: Clean R scripts used to generate document figures and tables _Tables_Figures: Files generated from R script and used in the Region 6 memo R Code and Data: All additional files used for this project, including original files, intermediate files, extra output files, and extra functions the ""_R"" folder stores R scripts for input and output files and an R project file.. Users can open the R project and run R scripts directly from the ""_R"" folder or the XC95 folder by installing R, RStudio, and associated R packages."

  13. h

    libritts-r-clean-with-audio

    • huggingface.co
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jaehoon Kang, libritts-r-clean-with-audio [Dataset]. https://huggingface.co/datasets/morateng/libritts-r-clean-with-audio
    Explore at:
    Authors
    Jaehoon Kang
    Description

    morateng/libritts-r-clean-with-audio dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. g

    Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

    • datasearch.gesis.org
    • openicpsr.org
    Updated Feb 19, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017 [Dataset]. http://doi.org/10.3886/E105403V3
    Explore at:
    Dataset updated
    Feb 19, 2020
    Dataset provided by
    da|ra (Registration agency for social science and economic data)
    Authors
    Kaplan, Jacob
    Description

    For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.Version 3 release notes:Adds data in the following formats: Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Version 2 release notes:Adds data for 2017.Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and previous metals, guns), the value of property stolen and and the value for property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported). All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.

  15. n

    Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Jan 9, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Massachusetts General Hospital
    Harvard Medical School
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

    Methods eLAB Development and Source Code (R statistical software):

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown ((https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.

  16. H

    Replication Data for: Race, gender, and the politics of incivility

    • dataverse.harvard.edu
    Updated Jun 10, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sam Gubitz (2020). Replication Data for: Race, gender, and the politics of incivility [Dataset]. http://doi.org/10.7910/DVN/ODPNI8
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 10, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Sam Gubitz
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Use the project file first, then open the cleaning R file to clean the raw data. Then use the R file called OLS analysis to analyze the cleaned data, which was outputted as a .rds file.

  17. Data set: St. Louis River Watershed, MN Conductivity Assessment March 2022

    • catalog.data.gov
    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    Updated Jul 18, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. EPA Office of Research and Development (ORD) (2025). Data set: St. Louis River Watershed, MN Conductivity Assessment March 2022 [Dataset]. https://catalog.data.gov/dataset/data-set-st-louis-river-watershed-mn-conductivity-assessment-march-2022
    Explore at:
    Dataset updated
    Jul 18, 2025
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Area covered
    Saint Louis River, Minnesota
    Description

    Data used to evaluate potential downstream impacts of the NorthMet Mine, by USEPA Office of Research and Development is providing, for USEPA Region 5’s use, including a characterization of stream specific conductivity (SC) levels, least disturbed background SC, and SC levels that may exceed the Fond du Lac Band’s WQ standards and adversely affect aquatic life, including brook trout (Salvelinus fontinalis), lake sturgeon (Acipenser fulvescens), and benthic macroinvertebrates. Keywords: Conductivity, St. Louis River, benthic invertebrates; mining The attached Excel Pedigree includes: _Datasets: Data file uploaded to EPA Science Hub and/or Environmental Data Set Gateway _R : Clean R scripts used to generate document figures and tables _Tables_Figures: Files generated from R script and used in the Region 5 memo 20220325 R Code and Data: All additional files used for this project, including original files, intermediate files, extra output files, and extra functions. The "_R" folder contains four subfolders. Each subfolder has several R scripts, input and output files, and an R project file. Users can run R scripts directly from each subfolder by installing R, RStudio, and associated R packages. Data Dictionary: See tab DataDictionary in Excel file Datasets: Simplified language is used in the text to identify parent data sets. Source and File names are retained in this pedigree in original form to enable R-scripts to retain functionality. • Thingvold et al. (1975-1977) • Griffith (1998-2009) • Predicted background (2000-2015) • Water Quality Portal (1996-2021) • Water Quality Portal Less Disturbed (1996-2021) • Minnesota Pollution Control Agency (MPCA) (1996-2013) • Mid-Atlantic Highlands (1990 to 2014). This dataset is associated with the following publication: Cormier, S., and Y. Wang. Appendix C: ORD Specific Conductance Memo, from Susan Cormier to Tera Fong. March 15, 2022. Assessment of effects of increased ion concentrations in the St. Louis River Watershed with special attention to potential mining influence and the jurisdiction of the Fond du Lac Band of Lake Superior Chippewa. U.S. Environmental Protection Agency, Washington, DC, USA, 2022.

  18. d

    Alaska Geochemical Database Version 3.0 (AGDB3) including best value data...

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Sep 17, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2025). Alaska Geochemical Database Version 3.0 (AGDB3) including best value data compilations for rock, sediment, soil, mineral, and concentrate sample media [Dataset]. https://catalog.data.gov/dataset/alaska-geochemical-database-version-3-0-agdb3-including-best-value-data-compilations-for-r
    Explore at:
    Dataset updated
    Sep 17, 2025
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Description

    The Alaska Geochemical Database Version 3.0 (AGDB3) contains new geochemical data compilations in which each geologic material sample has one best value determination for each analyzed species, greatly improving speed and efficiency of use. Like the Alaska Geochemical Database Version 2.0 before it, the AGDB3 was created and designed to compile and integrate geochemical data from Alaska to facilitate geologic mapping, petrologic studies, mineral resource assessments, definition of geochemical baseline values and statistics, element concentrations and associations, environmental impact assessments, and studies in public health associated with geology. This relational database, created from databases and published datasets of the U.S. Geological Survey (USGS), Atomic Energy Commission National Uranium Resource Evaluation (NURE), Alaska Division of Geological & Geophysical Surveys (DGGS), U.S. Bureau of Mines, and U.S. Bureau of Land Management serves as a data archive in support of Alaskan geologic and geochemical projects and contains data tables in several different formats describing historical and new quantitative and qualitative geochemical analyses. The analytical results were determined by 112 laboratory and field analytical methods on 396,343 rock, sediment, soil, mineral, heavy-mineral concentrate, and oxalic acid leachate samples. Most samples were collected by personnel of these agencies and analyzed in agency laboratories or, under contracts, in commercial analytical laboratories. These data represent analyses of samples collected as part of various agency programs and projects from 1938 through 2017. In addition, mineralogical data from 18,138 nonmagnetic heavy-mineral concentrate samples are included in this database. The AGDB3 includes historical geochemical data archived in the USGS National Geochemical Database (NGDB) and NURE National Uranium Resource Evaluation-Hydrogeochemical and Stream Sediment Reconnaissance databases, and in the DGGS Geochemistry database. Retrievals from these databases were used to generate most of the AGDB data set. These data were checked for accuracy regarding sample location, sample media type, and analytical methods used. In other words, the data of the AGDB3 supersedes data in the AGDB and the AGDB2, but the background about the data in these two earlier versions are needed by users of the current AGDB3 to understand what has been done to amend, clean up, correct and format this data. Corrections were entered, resulting in a significantly improved Alaska geochemical dataset, the AGDB3. Data that were not previously in these databases because the data predate the earliest agency geochemical databases, or were once excluded for programmatic reasons, are included here in the AGDB3 and will be added to the NGDB and Alaska Geochemistry. The AGDB3 data provided here are the most accurate and complete to date and should be useful for a wide variety of geochemical studies. The AGDB3 data provided in the online version of the database may be updated or changed periodically.

  19. OpenML R Bot Benchmark Data (final subset)

    • figshare.com
    application/gzip
    Updated May 18, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Daniel Kühn; Philipp Probst; Janek Thomas; Bernd Bischl (2018). OpenML R Bot Benchmark Data (final subset) [Dataset]. http://doi.org/10.6084/m9.figshare.5882230.v2
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    May 18, 2018
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Daniel Kühn; Philipp Probst; Janek Thomas; Bernd Bischl
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a clean subset of the data that was created by the OpenML R Bot that executed benchmark experiments on binary classification task of the OpenML100 benchmarking suite with six R algorithms: glmnet, rpart, kknn, svm, ranger and xgboost. The hyperparameters of these algorithms were drawn randomly. In total it contains more than 2.6 million benchmark experiments and can be used by other researchers. The subset was created by taking 500000 results of each learner (except of kknn for which only 1140 results are available). The csv-file for each learner is a table that for each benchmark experiment has a row that contains: OpenML-Data ID, hyperparameter values, performance measures (AUC, accuracy, brier score), runtime, scimark (runtime reference of the machine), and some meta features of the dataset.OpenMLRandomBotResults.RData (format for R) contains all data in seperate tables for the results, the hyperparameters, the meta features, the runtime, the scimark results and reference results.

  20. r

    Street cleaning in City of Yarra

    • researchdata.edu.au
    Updated Oct 2, 2019
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    City of Yarra (2019). Street cleaning in City of Yarra [Dataset]. https://researchdata.edu.au/street-cleaning-city-yarra/2980765
    Explore at:
    Dataset updated
    Oct 2, 2019
    Dataset provided by
    data.gov.au
    Authors
    City of Yarra
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    Our aim is to make Yarra a clean and pleasant place for our residents to live. This data asset has information about sweeping and loose litter removal across residential roads, kerbs and public open spaces within the Yarra municipality. The street cleansing details include cleaning date and time, suburb where the cleaning was done, category of cleaning, volume of litter removed and cleaning duration.\r \r While all due care has been taken to ensure the data asset is accurate and current, Yarra City Council does not warrant that this data is definitive nor free of error and does not accept responsibility for any loss, damage, claim, expense, cost or liability whatsoever arising from reliance upon information provided herein.\r \r Feedback on the data asset - including compliments, complaints and requests for more detail - is welcome.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177

Data Cleaning Sample

Explore at:
160 scholarly articles cite this dataset (View in Google Scholar)
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License

CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

Sample data for exercises in Further Adventures in Data Cleaning.

Search
Clear search
Close search
Google apps
Main menu