83 datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. Cleaned NHANES 1988-2018

    • figshare.com
    txt
    Updated Feb 18, 2025
    Cite
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    figshare
    Authors
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, NHANES data are plagued with multiple inconsistencies, so they must be processed before deriving new insights through large-scale analyses. We therefore developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

    csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets comprise 20 .csv files, two for each module: one uncleaned version and one cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES. "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables. "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes. "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

    R Data Record: For researchers who want to conduct their analysis in the R programming language, the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file that includes an .RData file and an .R file. "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts for the customized functions that were written to curate the data. "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

    Example starter code: The set of starter code to help users conduct exposome analyses consists of four R Markdown files (.Rmd). We recommend going through the tutorials in order. "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together. "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model. "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design. "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
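
    As a flavour of what the starter tutorials cover, here is a minimal sketch that merges a few cleaned modules on the shared participant identifier SEQN and fits a survey-weighted model with the survey package. The file names are assumptions for illustration; SEQN, SDMVPSU, SDMVSTRA, WTMEC2YR, RIDAGEYR, and BMXBMI are standard NHANES column names, but refer to the .Rmd tutorials for the authoritative pipeline.

    library(survey)

    # Assumed file names for the cleaned modules (adjust to the actual downloads)
    demographics <- read.csv("nhanes_cleaned_demographics.csv")
    response     <- read.csv("nhanes_cleaned_response.csv")
    weights      <- read.csv("nhanes_cleaned_weights.csv")

    # NHANES modules share the participant identifier SEQN
    merged <- Reduce(function(x, y) merge(x, y, by = "SEQN"),
                     list(demographics, response, weights))

    # Survey-weighted design using standard NHANES design variables
    design <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA,
                        weights = ~WTMEC2YR, nest = TRUE, data = merged)

    # Example: BMI as a function of age, accounting for the sampling design
    fit <- svyglm(BMXBMI ~ RIDAGEYR, design = design)
    summary(fit)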

  3. R Code of Simulations

    • cloud.csiss.gmu.edu
    • catalog.data.gov
    zip
    Updated Mar 7, 2021
    Cite
    United States (2021). R Code of Simulations [Dataset]. http://doi.org/10.23719/1504181
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 7, 2021
    Dataset provided by
    United States
    License

    https://pasteur.epa.gov/license/sciencehub-license.html

    Description

    The sims zip file contains R code and the accompanying files needed to run it. Overall, this code demonstrates that the R code used in the study is fully functional, documented, and reproducible, and that it could reproduce the simulation results from the study given sufficient computing time. The code as presented is for a single simulated dataset and, when run on that dataset, will produce the estimates and confidence intervals from all the methods used in the study.

    This dataset is associated with the following publication: Nethery, R., F. Mealli, J. Sacks, and F. Dominici. Evaluation of the Health Impacts of the 1990 Clean Air Act Amendments Using Causal Inference and Machine Learning. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION. Taylor & Francis Group, London, UK, 1-12, (2020).

  4. Data cleaning EVI2

    • figshare.com
    txt
    Updated May 13, 2019
    Cite
    Geraldine Klarenberg (2019). Data cleaning EVI2 [Dataset]. http://doi.org/10.6084/m9.figshare.5327527.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 13, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Geraldine Klarenberg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Scripts to clean EVI2 data obtained in 2012 from the VIP lab (University of Arizona) website (https://vip.arizona.edu/about.php and https://vip.arizona.edu/viplab_data_explorer.php). The cleaning covers:
    - outlier detection and removal/replacement
    - alignment of 2 periods
    The manuscript detailing the methods and resulting data sets was accepted for publication in Nature Scientific Data (05/11/2019). Instructions: use the R Markdown html file for instructions! Code last manipulated and tested in R 3.4.3 ("Kite-Eating Tree").
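
    For orientation, a generic outlier-replacement step of the kind these scripts perform might look as follows; this is an illustrative sketch, not the repository's actual rules (those live in the R Markdown file).

    # Flag values far from the median (in MAD units) and interpolate over them
    replace_outliers <- function(x, k = 3) {
      med <- median(x, na.rm = TRUE)
      dev <- mad(x, na.rm = TRUE)
      x[abs(x - med) > k * dev] <- NA
      # approx() drops NA pairs before interpolating; rule = 2 extends the ends
      approx(seq_along(x), x, xout = seq_along(x), rule = 2)$y
    }

    evi2 <- c(0.31, 0.33, 0.95, 0.34, 0.32, 0.30)  # dummy series with one spike
    replace_outliers(evi2)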

  5. Replication Data for: realdata

    • search.dataone.org
    Updated Nov 8, 2023
    Cite
    Xu, Ningning (2023). Replication Data for: realdata [Dataset]. http://doi.org/10.7910/DVN/AFZZVP
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Xu, Ningning
    Description

    (1) dataandpathway_eisner.R, dataandpathway_bordbar.R, dataandpathway_taware.R and dataandpathway_almutawa.R: functions and code to clean the real data sets and obtain the annotation databases, which are saved as .RData files in subfolders Eisner, Bordbar, Taware and Al-Mutawa respectively. (2) FWER_excess.R: functions to show the inflation of FWER when integrating multiple annotation databases and to generate Table 1. (3) data_info.R: code to obtain Table 2 and Table 3. (4) rejections_perdataset.R and triangulartable.R: functions to generate Table 4. The running time of rejections_perdataset.R is around 7 hours, so we save the corresponding results as res_eisner.RData, res_bordbar.RData, res_taware.RData and res_almutawa.RData in subfolders Eisner, Bordbar, Taware and Al-Mutawa respectively. (5) pathwaysizerank.R: code for generating Figure 4 based on res_eisner.RData from (4). (6) iterationandtime_plot.R: code for generating Figure 5 based on the "Al-Mutawa" data. This code is very time-consuming (nearly 5 days), so we save the corresponding results and plot them in the main manuscript with pgfplot.
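
    The FWER inflation that FWER_excess.R documents can be previewed with a back-of-envelope calculation: testing the same hypotheses against K annotation databases at level alpha pushes the worst-case family-wise error rate toward 1 - (1 - alpha)^K (an upper bound, since real databases overlap). A hedged sketch:

    alpha <- 0.05
    K <- 1:4
    # Worst-case FWER if the K databases behaved like independent test families
    data.frame(databases = K, worst_case_FWER = 1 - (1 - alpha)^K)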

  6. A dataset for temporal analysis of files related to the JFK case

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jan 24, 2020
    Cite
    Markus Luczak-Roesch; Markus Luczak-Roesch (2020). A dataset for temporal analysis of files related to the JFK case [Dataset]. http://doi.org/10.5281/zenodo.1042154
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Markus Luczak-Roesch; Markus Luczak-Roesch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the content of the subset of all files with a correct publication date from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release). This content was extracted from the source PDF files using the R OCR libraries tesseract and pdftools.

    The code to derive the dataset is given as follows:

    ### BEGIN R DATA PROCESSING SCRIPT

    library(tesseract)
    library(pdftools)

    pdfs <- list.files("[path to your output directory containing all PDF files]")

    # The meta file containing all metadata for the PDF files (e.g. publication date)
    meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv", header = TRUE, sep = ",")

    meta$Doc.Date <- as.character(meta$Doc.Date)

    # Drop records with an empty or zeroed-out publication date
    meta.clean <- meta[-which(meta$Doc.Date == "" | grepl("/0000", meta$Doc.Date)), ]

    for (i in 1:nrow(meta.clean)) {
      meta.clean$Doc.Date[i] <- gsub("00", "01", meta.clean$Doc.Date[i])
      # Expand short dates (two-digit years) to the full %m/%d/%Y format
      if (nchar(meta.clean$Doc.Date[i]) < 10) {
        meta.clean$Doc.Date[i] <- format(strptime(meta.clean$Doc.Date[i], format = "%d/%m/%y"), "%m/%d/%Y")
      }
    }

    meta.clean$Doc.Date <- strptime(meta.clean$Doc.Date, format = "%m/%d/%Y")

    # Order the documents chronologically
    meta.clean <- meta.clean[order(meta.clean$Doc.Date), ]

    docs <- data.frame(content = character(0), dpub = character(0), stringsAsFactors = FALSE)

    for (i in 1:nrow(meta.clean)) {
      pdf_prop <- pdftools::pdf_info(paste0("[path to your output directory]/", tolower(meta.clean$File.Name[i])))

      tmp_files <- c()
      for (k in 1:pdf_prop$pages) {
        tmp_files <- c(tmp_files, paste0("[path to a temporary directory]/", k))
      }

      # Render each PDF page as a high-resolution TIFF for OCR
      img_file <- pdftools::pdf_convert(paste0("[path to your output directory]/", tolower(meta.clean$File.Name[i])), format = "tiff", pages = NULL, dpi = 700, filenames = tmp_files)

      txt <- ""
      for (j in 1:length(img_file)) {
        extract <- ocr(img_file[j], engine = tesseract("eng"))
        txt <- paste(txt, extract, collapse = " ")
      }

      # Strip punctuation, collapse whitespace, lower-case, and store with the publication date
      docs <- rbind(docs,
                    data.frame(content = iconv(tolower(gsub("\\s+", " ", gsub("[[:punct:]]|[ ]", " ", txt))), to = "UTF-8"),
                               dpub = format(meta.clean$Doc.Date[i], "%Y/%m/%d"),
                               stringsAsFactors = FALSE),
                    stringsAsFactors = FALSE)
    }

    write.table(docs, "[path to your output directory]/documents.csv", row.names = FALSE)

    ### END R DATA PROCESSING SCRIPT

  7. Replication Data for Reconceptualising dimensions of political competition...

    • dataverse.harvard.edu
    bin +3
    Updated Feb 26, 2019
    Cite
    Harvard Dataverse (2019). Replication Data for Reconceptualising dimensions of political competition in Europe: A demand side approach [Dataset]. http://doi.org/10.7910/DVN/1B1MXY
    Explore at:
    Available download formats: tsv, type/x-r-syntax, text/plain (us-ascii), bin
    Dataset updated
    Feb 26, 2019
    Dataset provided by
    Harvard Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Included are:
    1. The raw data (before cleaning and preprocessing), found in the files ending "Raw3". The codebooks for each of these data files end in "codebook". This will enable the user to identify the statements that are associated with the items EU1 … 7, Eco1 … 7, Cul1 … 7, AD1 and AD2 that are used in the manuscript.
    2. The R codes ending cleaning_plus.R, used to a) clean the datasets according to the procedure outlined in the online Appendix and b) remove entries with missing values for any of the variables that are used in the calibration process to produce balanced datasets (age, education, gender, political interest). Because of step b), the new datasets generated will be smaller than the clean datasets listed in Table 1 of the Appendix.
    3. For the balancing and calibrating (pre-processing), we use a) the datasets for each country generated by 2 above (the files with the suffix "_clean"), b) the file drop.py, which is the code (in Python) for the balancing algorithm based on the principle of raking (see the online Appendix, and the sketch after this list), c) the R files used to generate the new calibrated datasets used in the Mokken Scale analysis in 5 below (with the suffix "balCode"), and d) a set of files ending in the suffix "estimates" that contain the joint distributions derived from the ESS data (i) for age, below versus above the median age, and (ii) for education, degree versus no degree, as well as the marginal distributions for gender and political interest. The median ages of the voting population derived from ESS are as follows: Austria: 50, Bulgaria: 52, Croatia: 52, Cyprus: 47, Czech Republic: 50, Denmark: 50, England: 53, Estonia: 50, Finland: 54, France: 55, Germany: 53, Greece: 50, Hungary: 49, Ireland: 50, Italy: 50, Lithuania: 53, Poland: 50, Portugal: 52, Romania: 46, Slovakia: 52, Slovenia: 52, Spain: 50.
    4. A set of data files with the suffix myBal, which contain the new calibrated datasets used in the Mokken Scale analysis in 5 (below).
    5. A set of R codes for each country, beginning with the prefix "RCodes", used to generate the findings on dimensionality presented in the manuscript.
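
    The archive's own balancing code is drop.py; as a hedged R illustration of the raking principle it implements, the survey package's rake() adjusts weights until the sample margins match target margins. All names and target counts below are invented for the sketch.

    library(survey)

    # Dummy unbalanced sample
    set.seed(1)
    dat <- data.frame(gender = sample(c("m", "f"), 500, replace = TRUE),
                      degree = sample(c("yes", "no"), 500, replace = TRUE),
                      w = 1)
    design <- svydesign(ids = ~1, weights = ~w, data = dat)

    # Assumed target margins (in practice derived from the ESS "estimates" files)
    gender.targets <- data.frame(gender = c("m", "f"), Freq = c(245, 255))
    degree.targets <- data.frame(degree = c("yes", "no"), Freq = c(150, 350))

    raked <- rake(design,
                  sample.margins = list(~gender, ~degree),
                  population.margins = list(gender.targets, degree.targets))
    summary(weights(raked))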

  8. Post experiment cleanup and charts [R code] for: Tracking the Decoy

    • dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    van Emden, Robin; Iannuzzi, Davide; Kaptein, Maurits (2023). Post experiment cleanup and charts [R code] for: Tracking the Decoy [Dataset]. http://doi.org/10.7910/DVN/URXZX5
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    van Emden, Robin; Iannuzzi, Davide; Kaptein, Maurits
    Description

    R code for post experiment data cleanup and plots.

  9. Data from: Forests to Faucets 2.0

    • figshare.com
    • catalog.data.gov
    • +4more
    bin
    Updated Nov 23, 2024
    + more versions
    Cite
    U.S. Forest Service (2024). Forests to Faucets 2.0 [Dataset]. https://figshare.com/articles/dataset/Forests_to_Faucets_2_0/27886854
    Explore at:
    Available download formats: bin
    Dataset updated
    Nov 23, 2024
    Dataset provided by
    U.S. Department of Agriculture Forest Service (http://fs.fed.us/)
    Authors
    U.S. Forest Service
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Forests to Faucets 2.0 builds upon the national Forests to Faucets assessment (2011) by updating base data and adding new threats, including wildfire, invasive pests, and future stresses such as climate-induced changes in land use and water quantity. The purpose of this project is to quantify, rank, and illustrate the geographic connection between forests and other natural cover (private and public), surface drinking water supplies, and the populations that depend on them: the ecosystem service of water supply. The project assesses subwatersheds across the US to identify those important to downstream surface drinking water supplies and to evaluate each subwatershed's natural ability to produce clean water based on its biophysical characteristics: percent natural cover, percent agricultural land, percent impervious, percent riparian natural cover, and mean annual water yield. Using data from a variety of existing sources and maps generated through GIS analyses, the project uses maps and statistics to describe the relative importance of private forests and National Forest System lands to surface drinking water supplies across the United States. The data produced by this assessment provide the information needed to identify opportunities for water market approaches or schemes based upon payments for environmental services (PES).

    September 2023 Update: Water yields (Q_YLD_MM; PER_Q40_45; PER_Q90_45; PER_Q40_85; PER_Q90_85) were updated to tie back to WASSI source data. All Forests to Faucets models and indices were recalculated. HUCs that did not have a corresponding water yield from WASSI were recalculated to the nearest HUC.

    Attribute | Description | Source
    Acres | HUC 12 acres | Calculated using ArcGIS
    STATES | States | WBD 2019
    HUC12 | 12-digit Hydrologic Unit Code | WBD 2019
    NAME | 12-digit Hydrologic Unit name | WBD 2019
    HUTYPE | HUC type | WBD 2019
    From_ | HUC 12 "from" for routing | WBD 2019
    To | HUC 12 "to" for routing | WBD 2019 (edited by USDA FS)
    Level | HUC level, calculated from outlet (1) to headwater (351) | Calculated
    NLCD | Acres of NLCD | NLCD
    PER_NLCD | Percent of HUC with NLCD data | Calculated using ArcGIS
    FOREST_AC | Acres of all forest (NLCD = 41, 42, 43, 90) | NLCD
    PER_FOR | Percent forest | Calculated using ArcGIS
    AG_AC | Acres of agricultural land (NLCD = 81, 82) | NLCD
    PER_AG | Percent agricultural land | Calculated using ArcGIS
    IMPV_AC | Acres of impervious surface | NLCD
    PER_IMPV | Percent impervious | Calculated using ArcGIS
    NATCOVER_AC | Acres of natural cover (NLCD = 11, 12, 41, 42, 43, 51, 52, 71, 90, 95) | NLCD
    PER_NATCOV | Percent natural cover | Calculated using ArcGIS
    RIPNAT_AC | Acres of riparian natural cover | Sinan Abood
    PER_RIPNAT | Percent riparian natural cover | Calculated using ArcGIS
    Q_YLD_MM | Mean annual water yield in mm (Q) based on the historical period (1961 to 2015); baseline water yield for 2010 | WASSI, updated September 2023
    R_NATCOV | Natural cover score for APCW | Calculated using ArcGIS
    R_AG | Agricultural land score for APCW | Calculated using ArcGIS
    R_IMPV | Impervious surface score for APCW | Calculated using ArcGIS
    R_RIP | Riparian natural cover score for APCW | Calculated using ArcGIS
    R_Q | Mean annual water yield score for APCW | Calculated using ArcGIS
    APCW | Ability to Produce Clean Water = (R_NATCOV + R_AG + R_IMPV + R_RIP) * R_Q | Calculated using ArcGIS
    APCW_R | APCW score (0-100 quantiles) | Calculated using R
    GW | Number of groundwater intakes | SDWIS
    SW | Number of surface water intakes (includes GU, groundwater under the influence of surface water) | SDWIS
    GW_POP | Number of groundwater consumers | SDWIS
    SW_POP | Number of surface water consumers (includes GU) | SDWIS
    GL_Intakes | Number of surface water intakes in the Great Lakes | SDWIS
    GL_POP | Number of surface water consumers in the Great Lakes | SDWIS
    SUM_POP | Total number of surface water consumers (SW_POP + GL_POP) | SDWIS
    PR | Drinking water protection model, PRn = ∑(Wi * Pi) | Calculated using R
    POP_DS | Sum of surface drinking water population downstream of HUC 12, ∑(SUM_POPn); cannot be summed across multiple HUC12s due to double counting of downstream populations, so it is only accurate for an individual HUC12 | Calculated using R
    IMP | Raw Important Areas for Surface Drinking Water (IMP) value; as developed in Forests to Faucets (USFS 2011), IMPn = PRn * Qn | Calculated using R, updated September 2023
    IMP_R | Important Areas for Surface Drinking Water (0-100 quantiles) | Calculated using R, updated September 2023
    NON_FOREST | Acres of non-forest | PADUS and NLCD
    PRIVATE_FOREST | Acres of private forest | PADUS and NLCD
    PROTECTED_FOREST | Acres of protected forest (State, Local, NGO, Permanent Easement) | PADUS, NCED, and NLCD
    NFS_FOREST | Acres of National Forest System (NFS) forest | PADUS and NLCD
    FEDERAL_FOREST | Acres of other Federal forest (non-NFS Federal) | PADUS and NLCD
    PER_FORPRI | Percent private forest | Calculated using ArcGIS
    PER_FORNFS | Percent NFS forest | Calculated using ArcGIS
    PER_FORPRO | Percent protected (State, Local, NGO, Permanent Easement, NFS, and Federal) forest | Calculated using ArcGIS
    WFP_HI_AC | Acres with High and Very High Wildfire Hazard Potential (WHP) | Dillon, 2018
    PER_WFP | Percent of HUC 12 with High and Very High WHP | Dillon, 2018
    PER_IDRISK | Percent of HUC 12 at risk for mortality (25% of standing live basal area greater than one inch in diameter will die over a 15-year time frame, 2013 to 2027, due to insects and diseases) | Krist et al., 2014
    PERDEV_1040_45 | % land use change 2010-2040 (low) | ICLUS
    PERDEV_1090_45 | % land use change 2010-2090 (low) | ICLUS
    PERDEV_1040_85 | % land use change 2010-2040 (high) | ICLUS
    PERDEV_1090_85 | % land use change 2010-2090 (high) | ICLUS
    PER_Q40_45 | % water yield change 2010-2040 (low) | WASSI, updated September 2023
    PER_Q90_45 | % water yield change 2010-2090 (low) | WASSI, updated September 2023
    PER_Q40_85 | % water yield change 2010-2040 (high) | WASSI, updated September 2023
    PER_Q90_85 | % water yield change 2010-2090 (high) | WASSI, updated September 2023
    WFP | Wildfire threat to important surface drinking water watersheds: (APCW_R * IMP_R * PER_WFP) / 10,000 | Calculated using ArcGIS, updated September 2023
    IDRISK | Insect and disease threat to important surface drinking water watersheds: (APCW_R * IMP_R * PER_IDRISK) / 10,000 | Calculated using ArcGIS, updated September 2023
    DEV1040_45 | Land use change in important surface drinking water watersheds 2010-2040 (low emissions): (APCW_R * IMP_R * PERDEV_1040_45) / 10,000 | Calculated using ArcGIS, updated September 2023
    DEV1090_45 | Land use change in important surface drinking water watersheds 2010-2090 (low emissions): (APCW_R * IMP_R * PERDEV_1090_45) / 10,000 | Calculated using ArcGIS, updated September 2023
    DEV1040_85 | Land use change in important surface drinking water watersheds 2010-2040 (high emissions): (APCW_R * IMP_R * PERDEV_1040_85) / 10,000 | Calculated using ArcGIS, updated September 2023
    DEV1090_85 | Land use change in important surface drinking water watersheds 2010-2090 (high emissions): (APCW_R * IMP_R * PERDEV_1090_85) / 10,000 | Calculated using ArcGIS, updated September 2023
    Q1040_45 | Water yield decrease in important surface drinking water watersheds 2010-2040 (low emissions): -1 * (APCW_R * IMP_R * PER_Q40_45) / 10,000 | Calculated using ArcGIS, updated September 2023
    Q1090_45 | Water yield decrease in important surface drinking water watersheds 2010-2090 (low emissions): -1 * (APCW_R * IMP_R * PER_Q90_45) / 10,000 | Calculated using ArcGIS, updated September 2023
    Q1040_85 | Water yield decrease in important surface drinking water watersheds 2010-2040 (high emissions): -1 * (APCW_R * IMP_R * PER_Q40_85) / 10,000 | Calculated using ArcGIS, updated September 2023
    Q1090_85 | Water yield decrease in important surface drinking water watersheds 2010-2090 (high emissions): -1 * (APCW_R * IMP_R * PER_Q90_85) / 10,000 | Calculated using ArcGIS, updated September 2023
    WFP_IMP_R | Wildfire threat to important surface drinking water watersheds (0-100 quantiles) | Calculated using R, updated September 2023
    IDRISK_R | Insect and disease threat to important surface drinking water watersheds (0-100 quantiles) | Calculated using R, updated September 2023
    DEV40_45_R | Land use change 2010-2040 (low emissions) (0-100 quantiles) | Calculated using R, updated September 2023
    DEV40_85_R | Land use change 2010-2040 (high emissions) (0-100 quantiles) | Calculated using R, updated September 2023
    DEV90_45_R | Land use change 2010-2090 (low emissions) (0-100 quantiles) | Calculated using R, updated September 2023
    DEV90_85_R | Land use change 2010-2090 (high emissions) (0-100 quantiles) | Calculated using R, updated September 2023
    Q40_45_R | Water yield decrease 2010-2040 (low emissions) (0-100 quantiles) | Calculated using R, updated September 2023
    Q40_85_R | Water yield decrease 2010-2040 (high emissions) (0-100 quantiles) | Calculated using R, updated September 2023
    Q90_45_R | Water yield decrease 2010-2090 (low emissions) (0-100 quantiles) | Calculated using R, updated September 2023
    Q90_85_R | Water yield decrease 2010-2090 (high emissions) (0-100 quantiles) | Calculated using R, updated September 2023
    Region | US Forest Service region number | USFS
    Regionname | US Forest Service region name | USFS
    HUC_Num_Diff | Compares the value in column HUC12 (circa 2019 WBD) with the value in HUC_12 (circa 2009 WASSI); -1 = no equivalent WASSI HUC, in which case water yield (Q_YLD_MM) was estimated using the nearest HUC | USFS, updated September 2023
    HUC_12_WASSI | WASSI HUC number | WASSI, updated September 2023

    This record was taken from the USDA Enterprise Data Inventory that feeds into the https://data.gov catalog. Data for this record includes the following resources: ISO-19139 metadata, ArcGIS Hub Dataset, ArcGIS GeoService, CSV, Shapefile, GeoJSON, KML. For complete information, please visit https://data.gov.
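
    The index arithmetic in the table above reduces to a few lines of R; the sketch below uses dummy scores, and the 0-100 quantile rescaling (APCW_R) is shown as a placeholder since it requires the full set of HUC12s.

    # Dummy component scores for a single HUC12
    huc <- data.frame(R_NATCOV = 8, R_AG = 7, R_IMPV = 9, R_RIP = 6,
                      R_Q = 5, IMP_R = 80, PER_WFP = 12)

    # Ability to Produce Clean Water, as defined above
    huc$APCW <- (huc$R_NATCOV + huc$R_AG + huc$R_IMPV + huc$R_RIP) * huc$R_Q

    # APCW_R would be APCW rescaled to 0-100 quantiles across all HUC12s;
    # with one dummy row we simply assume a value
    huc$APCW_R <- 75

    # Wildfire threat screen for important drinking-water watersheds
    huc$WFP <- (huc$APCW_R * huc$IMP_R * huc$PER_WFP) / 10000
    huc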

  10. Examining Policy Impacts on Racial Disparities in Federal Sentencing Across...

    • icpsr.umich.edu
    ascii, delimited, r +3
    Updated Apr 25, 2024
    Cite
    McGilton, Mari (2024). Examining Policy Impacts on Racial Disparities in Federal Sentencing Across Stages and Groups and over Time, [United States], 1998-2021 [Dataset]. http://doi.org/10.3886/ICPSR38647.v1
    Explore at:
    Available download formats: spss, r, sas, delimited, ascii, stata
    Dataset updated
    Apr 25, 2024
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    McGilton, Mari
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/38647/terms

    Area covered
    United States
    Description

    In this secondary analysis, the research team used publicly available federal sentencing data from the United States Sentencing Commission (USSC) to measure racial disparities for multiple race groups and stages of sentencing across time (fiscal years 1999-2021). They sought to answer the following research questions: Do racial disparities vary across 3 stages of federal sentencing and over time? If so, how? During which years do the measured racial disparities have a statistically significant decrease? Which policies likely impacted these decreases the most? What are the commonalities between them? To answer the research questions, the research team measured racial disparities between matched cases across three stages of federal sentencing, represented by two elements each; identified at which points in time the disparities changed significantly, using time series plots and structural break analyses (an illustrative sketch follows below); and used this information to systematically review federal policies and identify which might have contributed to significant decreases in racial disparities. This collection contains 1 analytic dataset (n = 1,281,732) containing 27 key variables for all fiscal years and the code/syntax used to complete the secondary analysis:
    - 5 files to compile and clean the original data and produce matched datasets (3 R, 1 SAS, 1 Stata)
    - 6 files to analyze sentences by race (all R)
    - 4 files to analyze sentences by federal sentencing guideline (all R)
    - 11 files to analyze sentences by circuit court (all R)
    Please refer to the Data Sources metadata field and accompanying documentation for details on obtaining the original data.
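
    As an illustration of the structural-break step described above (the collection ships its own R, SAS, and Stata syntax), the strucchange package can locate mean shifts in a yearly disparity series; the data here are simulated.

    library(strucchange)

    # Simulated yearly disparity measure, fiscal years 1999-2021,
    # with a level shift partway through
    set.seed(1)
    disparity <- ts(c(rnorm(12, mean = 0.30, sd = 0.02),
                      rnorm(11, mean = 0.22, sd = 0.02)), start = 1999)

    bp <- breakpoints(disparity ~ 1)  # locate mean shifts over time
    summary(bp)
    breakdates(bp)                    # years at which the level changes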

  11. libritts-r-mimi

    • huggingface.co
    Updated Dec 31, 2024
    Cite
    Jacob Keisling (2024). libritts-r-mimi [Dataset]. https://huggingface.co/datasets/jkeisling/libritts-r-mimi
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 31, 2024
    Authors
    Jacob Keisling
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LibriTTS-R Mimi encoding

    This dataset converts all audio in the dev.clean, test.clean, train.100, and train.360 splits of the LibriTTS-R dataset from waveforms to tokens in Kyutai's Mimi neural codec. These tokens are intended as targets for DualAR audio models, but they also allow you to simply download all audio in ~50-100x less space if you're comfortable decoding later on with rustymimi or Transformers. This does NOT contain the original audio; please use the regular LibriTTS-R for… See the full description on the dataset page: https://huggingface.co/datasets/jkeisling/libritts-r-mimi.

  12. Clean Cyclistic Data

    • kaggle.com
    Updated Sep 29, 2021
    Cite
    Eric R. (2021). Clean Cyclistic Data [Dataset]. https://www.kaggle.com/ericramoscastillo/clean-cyclistic-data/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 29, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Eric R.
    Description

    Dataset

    This dataset was created by Eric R.

  13. Bitter Creek Analysis Pedigree Data

    • catalog.data.gov
    • s.cnmilf.com
    Updated Sep 25, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Bitter Creek Analysis Pedigree Data [Dataset]. https://catalog.data.gov/dataset/bitter-creek-analysis-pedigree-data
    Explore at:
    Dataset updated
    Sep 25, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    These data sets contain raw and processed data used for the analyses, figures, and tables in the Region 8 Memo: Characterization of chloride and conductivity levels in the Bitter Creek Watershed, WY. However, these data may be used for other analyses alone or in combination with other or new data. These data were used to assess whether chloride levels are naturally high in streams in the Bitter Creek, WY watershed and how chloride concentrations expected to protect 95 percent of aquatic genera in these streams compare to Wyoming's chloride criteria applicable to the Bitter Creek watershed. Owing to the arid conditions, background conductivity and chloride levels were characterized for surface flow and ground water flow conditions. Natural chloride levels were found to be less than current water quality criteria for Wyoming. Although the report was prepared for USEPA Region 8 and OST, Office of Water, it will also be of interest to the WDEQ, Sweetwater County Conservation District, and the regulated community. No formal metadata standard was used.
    Pedigree.xlsx contains:
    1. NOTES: description of the work and the other worksheets.
    2. Pedigree_Summary: source files used to create figures and tables.
    3. DataFiles: data files used in the R code for creating the figures and tables.
    4. R_Script: summary of the R scripts.
    5. DataDictionary: data file titles in all data files.
    Folders:
    - _Datasets: data files uploaded to the Environmental Dataset Gateway.
    - _R: clean R scripts used to generate document figures and tables.
    - _Tables_Figures: files generated from the R scripts and used in the Region 6 memo.
    - R Code and Data: all additional files used for this project, including original files, intermediate files, extra output files, and extra functions. The "_R" folder stores R scripts for input and output files and an R project file. Users can open the R project and run R scripts directly from the "_R" folder or the XC95 folder by installing R, RStudio, and the associated R packages.

  14. Data from: Technical reviews of cleanup and R and D results. Final technical...

    • cloud.csiss.gmu.edu
    • data.wu.ac.at
    html
    Updated Aug 8, 2019
    Cite
    Energy Data Exchange (2019). Technical reviews of cleanup and R and D results. Final technical progress report, March 15, 1982-December 30, 1983 [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/technical-reviews-of-cleanup-and-r-and-d-results-final-technical-progress-report-march-15-1982-
    Explore at:
    Available download formats: html
    Dataset updated
    Aug 8, 2019
    Dataset provided by
    Energy Data Exchange
    Description

    SAI reviewed several reports for METC on hot gas cleanup of flue gas, on flue gas desulfurization methods, and on materials and research programs for heat engines. The work done is listed here without technical discussion. (LTN)

  15. Scripts for cleaning and analysis of data from SOFC experiment on...

    • data.4tu.nl
    zip
    Updated Aug 27, 2024
    Cite
    Scripts for cleaning and analysis of data from SOFC experiment on inclination test-bench. [Dataset]. https://data.4tu.nl/datasets/ed0a0cff-7af9-4d3a-baf7-aab5efe39bd1
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 27, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Berend van Veldhuizen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2023
    Dataset funded by
    European Commission
    Description

    This data set contains the scripts used for importing, trimming, cleaning, analysing, and plotting a large dataset of inclination experiments with an SOFC module. The measurement data is confidential, so it could not be published alongside the scripts. One row of dummy input data is published to illustrate the structure of the analysed data. The analysis is used for the journal paper "Experimental Evaluation of a Solid Oxide Fuel Cell System Exposed to Inclinations and Accelerations by Ship Motions".

    The scripts contain:

    - A script that reads the data, removes unusable data, and transforms it into analysable dataframes (Clean and trim.R)

    - Two files to make a wide variety of plots (Plotting.R and Specificplots.R)

    - A file that does a Gaussian process regression to estimate the degradation rate (Degradation estimation.R)
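
    As a hedged illustration of the Gaussian process regression step (the published analysis is in "Degradation estimation.R"; the data and kernel choice below are assumptions), kernlab's gausspr can smooth a voltage series and yield a degradation rate:

    library(kernlab)

    # Dummy stack-voltage measurements over operating hours
    set.seed(1)
    hours   <- seq(0, 500, by = 25)
    voltage <- 0.80 - 0.00005 * hours + rnorm(length(hours), sd = 0.002)

    fit  <- gausspr(x = as.matrix(hours), y = voltage)  # GP regression, RBF kernel
    pred <- predict(fit, as.matrix(hours))

    # Crude degradation rate: mean slope of the smoothed curve (V per hour)
    mean(diff(as.vector(pred)) / diff(hours))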

  16. Video game pricing analytics dataset

    • kaggle.com
    Updated Sep 1, 2023
    Cite
    Shivi Deveshwar (2023). Video game pricing analytics dataset [Dataset]. https://www.kaggle.com/datasets/shivideveshwar/video-game-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shivi Deveshwar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The review dataset for 3 video games, Call of Duty: Black Ops 3, Persona 5 Royal, and Counter-Strike: Global Offensive, was obtained through a web scrape of SteamDB (https://steamdb.info/), a large repository of game-related data such as release dates, reviews, and prices. In the initial scrape, each individual game has two files: customer reviews (count: 100 reviews) and price time series data.

    To obtain data on the reviews of the selected video games, we performed web scraping using R. The customer reviews dataset contains the date each review was posted and the review text, while the price dataset contains the date each price change took effect and the price on that date. To clean and prepare the data, we first sectioned the data in Excel. After scraping, our csv file fits each review in one row along with the date; we split these into separate date and review columns. Scraping the prices already separated price and date, so after the splitting we just made sure that every file had matching column names.

    Afterwards, we use R to finish the cleaning. Each game has a separate file for prices and reviews, so each price series is converted into a continuous time series by extending the previously available price across each date. Then the price dataset is combined with its respective review dataset in R on the common date column using a left join. The resulting dataset for each game contains four columns: game name, date, review, and price. From there, we allow the user to select the game they would like to view.
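
    The forward-fill-and-join step described above can be sketched with dplyr and tidyr; the column names and dummy values are assumptions, not the published files.

    library(dplyr)
    library(tidyr)

    prices  <- data.frame(date = as.Date(c("2023-01-01", "2023-01-05")),
                          price = c(59.99, 29.99))
    reviews <- data.frame(date = as.Date(c("2023-01-02", "2023-01-06")),
                          review = c("Great game", "Worth it on sale"))

    # Extend each price forward until the next change to get a daily series
    daily <- data.frame(date = seq(min(prices$date), max(reviews$date), by = "day")) %>%
      left_join(prices, by = "date") %>%
      fill(price, .direction = "down")

    # Attach the prevailing price to each review on its posting date
    combined <- left_join(reviews, daily, by = "date")
    combined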

  17. Data and Code (in R) for predicting Financial Distress in Latin America

    • dataverse.harvard.edu
    • dataone.org
    Updated Aug 1, 2023
    Cite
    Flavio Barboza (2023). Data and Code (in R) for predicting Financial Distress in Latin America [Dataset]. http://doi.org/10.7910/DVN/ODLGNJ
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Flavio Barboza
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Latin America
    Description

    Clean Database (21 years) and Code in R for predicting Financial Distress in Latin America.

  18. aggregate-data-italian-cities-from-wikipedia

    • kaggle.com
    Updated May 20, 2020
    Cite
    alepuzio (2020). aggregate-data-italian-cities-from-wikipedia [Dataset]. https://www.kaggle.com/alepuzio/aggregatedataitaliancitiesfromwikipedia/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 20, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    alepuzio
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    This dataset is the result of my study of web scraping English Wikipedia in R and of my tests of regression and classification modelling in R.

    Content

    The content was created by reading the appropriate English Wikipedia articles about Italian cities: I did not run NLP analysis, only read the tables with the data, and I ranked every city from 0 to N in every aspect. About the values, 0 means "*the city is not ranked in this aspect*" and N means "*the city is in first place, in descending order of importance, in this aspect*". If there is no ranking for a particular aspect (for example, only the existence of airports/harbours, with no additional data about traffic or size), then 0 means "*no existence*" and N means "*there are N airports/harbours*". The only non-numeric column is the one with the names of the cities in English form, with some exceptions (for example, "*Bra (CN)*") for simplicity.
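
    A sketch of the kind of scrape-and-rank workflow behind this dataset (the target page, table index, and column handling are assumptions for illustration):

    library(rvest)

    page   <- read_html("https://en.wikipedia.org/wiki/List_of_cities_in_Italy")
    tables <- html_table(page, fill = TRUE)
    cities <- tables[[1]]  # assumed: first table on the page holds the city data

    # Rank a numeric aspect in the 0..N scheme described above:
    # 0 = not ranked (missing), N = first place (largest value)
    rank_aspect <- function(x) ifelse(is.na(x), 0, rank(x, ties.method = "min"))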

    Acknowledgements

    I acknowledge the Wikimedia Foundation for its work, its mission, and for making available the cover image of this dataset (please read the article "The Ideal City (painting)"). I also acknowledge StackOverflow and Cross Validated, the most important hubs of technical knowledge in the world, and all the people on Kaggle for their suggestions.

    Inspiration

    As a beginner in data analysis and modelling (OK, I passed the statistics exam at Politecnico di Milano (Italy), but I have not worked on this topic for more than 10 years and my memory is getting old ^_^), I worked mostly on data cleaning, dataset building, and building the simplest models.

    You can use this dataset to work out which city is a good place to live, or expand it by adding other data from Wikipedia (not only by reading the tables but also by reading the article text and extracting data from the unstructured text).

  19. Weapons_clean Dataset

    • universe.roboflow.com
    zip
    Updated Nov 2, 2021
    Cite
    vorteg.r@gmail.com (2021). Weapons_clean Dataset [Dataset]. https://universe.roboflow.com/vorteg-r-gmail-com/weapons_clean
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 2, 2021
    Dataset provided by
    Gmail (http://gmail.com/)
    Authors
    vorteg.r@gmail.com
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Weapon Bounding Boxes
    Description

    Weapons_clean

    ## Overview
    
    Weapons_clean is a dataset for object detection tasks - it contains Weapon annotations for 4,404 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  20. ‘US Minimum Wage by State from 1968 to 2020’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 12, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘US Minimum Wage by State from 1968 to 2020’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-us-minimum-wage-by-state-from-1968-to-2020-850a/04ae742e/?iid=018-239&v=presentation
    Explore at:
    Dataset updated
    Nov 12, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Analysis of ‘US Minimum Wage by State from 1968 to 2020’ provided by Analyst-2 (analyst-2.ai), based on the source dataset retrieved from https://www.kaggle.com/lislejoem/us-minimum-wage-by-state-from-1968-to-2017 on 12 November 2021.

    --- Dataset description provided by original source is as follows ---

    US Minimum Wage by State from 1968 to 2020

    The Basics

    • What is this? In the United States, states and the federal government set a minimum hourly pay rate ("minimum wage") that workers must receive, to help ensure that citizens experience a minimum quality of life. This dataset provides the minimum wage data set by each state and the federal government from 1968 to 2020.

    • Why did you put this together? While looking online for a clean dataset for minimum wage data by state, I was having trouble finding one. I decided to create one myself and provide it to the community.

    • Who do we thank for this data? The United States Department of Labor compiles a table of this data on their website. I took the time to clean it up and provide it here for you. :) The GitHub repository (with R Code for the cleaning process) can be found here!

    Content

    This is a cleaned dataset of US state and federal minimum wages from 1968 to 2020 (including 2020 equivalency values). The data was scraped from the United States Department of Labor's table of minimum wage by state.

    Description of Data

    The values in the dataset are as follows:
    - Year: The year of the data. All minimum wage values are as of January 1 except 1968 and 1969, which are as of February 1.
    - State: The state or territory of the data.
    - State.Minimum.Wage: The state's actual minimum wage on January 1 of Year.
    - State.Minimum.Wage.2020.Dollars: The State.Minimum.Wage in 2020 dollars.
    - Federal.Minimum.Wage: The federal minimum wage on January 1 of Year.
    - Federal.Minimum.Wage.2020.Dollars: The Federal.Minimum.Wage in 2020 dollars.
    - Effective.Minimum.Wage: The minimum wage that is enforced in State on January 1 of Year. Because the federal minimum wage takes effect if the state's minimum wage is lower than the federal minimum wage, this is the higher of the two. (See the sketch after this list.)
    - Effective.Minimum.Wage.2020.Dollars: The Effective.Minimum.Wage in 2020 dollars.
    - CPI.Average: The average value of the Consumer Price Index in Year. When I pulled the data from the Bureau of Labor Statistics, I selected the dataset with "all items in U.S. city average, all urban consumers, not seasonally adjusted".
    - Department.Of.Labor.Uncleaned.Data: The unclean, scraped value from the Department of Labor's website.
    - Department.Of.Labor.Cleaned.Low.Value: The state's lowest enforced minimum wage on January 1 of Year. If there is only one minimum wage, this and Department.Of.Labor.Cleaned.High.Value are identical. (Some states enforce different minimum wage laws depending on the size of the business; where this is the case, smaller businesses generally have slightly lower minimum wage requirements.)
    - Department.Of.Labor.Cleaned.Low.Value.2020.Dollars: The Department.Of.Labor.Cleaned.Low.Value in 2020 dollars.
    - Department.Of.Labor.Cleaned.High.Value: The state's higher enforced minimum wage on January 1 of Year. If there is only one minimum wage, this and Department.Of.Labor.Cleaned.Low.Value are identical.
    - Department.Of.Labor.Cleaned.High.Value.2020.Dollars: The Department.Of.Labor.Cleaned.High.Value in 2020 dollars.
    - Footnote: The footnote provided on the Department of Labor's website. See more below.
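
    The Effective.Minimum.Wage rule above is just the element-wise maximum of the state and federal columns; a short check with dummy rows (real column names):

    wages <- data.frame(Year = c(2020, 2020),
                        State = c("Georgia", "Washington"),
                        State.Minimum.Wage = c(5.15, 13.50),
                        Federal.Minimum.Wage = c(7.25, 7.25))
    wages$Effective.Minimum.Wage <- pmax(wages$State.Minimum.Wage,
                                         wages$Federal.Minimum.Wage)
    wages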

    Data Footnotes

    As laws differ significantly from territory to territory, especially regarding who is protected by minimum wage laws, the following footnotes appear throughout the data in Footnote to add more context to the minimum wage. The original footnotes can be found here.

    --- Original source retains full ownership of the source dataset ---
