75 datasets found
  1. Identifying Missing Data Handling Methods with Text Mining

    • openicpsr.org
    delimited
    Updated Mar 8, 2023
    Cite
    Krisztián Boros; Zoltán Kmetty (2023). Identifying Missing Data Handling Methods with Text Mining [Dataset]. http://doi.org/10.3886/E185961V1
    Explore at:
    Available download formats: delimited
    Dataset updated
    Mar 8, 2023
    Dataset provided by
    Hungarian Academy of Sciences
    Authors
    Krisztián Boros; Zoltán Kmetty
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1999 - Dec 31, 2016
    Description

    Missing data are an inevitable aspect of empirical research. Researchers have developed several techniques for handling missing data to avoid information loss and bias. Over the past 50 years, these methods have become more efficient but also more complex. Building on previous review studies, this paper analyzes which missing data handling methods are used across scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016. JSTOR provided the data in text format, and we used a text-mining approach to extract the necessary information from the corpus. Our results show that the use of advanced missing data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation grew steadily over the examined period. At the same time, simpler methods such as listwise and pairwise deletion remain in widespread use.
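
The two deletion strategies the paper tracks, listwise and pairwise deletion, can be illustrated in pandas. The frame and its values below are hypothetical, chosen only to show the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with scattered missing values; y == 2 * x where observed.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.0, np.nan, 6.0, 8.0, 10.0],
    "z": [1.0, 1.0, 2.0, np.nan, 3.0],
})

# Listwise deletion: drop any row with at least one missing value.
listwise = df.dropna()
print(len(listwise))  # → 2

# Pairwise deletion: each statistic uses all rows where both variables
# involved are present (the default behaviour of pandas' corr).
corr = df.corr()
print(corr.loc["x", "y"])  # → 1.0 (up to float rounding; y is exactly 2x on shared rows)
```

Listwise deletion discards two usable (x, y) pairs here; pairwise deletion keeps them, which is why the two approaches can give different results on the same data.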

  2. Understanding and Managing Missing Data.pdf

    • figshare.com
    pdf
    Updated Jun 9, 2025
    Cite
    Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Ibrahim Denis Fofanah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
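
The imputation strategies named in the guide (mean, regression, and stochastic-regression imputation) can be sketched as below. The data, column names, and 20% MCAR missingness rate are illustrative assumptions, not taken from the guide:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
age = rng.normal(40, 10, n)
income = 1000 + 50 * age + rng.normal(0, 100, n)
df = pd.DataFrame({"age": age, "income": income})

# Make ~20% of income missing independently of the data (MCAR).
df.loc[rng.random(n) < 0.2, "income"] = np.nan

# 1) Mean imputation: replace with the observed mean (shrinks variance).
mean_imp = df["income"].fillna(df["income"].mean())

# 2) Regression imputation: predict income from age using complete cases.
obs = df.dropna()
slope, intercept = np.polyfit(obs["age"], obs["income"], 1)
pred = intercept + slope * df["age"]
reg_imp = df["income"].fillna(pred)

# 3) Stochastic regression imputation: add residual noise so the
#    filled-in values keep realistic variability.
resid_sd = (obs["income"] - (intercept + slope * obs["age"])).std()
stoch_imp = df["income"].fillna(pred + rng.normal(0, resid_sd, n))

print(mean_imp.var() < df["income"].var())  # → True: mean imputation shrinks spread
```

The final line shows why the guide distinguishes these methods: filling with a constant necessarily understates the variable's spread, while stochastic imputation is designed to preserve it.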

  3. Results of the ML models were obtained by deleting missing values from the dataset

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jan 3, 2024
    Cite
    Aljrees, Turki (2024). Results of the ML models were obtained by deleting missing values from the dataset. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001270751
    Explore at:
    Dataset updated
    Jan 3, 2024
    Authors
    Aljrees, Turki
    Description

    Results of the ML models were obtained by deleting missing values from the dataset.

  4. Removing missing values NFL Play by Play 2009-2017

    • kaggle.com
    zip
    Updated Jul 25, 2025
    Cite
    Abdallah Ahmed A. (2025). Removing missing values NFL Play by Play 2009-2017 [Dataset]. https://www.kaggle.com/datasets/abdallahahmeda/missing-values-nfl-play-by-play-2009-2017
    Explore at:
    Available download formats: zip (135296739 bytes)
    Dataset updated
    Jul 25, 2025
    Authors
    Abdallah Ahmed A.
    Description

    This is my first-ever project on datasets; it was a task assigned to me by my machine learning tutor. I only imputed and removed missing values depending on the context.

    Notes: Down: ffill (logical order)

    Time,TimeSecs,SideofField : FFill

    Playtimediff: Median (has skews)

    yrdln,yrdline100: Mean

    GoalToGo,FirstDown: Mode

    postteam,DefensiveTeam : assign "None" to NA because it's logical for it to be NA.

    Desc: FFill

    ExPointResult,TwoPointConv,DefTwoPoint,PuntResult: Assign another name "None" to every NA

    Passer,Passer_ID: Remove all rows that have either passer or passer_id missing (but not both); for rows where both are missing, change NA to "None"

    PassOutcome,PassLength: Remove all rows that have either passoutcome or passlength missing; then change NA to "None", as the missingness is logical.

    PassLength: setting all NA to None

    Interceptor: assign "None" to NA.

    PassLocation: assign "None" to NA.

    RunLocation,RunGap: use mode for both when RushAttempt.notna() = True, otherwise set to None.

    ReturnResult,Returner,BlockingPlayer,FieldGoalResult,FieldGoalDistance,RecFumbTeam,RecFumbPlayer,ChalReplayResult,PenalizedTeam,PenaltyType,PenalizedPlayer,Timeout_Team : Dropping these columns entirely as they have 90%+ missing values.

    Tackler1,Tackler2: assign "None" to NA.

    DefTeamScore,PosTeamScore,ScoreDiff,AbsScoreDiff: FFill (before and after values are consistently the same unless new match)

    No_Score_Prob,Opp_Field_Goal_Prob,Opp_Safety_Prob,Opp_Touchdown_Prob,Field_Goal_Prob,Safety_Prob,Touchdown_Prob,EPA,Win_Prob: assign "0.0" to missing values

    Away_WP_post,Away_WP_pre,Home_WP_post,Away_WP_post,WPA,airWPA,yacWPA: Mean

    *"None" values are chosen instead of deletion because the missing value is conditional rather than a data-gathering error. They are then to be encoded.
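
A few of the fill rules above can be sketched in pandas on a hypothetical mini-frame (only some of the listed columns, with invented values):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame using a few of the columns named in the notes.
df = pd.DataFrame({
    "Down": [1.0, np.nan, 2.0, np.nan],
    "Playtimediff": [5.0, np.nan, 40.0, 7.0],
    "GoalToGo": [0.0, 0.0, np.nan, 1.0],
    "Interceptor": [np.nan, "J. Smith", np.nan, np.nan],
})

df["Down"] = df["Down"].ffill()  # forward fill: plays follow a logical order
df["Playtimediff"] = df["Playtimediff"].fillna(df["Playtimediff"].median())  # median is robust to skew
df["GoalToGo"] = df["GoalToGo"].fillna(df["GoalToGo"].mode()[0])  # most frequent value
df["Interceptor"] = df["Interceptor"].fillna("None")  # NA is logical, so keep it as a category

print(int(df.isna().sum().sum()))  # → 0
```

Each rule matches a reason stated in the notes: order-dependent columns get `ffill`, skewed numeric columns get the median, categorical flags get the mode, and structurally missing values become an explicit "None" category.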

  5. Data from: Using multiple imputation to estimate missing data in meta-regression

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated Nov 25, 2015
    Cite
    E. Hance Ellington; Guillaume Bastille-Rousseau; Cayla Austin; Kristen N. Landolt; Bruce A. Pond; Erin E. Rees; Nicholas Robar; Dennis L. Murray (2015). Using multiple imputation to estimate missing data in meta-regression [Dataset]. http://doi.org/10.5061/dryad.m2v4m
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 25, 2015
    Dataset provided by
    Trent University
    University of Prince Edward Island
    Authors
    E. Hance Ellington; Guillaume Bastille-Rousseau; Cayla Austin; Kristen N. Landolt; Bruce A. Pond; Erin E. Rees; Nicholas Robar; Dennis L. Murray
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description
    1. There is a growing need for scientific synthesis in ecology and evolution. In many cases, meta-analytic techniques can be used to complement such synthesis. However, missing data are a serious problem for any synthetic effort and can compromise the integrity of meta-analyses in these and other disciplines. Currently, the prevalence of missing data in meta-analytic datasets in ecology and the efficacy of different remedies for this problem have not been adequately quantified.

    2. We generated meta-analytic datasets based on literature reviews of experimental and observational data and found that missing data were prevalent in meta-analytic ecological datasets. We then tested the performance of complete case removal (a widely used method when data are missing) and multiple imputation (an alternative method for data recovery) and assessed model bias, precision, and multi-model rankings under a variety of simulated conditions using published meta-regression datasets.

    3. We found that complete case removal led to biased and imprecise coefficient estimates and yielded poorly specified models. In contrast, multiple imputation provided unbiased parameter estimates with only a small loss in precision. The performance of multiple imputation, however, depended on the type of data missing. It performed best when missing values were weighting variables, but performance was mixed when missing values were predictor variables. Multiple imputation performed poorly when imputing raw data that was then used to calculate effect sizes and the weighting variable.

    4. We conclude that complete case removal should not be used in meta-regression, and that multiple imputation has the potential to be an indispensable tool for meta-regression in ecology and evolution. However, we recommend that users assess the performance of multiple imputation by simulating missing data on a subset of their data before implementing it to recover actual missing data.

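
For intuition, a toy version of multiple imputation via repeated stochastic-regression imputation, with simple pooling of the resulting estimates, can be sketched as follows. This is a generic sketch on synthetic data, not the authors' simulation code; the model, sample size, and missingness rate are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 300, 5                          # sample size, number of imputations
x = rng.normal(0, 1, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)
y_obs = y.copy()
y_obs[rng.random(n) < 0.3] = np.nan    # ~30% of y missing at random

def impute_once(x, y, rng):
    """One stochastic-regression imputation of y given x."""
    ok = ~np.isnan(y)
    slope, intercept = np.polyfit(x[ok], y[ok], 1)
    resid_sd = np.std(y[ok] - (intercept + slope * x[ok]))
    filled = y.copy()
    miss = np.isnan(y)
    filled[miss] = intercept + slope * x[miss] + rng.normal(0, resid_sd, miss.sum())
    return filled

# m imputations give m estimates of the mean of y; the point estimate is
# their average, and the between-imputation variance enters Rubin's rules.
estimates = [impute_once(x, y_obs, rng).mean() for _ in range(m)]
pooled = float(np.mean(estimates))
between_var = float(np.var(estimates, ddof=1))
print(round(pooled, 2), between_var >= 0.0)
```

The key idea the paper relies on is visible here: each imputation is drawn with noise, so the spread across the m completed datasets quantifies the uncertainty that single imputation (or complete case removal) hides.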
  6. ICR - Identifying Age Related Conditions-Filtered

    • kaggle.com
    zip
    Updated May 22, 2023
    Cite
    Onkur7 (2023). ICR - Identifying Age Related Conditions-Filtered [Dataset]. https://www.kaggle.com/datasets/onkur7/icr-identifying-age-related-conditions-filtered
    Explore at:
    Available download formats: zip (1372977 bytes)
    Dataset updated
    May 22, 2023
    Authors
    Onkur7
    Description

    The dataset was created by imputing the missing values of the ICR - Identifying Age Related Conditions competition dataset. Depending on feature selection, some subversions were also created.
    - Version 1: created by dropping all rows with missing values.
    - Version 2: created by dropping the 'BQ' and 'EL' columns, which contain most of the missing values; the rows with the remaining missing values were then deleted.
    - Version 3: created by imputing missing values with the column average; the median is used as the measure of average.
    - Version 4: created by imputing the missing values of 'BQ' and 'EL' with linear regression models; the remaining missing values are imputed with the average of the column in which they occur. 'AB', 'AF', 'AH', 'AM', 'CD', 'CF', 'DN', 'FL' and 'GL' are used to predict the missing values of 'BQ'; 'CU', 'GE' and 'GL' are used to predict the missing values of 'EL'. The models can be found in version4/imputer. Two subversions were created by extracting only the important features of the dataset.
    - Version 5: created by imputing missing values using KNNImputer. Two subversions were created by extracting only the important features.
    For the categorical feature 'EJ', 'A' is encoded as 0 and 'B' is encoded as 1. For more details on how the transformations were done, visit this notebook.
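
Version 5's KNNImputer step can be sketched as follows. A small invented numeric block stands in for the competition data; `n_neighbors=2` is an illustrative choice, not necessarily the one used for the dataset:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric block with scattered NaNs; rows 0, 1, 4 form one
# cluster and rows 2, 3 another, so neighbours are unambiguous.
X = np.array([
    [1.0, 2.0, 3.0],
    [1.1, np.nan, 3.2],
    [9.0, 8.0, np.nan],
    [9.2, 8.1, 7.5],
    [1.05, 2.1, 3.1],
])

# Each NaN is replaced by the average of that feature over the
# 2 nearest rows (nan-aware Euclidean distance).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(int(np.isnan(X_filled).sum()))  # → 0
```

Row 1's missing middle value is filled from its two near-identical neighbours (rows 0 and 4), i.e. the mean of 2.0 and 2.1, which is the behaviour that makes KNN imputation attractive when similar rows exist.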

  7. Data from: Missing data handling methods

    • kaggle.com
    zip
    Updated Jul 6, 2024
    Cite
    Krisztián Boros (2024). Missing data handling methods [Dataset]. https://www.kaggle.com/datasets/krisztinboros/missing-data-handling-methods
    Explore at:
    Available download formats: zip (6274510 bytes)
    Dataset updated
    Jul 6, 2024
    Authors
    Krisztián Boros
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for the paper "Identifying missing data handling methods with text mining".
    It contains the type of missing data handling method used by a given paper.

    Column description

    id: ID of the article
    origin: Source journal
    pub_year: Publication year
    discipline: Discipline category of the article based on origin
    about_missing: Is the article about missing data handling? (0 - no, 1 - yes)
    imputation: Was some kind of imputation technique used in the article? (0 - no, 1 - yes)
    advanced: Was some kind of advanced imputation technique used in the article? (0 - no, 1 - yes)
    deletion: Was some kind of deletion technique used in the article? (0 - no, 1 - yes)
    text_tokens: Snippets extracted from the original articles

  8. R scripts used for Monte Carlo simulations and data analyses

    • plos.figshare.com
    zip
    Updated Jan 19, 2024
    Cite
    Lateef Babatunde Amusa; Twinomurinzi Hossana (2024). R scripts used for Monte Carlo simulations and data analyses. [Dataset]. http://doi.org/10.1371/journal.pone.0297037.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 19, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lateef Babatunde Amusa; Twinomurinzi Hossana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R scripts used for Monte Carlo simulations and data analyses.

  9. Description of the dataset used in this study

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Jan 3, 2024
    Cite
    Aljrees, Turki (2024). Description of the dataset used in this study. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001270720
    Explore at:
    Dataset updated
    Jan 3, 2024
    Authors
    Aljrees, Turki
    Description

    Cervical cancer is a leading cause of women’s mortality, emphasizing the need for early diagnosis and effective treatment. In line with the imperative of early intervention, the automated identification of cervical cancer has emerged as a promising avenue, leveraging machine learning techniques to enhance both the speed and accuracy of diagnosis. However, an inherent challenge in the development of these automated systems is the presence of missing values in the datasets commonly used for cervical cancer detection. Missing data can significantly impact the performance of machine learning models, potentially leading to inaccurate or unreliable results. This study addresses a critical challenge in automated cervical cancer identification: handling missing data in datasets. The study presents a novel approach that combines three machine learning models into a stacked ensemble voting classifier, complemented by the use of a KNN Imputer to manage missing values. The proposed model achieves remarkable results with an accuracy of 0.9941, precision of 0.98, recall of 0.96, and an F1 score of 0.97. The study examines three distinct scenarios: one involving the deletion of missing values, another utilizing KNN imputation, and a third employing PCA for imputing missing values. This research has significant implications for the medical field, offering medical experts a powerful tool for more accurate cervical cancer therapy and enhancing the overall effectiveness of testing procedures. By addressing missing data challenges and achieving high accuracy, this work represents a valuable contribution to cervical cancer detection, ultimately aiming to reduce the impact of this disease on women’s health and healthcare systems.
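
The described pipeline shape, KNN imputation feeding a soft-voting ensemble of three models, can be sketched as below. The abstract does not name the three base models or their parameters, so the choices here (logistic regression, decision tree, random forest) are stand-ins, and the data is synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # label depends on two features
X[rng.random(X.shape) < 0.1] = np.nan     # inject ~10% missing values

# KNN imputation followed by a soft-voting ensemble of three classifiers.
clf = make_pipeline(
    KNNImputer(n_neighbors=5),
    VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ],
        voting="soft",  # average predicted probabilities across models
    ),
)
clf.fit(X, y)
print(clf.score(X, y) > 0.8)  # training accuracy on the imputed data
```

Wrapping the imputer and the ensemble in one pipeline ensures the imputation is learned from the training data and re-applied consistently at prediction time.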

  10. Imputation missing values in the nominal datasets

    • kaggle.com
    zip
    Updated Jan 29, 2023
    Cite
    Awsan thabet salem (2023). Imputation missing values in the nominal datasets [Dataset]. https://www.kaggle.com/datasets/awsanthabetsalem/imputation-in-arabic-dataset/data
    Explore at:
    Available download formats: zip (16588335 bytes)
    Dataset updated
    Jan 29, 2023
    Authors
    Awsan thabet salem
    Description

    The folder contains three datasets: Zomato restaurants, Restaurants on Yellow Pages, and Arabic poetry. All datasets were taken from Kaggle and modified by adding missing values, which are marked with the symbol '?'. The experiments evaluate the process of imputing missing values in nominal data. The missing values in the three datasets are in the range of 10%-80%.
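
Working with these files typically starts by normalizing the '?' marker into real missing values. The sketch below then applies a simple mode imputation as a baseline for nominal data; it is not the ERAR method from the paper, and the frame is invented:

```python
import numpy as np
import pandas as pd

# Hypothetical nominal data where missing values are marked with '?'.
df = pd.DataFrame({
    "cuisine": ["arabic", "?", "indian", "arabic", "?"],
    "city":    ["sanaa", "aden", "?", "sanaa", "sanaa"],
})

df = df.replace("?", np.nan)       # turn the marker into proper NaN
for col in df.columns:             # baseline: fill with the most frequent value
    df[col] = df[col].fillna(df[col].mode()[0])

print(df["cuisine"].tolist())
```

Mode imputation is the nominal-data analogue of mean imputation; association-rule methods such as ERAR aim to beat it by conditioning on the other columns.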

    The Arabic dataset has several modifications, as follows: 1. The columns that contain English values, such as Id, poem_link, and poet_link, were deleted; the reason is the need to evaluate the ERAR method on a purely Arabic dataset. 2. Diacritical marks were added to some records to check their effect during frequent itemset generation. Note: the results of the experiment on the Arabic dataset can be found in the paper titled "Missing values imputation in Arabic datasets using enhanced robust association rules".

  11. Results of the ML models using PCA imputer.

    • plos.figshare.com
    xls
    Updated Jan 3, 2024
    + more versions
    Cite
    Turki Aljrees (2024). Results of the ML models using PCA imputer. [Dataset]. http://doi.org/10.1371/journal.pone.0295632.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 3, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Turki Aljrees
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically


  12. Machine learning models.

    • plos.figshare.com
    xls
    Updated Jan 3, 2024
    Cite
    Turki Aljrees (2024). Machine learning models. [Dataset]. http://doi.org/10.1371/journal.pone.0295632.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 3, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Turki Aljrees
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically


  13. Replication Data for: "The Missing Dimension of the Political Resource Curse Debate" (Comparative Political Studies)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Ranjit, Lall (2023). Replication Data for: "The Missing Dimension of the Political Resource Curse Debate" (Comparative Political Studies) [Dataset]. http://doi.org/10.7910/DVN/UHABC6
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Ranjit, Lall
    Description

    Abstract: Given the methodological sophistication of the debate over the “political resource curse”—the purported negative relationship between natural resource wealth (in particular oil wealth) and democracy—it is surprising that scholars have not paid more attention to the basic statistical issue of how to deal with missing data. This article highlights the problems caused by the most common strategy for analyzing missing data in the political resource curse literature—listwise deletion—and investigates how addressing such problems through the best-practice technique of multiple imputation affects empirical results. I find that multiple imputation causes the results of a number of influential recent studies to converge on a key common finding: A political resource curse does exist, but only since the widespread nationalization of petroleum industries in the 1970s. This striking finding suggests that much of the controversy over the political resource curse has been caused by a neglect of missing-data issues.

  14. Iris_with_missing_data

    • kaggle.com
    zip
    Updated May 7, 2023
    Cite
    Bharath Kumar Kathula (2023). Iris_with_missing_data [Dataset]. https://www.kaggle.com/datasets/bharathkumarkathula/iris-with-missing-data/code
    Explore at:
    Available download formats: zip (1319 bytes)
    Dataset updated
    May 7, 2023
    Authors
    Bharath Kumar Kathula
    Description

    The Iris dataset is taken from Kaggle: https://www.kaggle.com/datasets/uciml/iris. Some random values were deleted to create this CSV file.
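
The masking step described (deleting random values from a clean table) can be sketched as follows. The frame here is random stand-in data with the Iris column names, not the actual measurements, and the 5% rate is an illustrative assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Random stand-in for the four Iris measurement columns (150 rows).
df = pd.DataFrame(
    rng.uniform(0, 8, size=(150, 4)),
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)

# Delete ~5% of the values at random positions.
mask = rng.random(df.shape) < 0.05
df = df.mask(mask)   # True positions become NaN

print(int(df.isna().sum().sum()) == int(mask.sum()))  # → True
```

Masking a clean dataset like this is a standard way to create a benchmark: the deleted values are known, so imputation methods can later be scored against the ground truth.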

  15. ComBat HarmonizR enables the integrated analysis of independently generated proteomic datasets through data harmonization with appropriate handling of missing values

    • ebi.ac.uk
    Updated May 23, 2022
    Cite
    Hannah Voß (2022). ComBat HarmonizR enables the integrated analysis of independently generated proteomic datasets through data harmonization with appropriate handling of missing values [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD027467
    Explore at:
    Dataset updated
    May 23, 2022
    Authors
    Hannah Voß
    Variables measured
    Proteomics
    Description

    The integration of proteomic datasets generated by non-cooperating laboratories using different LC-MS/MS setups can overcome limitations of statistically underpowered sample cohorts, but had not been demonstrated to date. In proteomics, differences in sample preservation and preparation strategies, chromatography and mass spectrometry approaches, and the quantification strategy used distort protein abundance distributions in integrated datasets. The removal of these technical batch effects requires setup-specific normalization and strategies that can deal with missing at random (MAR) and missing not at random (MNAR) values at the same time. Algorithms for batch effect removal, such as the ComBat algorithm commonly used for other omics types, disregard proteins with MNAR missing values and significantly reduce the informational yield and the effect size for combined datasets. Here, we present a strategy for data harmonization across different tissue preservation techniques, LC-MS/MS instrumentation setups, and quantification approaches. To enable batch effect removal without the need for data reduction or error-prone imputation, we developed an extension to the ComBat algorithm, ComBat HarmonizR, that performs data harmonization with appropriate handling of MAR and MNAR missing values by matrix dissection. The ComBat HarmonizR based strategy enables the combined analysis of independently generated proteomic datasets for the first time. Furthermore, we found ComBat HarmonizR to be superior for removing batch effects between different Tandem Mass Tag (TMT)-plexes, compared to the commonly used internal reference scaling (iRS). Due to the matrix dissection approach, which needs no data imputation, the HarmonizR algorithm can be applied to any type of omics data while assuring minimal data loss.
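
For intuition only, the sketch below applies a per-batch location/scale adjustment to synthetic data. This is a far simpler stand-in for ComBat-style harmonization, which additionally shrinks batch parameters via empirical Bayes, and it does not reproduce HarmonizR's matrix-dissection handling of missing values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical batches measuring the same 4 proteins; batch B has a
# shifted location and an inflated scale (a technical batch effect).
batch_a = rng.normal(10.0, 1.0, size=(20, 4))
batch_b = rng.normal(13.0, 2.0, size=(20, 4))

def center_scale(batch):
    """Per-feature location/scale adjustment within one batch."""
    return (batch - batch.mean(axis=0)) / batch.std(axis=0, ddof=1)

# After within-batch standardization the two batches are comparable.
combined = np.vstack([center_scale(batch_a), center_scale(batch_b)])
print(bool(np.allclose(combined.mean(axis=0), 0.0, atol=1e-9)))  # → True
```

The point of the illustration is the failure mode the abstract describes: this naive adjustment needs every value present in both batches, whereas proteomic data with MNAR missingness is exactly where plain ComBat-style tools drop proteins and HarmonizR's dissection approach is needed.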

  16. First Entries Into Foster Care Reason For Removal

    • data.wu.ac.at
    csv, json, xml
    Updated Jun 3, 2015
    Cite
    kidsdata.org, a program of the Lucile Packard Foundation for Children's Health (2015). First Entries Into Foster Care Reason For Removal [Dataset]. https://data.wu.ac.at/schema/performance_smcgov_org/amhzcy1pdXo1
    Explore at:
    Available download formats: json, csv, xml
    Dataset updated
    Jun 3, 2015
    Dataset provided by
    Lucile Packard Foundation for Children's Health
    Description

    Percentage of first entries into foster care for children under age 18, by removal reason (e.g., 8.5% of children entering foster care for the first time in California in 2011-2013 were removed from their families due to physical abuse). First entries into foster care are unduplicated counts of children under the supervision of county welfare departments and exclude cases under the supervision of county probation departments, out-of-state agencies, state adoptions district offices, and Indian child welfare departments. Counts are based on the first out-of-home placement of eight days or more, even if it was not the first actual placement. 'Other' includes removals due to exploitation, child’s disability or handicap, and other reasons. LNE (Low Number Event) refers to data that have been suppressed because there were fewer than 80 total children with first entries. N/A means that data are not available. The sum of all reasons for removal percentages may not add up to 100% due to missing values. Data Source: Needell, B., et al. (May 2014). Child Welfare Services Reports for California, U.C. Berkeley Center for Social Services Research. Retrieved on May 31, 2015.

  17. CRAVE Confirmatory Factor Analysis w/ summed PW move and rest, and RN move and rest scores (Study 2)

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Dec 2, 2020
    Cite
    Matthew Stults-Kolehmainen, FACSM (2020). CRAVE Confirmatory Factor Analysis w/ summed PW move and rest, and RN move and rest scores. CLEANED.(Study 2).PCM12/02/2020 [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000523244
    Explore at:
    Dataset updated
    Dec 2, 2020
    Authors
    Matthew Stults-Kolehmainen, FACSM
    Description

    CRAVE Confirmatory Factor Analysis w/ summed PW move and rest, and RN move and rest scores. CLEANED. (Study 2). All erroneous values and 999's (previously coded for missing values) have been deleted, leaving the value blank. PCM 12/02/2020, Paul C. McKee

  18. Bathymetry (Alaska and surrounding waters)

    • gimi9.com
    Updated Nov 19, 2017
    + more versions
    Cite
    (2017). Bathymetry (Alaska and surrounding waters) | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_bathymetry-alaska-and-surrounding-waters2/
    Explore at:
    Dataset updated
    Nov 19, 2017
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Alaska
    Description

    A raster with 20 m resolution and decimal values was assembled from 18.6 billion bathymetric soundings obtained from the National Centers for Environmental Information (NCEI), https://www.ncei.noaa.gov. The soundings extend from the Kuril-Kamchatka Trench in the Bering Sea along the Aleutian Trench to the Gulf of Alaska, and in the Arctic Ocean from Prince Patrick Island to the International Date Line. Soundings were scrutinized for accuracy using statistical analysis and visual inspection, with some imputation. Editing processes included deleting erroneous and superseded values, digitizing missing values, and referencing all datasets to a common, modern datum.

  19. Overwatch 2 statistics

    • kaggle.com
    zip
    Updated Jun 27, 2023
    Cite
    Mykhailo Kachan (2023). Overwatch 2 statistics [Dataset]. https://www.kaggle.com/datasets/mykhailokachan/overwatch-2-statistics/code
    Explore at:
    Available download formats: zip (67546 bytes)
    Dataset updated
    Jun 27, 2023
    Authors
    Mykhailo Kachan
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is built on data from Overbuff, scraped with Python and Selenium. Development environment: Jupyter Notebook.

    The tables contain the data for competitive seasons 1-4 and for quick play for each hero and rank along with the standard statistics (common to each hero as well as information belonging to a specific hero).

    Note: data for some columns are missing on the Overbuff site (there is '—' instead of a specific value), so they were dropped: Scoped Crits for Ashe and Widowmaker, Rip Tire Kills for Junkrat, Minefield Kills for Wrecking Ball. The 'Self Healing' column for Bastion was dropped too, as Bastion no longer has this property in OW2. Also, there are no values for "Javelin Spin Kills / 10min" for Orisa in season 1 (the column was dropped). Overall, all missing values were cleaned.

    Attention: Overbuff doesn't contain info about OW 1 competitive seasons (when you change a skill tier, the data doesn't change). If you know a site where it's possible to get this data, please leave a comment. Thank you!

    The code is on GitHub.

    The whole procedure is done in 5 stages:

    Stage 1:

    Data is retrieved directly from HTML elements on the page with Selenium in Python.

    Stage 2:

    After scraping, the data was cleansed: 1) comma thousands separators were deleted (e.g. 1,009 => 1009); 2) time representations (e.g. '01:23') were translated to seconds (1*60 + 23 => 83); 3) accented hero names were normalized: Lúcio became Lucio, Torbjörn became Torbjorn.
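The Stage 2 cleanup rules can be sketched as small Python helpers. This is a hypothetical sketch of the transformations described above; the actual notebook's function names and structure may differ.

```python
import unicodedata

def strip_thousands(value: str) -> str:
    """Remove comma thousands separators, e.g. '1,009' -> '1009'."""
    return value.replace(",", "")

def time_to_seconds(value: str) -> int:
    """Translate an 'MM:SS' time string to seconds, e.g. '01:23' -> 83."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)

def ascii_hero_name(name: str) -> str:
    """Normalize accented hero names, e.g. 'Lúcio' -> 'Lucio'."""
    # NFKD decomposition splits accented letters into a base letter
    # plus combining marks; dropping the marks leaves plain ASCII letters.
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Example: `time_to_seconds("01:23")` returns 83, and `ascii_hero_name("Torbjörn")` returns "Torbjorn".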

    Stage 3:

    Data were arranged into a table and saved to CSV.

    Stage 4:

    Columns that are supposed to contain only numeric values are checked, and all non-numeric values are dropped. This stage finds the missing values that contain '—' instead of a number and deletes them.
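The Stage 4 numeric check can be sketched with pandas. This is a minimal sketch, assuming the scraped table is a DataFrame with an em-dash placeholder ('—') where Overbuff had no value; the column name here is illustrative, not taken from the dataset's code.

```python
import pandas as pd

# Toy frame standing in for one scraped table; '—' marks a missing value.
df = pd.DataFrame({
    "Hero": ["Ana", "Ashe"],
    "Scoped Crits / 10min": ["12", "—"],
})

# Coerce the supposedly numeric column: non-numeric cells such as '—'
# become NaN instead of raising an error.
col = "Scoped Crits / 10min"
df[col] = pd.to_numeric(df[col], errors="coerce")

# Then drop the rows (or, as in the dataset, whole columns) that failed.
cleaned = df.dropna(subset=[col])
```

After this step, `cleaned` contains only the Ana row, since Ashe's '—' was coerced to NaN and dropped.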

    Stage 5:

    Additional missing values are searched for and dealt with: either a column is renamed (when the program cannot infer the correct column name for missing values) or the column is dropped. This stage ensures all bad data is truly fixed.

    The procedure to fetch the data takes 7 minutes on average.

    This project and code were born from this GitHub code.

  20. Data: Prescribed fire enhances seed removal by ants in a Neotropical savanna...

    • datadryad.org
    • search.dataone.org
    • +1more
    zip
    Updated Nov 6, 2021
    Mirela Alcolea; Giselda Durigan; Alexander Christianini (2021). Data: Prescribed fire enhances seed removal by ants in a Neotropical savanna [Dataset]. http://doi.org/10.5061/dryad.tqjq2bw0t
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 6, 2021
    Dataset provided by
    Dryad
    Authors
    Mirela Alcolea; Giselda Durigan; Alexander Christianini
    Time period covered
    Oct 27, 2021
    Description

    There are no missing values. Please see the Readme.txt file for more details about the variables.
