Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data is an inevitable aspect of empirical research. Researchers have developed several techniques to handle missing data in order to avoid information loss and bias. Over the past 50 years, these methods have become more efficient but also more complex. Building on previous review studies, this paper analyzes which missing data handling methods are used across various scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016, provided by JSTOR in text format, and applied a text-mining approach to extract the necessary information from the corpus. Our results show that the use of advanced missing data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation grew steadily over the examination period. At the same time, simpler methods, like listwise and pairwise deletion, remain in widespread use.
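A text-mining pass of this kind can be approximated with simple keyword matching over article text. The sketch below is a hypothetical illustration; the method names and regular expressions are assumptions, not the paper's actual dictionaries:

```python
import re

# Hypothetical keyword patterns for a few missing data handling methods;
# the paper's real extraction rules are more elaborate than this.
METHOD_PATTERNS = {
    "listwise_deletion": r"listwise deletion|complete[- ]case analysis",
    "pairwise_deletion": r"pairwise deletion",
    "multiple_imputation": r"multiple imputation|multiply imputed",
    "fiml": r"full information maximum likelihood",
}

def detect_methods(text):
    """Return the set of missing-data methods mentioned in an article's text."""
    lowered = text.lower()
    return {name for name, pattern in METHOD_PATTERNS.items()
            if re.search(pattern, lowered)}

snippet = ("Missing values were handled with multiple imputation; "
           "results were compared against listwise deletion.")
```

Running `detect_methods` over each article's tokens and aggregating by publication year would yield the kind of per-discipline trend counts the paper reports.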
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
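The imputation strategies listed above can be contrasted in a few lines. The following Python sketch uses a toy DataFrame (the values are assumptions for illustration) to show mean imputation, regression imputation, and its stochastic variant:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy dataset with one missing value in y; x is fully observed.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [2.1, 3.9, np.nan, 8.1]})

# Mean imputation: replace NaN with the observed mean of the column.
mean_imputed = df["y"].fillna(df["y"].mean())

# Regression imputation: predict the missing y from x using the observed rows.
obs = df.dropna()
model = LinearRegression().fit(obs[["x"]], obs["y"])
reg_value = model.predict(df.loc[df["y"].isna(), ["x"]])[0]

# Stochastic regression imputation adds residual noise to the prediction,
# which preserves variance (fixed seed for reproducibility).
rng = np.random.default_rng(0)
residual_sd = np.std(obs["y"] - model.predict(obs[["x"]]))
stoch_value = reg_value + rng.normal(0.0, residual_sd)
```

Mean imputation shrinks variance and ignores relationships between variables; the regression variants exploit the x-y relationship, which is why the guide treats them separately.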
Results of the ML models were obtained by deleting missing values from the dataset.
This is my first-ever dataset project; it was a task assigned to me by my machine learning tutor. I imputed or removed missing values depending on the context.
Notes:
Down: ffill (logical order)
Time, TimeSecs, SideofField: ffill
Playtimediff: median (the distribution is skewed)
yrdln, yrdline100: mean
GoalToGo, FirstDown: mode
posteam, DefensiveTeam: assign "None" to NA, because it is logical for these to be NA.
Desc: ffill
ExPointResult, TwoPointConv, DefTwoPoint, PuntResult: assign the label "None" to every NA.
Passer, Passer_ID: remove all rows where either Passer or Passer_ID is missing but not both; then change rows where both are NA to "None".
PassOutcome, PassLength: remove all rows where either PassOutcome or PassLength is missing; then change the remaining NAs to "None", as the missingness is logical.
PassLength: set all NAs to "None".
Interceptor: assign "None" to NA.
PassLocation: assign "None" to NA.
RunLocation, RunGap: use the mode for both where RushAttempt is not NA; otherwise set to "None".
ReturnResult, Returner, BlockingPlayer, FieldGoalResult, FieldGoalDistance, RecFumbTeam, RecFumbPlayer, ChalReplayResult, PenalizedTeam, PenaltyType, PenalizedPlayer, Timeout_Team: drop these columns entirely, as they have 90%+ missing values.
Tackler1, Tackler2: assign "None" to NA.
DefTeamScore, PosTeamScore, ScoreDiff, AbsScoreDiff: ffill (preceding and following values are consistently the same unless a new match starts).
No_Score_Prob, Opp_Field_Goal_Prob, Opp_Safety_Prob, Opp_Touchdown_Prob, Field_Goal_Prob, Safety_Prob, Touchdown_Prob, EPA, Win_Prob: assign 0.0 to missing values.
Away_WP_post, Away_WP_pre, Home_WP_post, Home_WP_pre, WPA, airWPA, yacWPA: mean.
*"None" values are chosen instead of deletion because the missingness is conditional rather than a data-gathering error; these values are then encoded.
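A few of the per-column rules above can be sketched in pandas. The toy DataFrame below is an assumption standing in for a handful of the listed columns, just to show the fill strategies side by side:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the play-by-play data (column names from the notes).
df = pd.DataFrame({
    "Time": ["15:00", np.nan, "14:20"],
    "Playtimediff": [5.0, np.nan, 11.0],
    "GoalToGo": [0.0, 0.0, np.nan],
    "Interceptor": [np.nan, "J. Smith", np.nan],
    "PenaltyType": [np.nan, np.nan, "Holding"],  # mostly missing -> dropped
})

df["Time"] = df["Time"].ffill()                              # logical order: forward fill
df["Playtimediff"] = df["Playtimediff"].fillna(df["Playtimediff"].median())  # skewed -> median
df["GoalToGo"] = df["GoalToGo"].fillna(df["GoalToGo"].mode()[0])             # categorical -> mode
df["Interceptor"] = df["Interceptor"].fillna("None")         # NA is logical -> label it
df = df.drop(columns=["PenaltyType"])                        # 90%+ missing -> drop
```

The "None" labels survive as an ordinary category, so they can be one-hot or label encoded later, exactly as the footnote intends.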
https://spdx.org/licenses/CC0-1.0.html
The dataset is created by imputing the missing values of the ICR - Identifying Age-Related Conditions competition dataset. Depending on feature selection, some subversions are also created. - Version 1: created by dropping all rows with missing values. - Version 2: created by dropping the 'BQ' and 'EL' columns, which contain most of the missing values; the rows that still contain missing values are then deleted. - Version 3: created by imputing each missing value with the column average, where the median is used as the measure of average. - Version 4: created by imputing the missing values of 'BQ' and 'EL' with linear regression models; the remaining missing values are imputed with the average of the column in which they occur. 'AB', 'AF', 'AH', 'AM', 'CD', 'CF', 'DN', 'FL' and 'GL' are used to calculate the missing values of 'BQ'; 'CU', 'GE' and 'GL' are used to calculate the missing values of 'EL'. The models can be found in version4/imputer. Two subversions are created by extracting only the important features of the dataset. - Version 5: created by imputing missing values using KNNImputer. Two subversions are created by extracting only the important features. For the categorical feature 'EJ', 'A' is encoded as 0 and 'B' is encoded as 1. For more details on how the transformations were done, visit this notebook.
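Version 5's KNN imputation can be sketched as follows. The toy frame below is an assumption standing in for the competition's anonymized feature columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric frame standing in for the competition features; the real
# dataset's columns ('BQ', 'EL', ...) are anonymized health measurements.
df = pd.DataFrame({"BQ": [1.0, np.nan, 3.0, 4.0],
                   "EL": [10.0, 12.0, np.nan, 16.0],
                   "GL": [0.5, 0.6, 0.7, 0.8]})

# Version 5 style: fill each missing value from the k most similar rows,
# with similarity measured on the observed features (NaN-aware distance).
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Observed values pass through unchanged; only the NaN cells are replaced.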
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for the paper "Identifying missing data handling methods with text mining".
It contains the type of missing data handling method used by a given paper.
id: ID of the article
origin: Source journal
pub_year: Publication year
discipline: Discipline category of the article based on origin
about_missing: Is the article about missing data handling? (0 - no, 1 - yes)
imputation: Was some kind of imputation technique used in the article? (0 - no, 1 - yes)
advanced: Was some kind of advanced imputation technique used in the article? (0 - no, 1 - yes)
deletion: Was some kind of deletion technique used in the article? (0 - no, 1 - yes)
text_tokens: Snippets extracted from the original articles
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R scripts used for Monte Carlo simulations and data analyses.
Cervical cancer is a leading cause of women’s mortality, emphasizing the need for early diagnosis and effective treatment. In line with the imperative of early intervention, the automated identification of cervical cancer has emerged as a promising avenue, leveraging machine learning techniques to enhance both the speed and accuracy of diagnosis. However, an inherent challenge in the development of these automated systems is the presence of missing values in the datasets commonly used for cervical cancer detection. Missing data can significantly impact the performance of machine learning models, potentially leading to inaccurate or unreliable results. This study addresses that critical challenge. It presents a novel approach that combines three machine learning models into a stacked ensemble voting classifier, complemented by a KNN Imputer to manage missing values. The proposed model achieves an accuracy of 0.9941, precision of 0.98, recall of 0.96, and an F1 score of 0.97. The study examines three distinct scenarios: one involving the deletion of missing values, another utilizing KNN imputation, and a third employing PCA for imputing missing values. This research has significant implications for the medical field, offering medical experts a powerful tool for more accurate cervical cancer therapy and enhancing the overall effectiveness of testing procedures. By addressing missing data challenges and achieving high accuracy, this work represents a valuable contribution to cervical cancer detection, ultimately aiming to reduce the impact of this disease on women’s health and healthcare systems.
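A minimal sketch of the described setup in scikit-learn, using synthetic data in place of the clinical dataset. The choice of the three base models is an assumption (the abstract does not name them), and no result here should be read as reproducing the reported scores:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cervical cancer data, with some values masked.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # roughly 10% missing at random

# KNN imputation followed by a soft-voting ensemble of three classifiers.
clf = make_pipeline(
    KNNImputer(n_neighbors=5),
    VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("dt", DecisionTreeClassifier(random_state=0)),
                    ("rf", RandomForestClassifier(random_state=0))],
        voting="soft"),
)
clf.fit(X, y)
score = clf.score(X, y)  # training accuracy on the synthetic data
```

Putting the imputer inside the pipeline ensures it is fit only on training folds during cross-validation, avoiding leakage from the evaluation data.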
The folder contains three datasets: Zomato restaurants, Restaurants on Yellow Pages, and Arabic poetry. All datasets were taken from Kaggle and modified by adding missing values, which are marked with the symbol '?'. The experiment examines the imputation of missing values in nominal attributes. The missing-value rates in the three datasets range from 10% to 80%.
The Arabic dataset has several modifications: 1. The columns containing English values, such as Id, poem_link, and poet_link, were deleted, because the ERAR method needed to be evaluated on Arabic data only. 2. Diacritical marks were added to some records to check their effect during frequent itemset generation. Note: the results of the experiment on the Arabic dataset can be found in the paper titled "Missing values imputation in Arabic datasets using enhanced robust association rules".
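Loading these files only requires mapping the '?' marker to NaN at read time; a small pandas sketch with made-up restaurant rows (the column names are illustrative):

```python
import io
import pandas as pd

# The modified datasets mark missing nominal values with '?';
# pandas can map that symbol to NaN while parsing.
csv = io.StringIO("name,cuisine,city\nA,?,Riyadh\nB,Lebanese,?\n")
df = pd.read_csv(csv, na_values="?")

# Fraction of missing cells across the whole table.
missing_rate = df.isna().mean().mean()
```

From here, any nominal-value imputer (including association-rule methods such as ERAR) can work on the NaN cells directly.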
Abstract: Given the methodological sophistication of the debate over the “political resource curse”—the purported negative relationship between natural resource wealth (in particular oil wealth) and democracy—it is surprising that scholars have not paid more attention to the basic statistical issue of how to deal with missing data. This article highlights the problems caused by the most common strategy for analyzing missing data in the political resource curse literature—listwise deletion—and investigates how addressing such problems through the best-practice technique of multiple imputation affects empirical results. I find that multiple imputation causes the results of a number of influential recent studies to converge on a key common finding: A political resource curse does exist, but only since the widespread nationalization of petroleum industries in the 1970s. This striking finding suggests that much of the controversy over the political resource curse has been caused by a neglect of missing-data issues.
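The contrast between listwise deletion and multiple imputation can be sketched on simulated data. The sketch below uses scikit-learn's IterativeImputer with posterior sampling as a stand-in for the article's multiple imputation procedure; variable names and data are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy panel: a covariate ("oil") with ~30% of values missing at random,
# and an outcome ("democracy") with a known negative relationship.
rng = np.random.default_rng(1)
oil = rng.normal(size=200)
democracy = 0.5 - 0.4 * oil + rng.normal(scale=0.5, size=200)
oil_obs = oil.copy()
oil_obs[rng.random(200) < 0.3] = np.nan
df = pd.DataFrame({"oil": oil_obs, "democracy": democracy})

# Listwise deletion: drop every row with any missing value.
listwise = df.dropna()

# Multiple imputation (sketch): draw several completed datasets with
# posterior sampling, estimate the slope in each, and pool the estimates.
slopes = []
for m in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    slopes.append(np.polyfit(completed["oil"], completed["democracy"], 1)[0])
pooled_slope = float(np.mean(slopes))
```

Pooling estimates across completed datasets (rather than imputing once) is what separates multiple imputation from single imputation; a full analysis would also pool the variances via Rubin's rules.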
The Iris dataset is taken from Kaggle: https://www.kaggle.com/datasets/uciml/iris. Some random values were deleted to create the CSV file.
The integration of proteomic datasets generated by non-cooperating laboratories using different LC-MS/MS setups can overcome limitations of statistically underpowered sample cohorts but has not been demonstrated to this day. In proteomics, differences in sample preservation and preparation strategies, chromatography and mass spectrometry approaches, and the quantification strategy used distort protein abundance distributions in integrated datasets. The removal of these technical batch effects requires setup-specific normalization and strategies that can deal with missing at random (MAR) and missing not at random (MNAR) values at the same time. Algorithms for batch effect removal, such as the ComBat algorithm commonly used for other omics types, disregard proteins with MNAR missing values and significantly reduce the informational yield and the effect size for combined datasets. Here, we present a strategy for data harmonization across different tissue preservation techniques, LC-MS/MS instrumentation setups, and quantification approaches. To enable batch effect removal without the need for data reduction or error-prone imputation, we developed an extension to the ComBat algorithm, ComBat HarmonizR, that performs data harmonization with appropriate handling of MAR and MNAR missing values by matrix dissection. The ComBat HarmonizR based strategy enables the combined analysis of independently generated proteomic datasets for the first time. Furthermore, we found ComBat HarmonizR to be superior for removing batch effects between different Tandem Mass Tag (TMT) plexes, compared to the commonly used internal reference scaling (iRS). Because the matrix dissection approach requires no data imputation, the HarmonizR algorithm can be applied to any type of -omics data while assuring minimal data loss.
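As a much-simplified analogue of batch effect removal that leaves missing values in place (this is not the ComBat/HarmonizR matrix-dissection algorithm itself), one can center each protein within each batch using only the observed values:

```python
import numpy as np
import pandas as pd

# Toy abundance table: one protein measured across two batches, with an
# MNAR-style gap in batch B. Values and batch labels are made up.
data = pd.DataFrame({"batch": ["A", "A", "B", "B"],
                     "protein1": [10.0, 11.0, 20.0, np.nan]})

# Per-batch centering: subtract each batch's observed mean; NaNs are
# skipped when computing the mean and remain NaN afterwards, so no
# protein is discarded and nothing is imputed.
centered = data.groupby("batch")["protein1"].transform(lambda s: s - s.mean())
```

ComBat-style methods go further (empirical Bayes shrinkage of batch means and variances), and HarmonizR's contribution is dissecting the matrix so such corrections can run despite the missing cells; the NaN-preserving behavior shown here is the property that dissection protects.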
Percentage of first entries into foster care for children under age 18, by removal reason (e.g., 8.5% of children entering foster care for the first time in California in 2011-2013 were removed from their families due to physical abuse). First entries into foster care are unduplicated counts of children under the supervision of county welfare departments and exclude cases under the supervision of county probation departments, out-of-state agencies, state adoptions district offices, and Indian child welfare departments. Counts are based on the first out-of-home placement of eight days or more, even if it was not the first actual placement. 'Other' includes removals due to exploitation, child’s disability or handicap, and other reasons. LNE (Low Number Event) refers to data that have been suppressed because there were fewer than 80 total children with first entries. N/A means that data are not available. The sum of all reasons for removal percentages may not add up to 100% due to missing values. Data Source: Needell, B., et al. (May 2014). Child Welfare Services Reports for California, U.C. Berkeley Center for Social Services Research. Retrieved on May 31, 2015.
CRAVE Confirmatory Factor Analysis with summed PW move and rest, and RN move and rest scores. CLEANED (Study 2). All erroneous values and 999s (previously coded for missing values) have been deleted, leaving the value blank. PCM 12/02/2020, Paul C. McKee
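Recoding a 999 sentinel to a blank (NaN) cell is a one-liner in pandas; the column names below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy scores with 999 used as the missing-value sentinel, as in the note above.
scores = pd.DataFrame({"pw_move": [12, 999, 8],
                       "rn_rest": [999, 5, 7]})

# Replace the sentinel everywhere with NaN so downstream CFA software
# treats those cells as missing rather than as extreme values.
scores = scores.replace(999, np.nan)
```

Leaving 999s in place would badly distort means, variances, and any fitted factor model, which is why they are recoded before analysis.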
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A raster with 20 m resolution and decimal values was assembled from 18.6 billion bathymetric soundings obtained from the National Centers for Environmental Information (NCEI), https://www.ncei.noaa.gov. The bathymetric soundings extend from the Kuril-Kamchatka Trench in the Bering Sea along the Aleutian Trench to the Gulf of Alaska, and in the Arctic Ocean from Prince Patrick Island to the International Date Line. The soundings were scrutinized for accuracy using statistical analysis and visual inspection, with some imputation. Editing processes included deleting erroneous and superseded values, digitizing missing values, and referencing all data sets to a common, modern datum.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset is built on data from Overbuff with the help of Python and Selenium. Development environment: Jupyter Notebook.
The tables contain data for competitive seasons 1-4 and for quick play, for each hero and rank, along with the standard statistics (those common to every hero as well as those specific to a particular hero).
Note: data for some columns are missing on the Overbuff site (there is '—' instead of a specific value), so they were dropped: Scoped Crits for Ashe and Widowmaker, Rip Tire Kills for Junkrat, and Minefield Kills for Wrecking Ball. The 'Self Healing' column for Bastion was dropped too, as Bastion no longer has this property in OW2. Also, there are no values for "Javelin Spin Kills / 10min" for Orisa in season 1 (the column was dropped). In the end, all missing values were removed.
Attention: Overbuff doesn't contain info about OW 1 competitive seasons (when you change the skill tier, the displayed data doesn't change). If you know a site where it's possible to get this data, please leave a comment. Thank you!
The code is available on GitHub.
The whole procedure is done in 5 stages:
Data is retrieved directly from the HTML elements on the page with the Selenium tool in Python.
After scraping, the data is cleansed: 1) the comma thousands separator is deleted (e.g. 1,009 => 1009); 2) time representations (e.g. '01:23') are translated to seconds (1*60 + 23 => 83); 3) Lúcio becomes Lucio, and Torbjörn becomes Torbjorn.
The data is arranged into a table and saved to CSV.
Columns that are supposed to contain only numeric values are checked, and all non-numeric values are dropped. This stage helps to find missing values, which contain '—' instead, and delete them.
Additional missing values are searched for and dealt with, either by renaming a column (when the program cannot infer the correct column name for missing values) or by dropping it. This stage ensures all wrong data are truly fixed.
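The cleaning stages above can be sketched on a toy scraped table; the column names and values below are illustrative, not the actual Overbuff schema:

```python
import numpy as np
import pandas as pd

# Toy scraped table: thousands separators, mm:ss times, and Overbuff's
# '—' marker standing in for a missing value.
raw = pd.DataFrame({"Eliminations": ["1,009", "87", "—"],
                    "Time on Fire": ["01:23", "00:45", "—"]})

def to_seconds(value):
    """'01:23' -> 83; '—' (Overbuff's missing marker) -> NaN."""
    if value == "—":
        return np.nan
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)

clean = pd.DataFrame({
    # Strip the comma separator, then coerce; '—' becomes NaN.
    "Eliminations": pd.to_numeric(raw["Eliminations"].str.replace(",", ""),
                                  errors="coerce"),
    "Time on Fire": raw["Time on Fire"].map(to_seconds),
})
```

Rows or columns left with NaN after this pass are exactly the "missing values which contain '—'" that the later stages drop.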
The procedure to fetch the data takes 7 minutes on average.
This project and code were born from this GitHub code.
There are no missing values. Please see the Readme.txt file for more details about the variables.