100+ datasets found

o
Identifying Missing Data Handling Methods with Text Mining
openicpsr.org
delimited
Updated Mar 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Krisztián Boros; Zoltán Kmetty (2023). Identifying Missing Data Handling Methods with Text Mining [Dataset]. http://doi.org/10.3886/E185961V1
Explore at:
delimitedAvailable download formats
Unique identifier
https://doi.org/10.3886/E185961V1
Dataset updated
Mar 8, 2023
Dataset provided by
Hungarian Academy of Sciences
Authors
Krisztián Boros; Zoltán Kmetty
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1999 - Dec 31, 2016
Description
Missing data is an inevitable aspect of every empirical research. Researchers developed several techniques to handle missing data to avoid information loss and biases. Over the past 50 years, these methods have become more and more efficient and also more complex. Building on previous review studies, this paper aims to analyze what kind of missing data handling methods are used among various scientific disciplines. For the analysis, we used nearly 50.000 scientific articles that were published between 1999 and 2016. JSTOR provided the data in text format. Furthermore, we utilized a text-mining approach to extract the necessary information from our corpus. Our results show that the usage of advanced missing data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation is steadily growing in the examination period. Additionally, simpler methods, like listwise and pairwise deletion, are still in widespread use.
Finding_And_Visualizing_Missing_Data_Python
kaggle.com
zip
Updated Nov 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dr. Nagendra (2025). Finding_And_Visualizing_Missing_Data_Python [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/finding-and-visualizing-missing-data-python
Explore at:
zip(371581 bytes)Available download formats
Dataset updated
Nov 29, 2025
Authors
Dr. Nagendra
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
• This dataset is designed for learning how to identify missing data in Python.
• It focuses on techniques to detect null, NaN, and incomplete values.
• It includes examples of visualizing missing data patterns using Python libraries.
• Useful for beginners practicing data preprocessing and data cleaning.
• Helps users understand missing data handling methods for machine learning workflows.
• Supports practical exploration of datasets before model training.
Data from: Missing Data in the Uniform Crime Reports (UCR), 1977-2000...
catalog.data.gov
icpsr.umich.edu
Updated Mar 12, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Justice (2025). Missing Data in the Uniform Crime Reports (UCR), 1977-2000 [United States] [Dataset]. https://catalog.data.gov/dataset/missing-data-in-the-uniform-crime-reports-ucr-1977-2000-united-states-4b340
Explore at:
Dataset updated
Mar 12, 2025
Dataset provided by
National Institute of Justicehttp://nij.ojp.gov/
Area covered
United States
Description
This study reexamined and recoded missing data in the Uniform Crime Reports (UCR) for the years 1977 to 2000 for all police agencies in the United States. The principal investigator conducted a data cleaning of 20,067 Originating Agency Identifiers (ORIs) contained within the Offenses-Known UCR data from 1977 to 2000. Data cleaning involved performing agency name checks and creating new numerical codes for different types of missing data including missing data codes that identify whether a record was aggregated to a particular month, whether no data were reported (true missing), if more than one index crime was missing, if a particular index crime (motor vehicle theft, larceny, burglary, assault, robbery, rape, murder) was missing, researcher assigned missing value codes according to the "rule of 20", outlier values, whether an ORI was covered by another agency, and whether an agency did not exist during a particular time period.
Retail Product Dataset with Missing Values
kaggle.com
zip
Updated Feb 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Himel Sarder (2025). Retail Product Dataset with Missing Values [Dataset]. https://www.kaggle.com/datasets/himelsarder/retail-product-dataset-with-missing-values
Explore at:
zip(47826 bytes)Available download formats
Dataset updated
Feb 17, 2025
Authors
Himel Sarder
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).

The dataset includes:
- Category (Categorical): Product category (A, B, C, D)
- Price (Numerical): Randomized product prices
- Rating (Numerical): Ratings between 1 to 5
- Stock (Categorical): Availability status (In Stock, Out of Stock)
- Discount (Numerical): Discount percentage

This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
Statistical Methods for Missing Data in Large Observational Studies [Methods...
icpsr.umich.edu
Updated Oct 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Long, Qi (2025). Statistical Methods for Missing Data in Large Observational Studies [Methods Study], Georgia, 2013-2018 [Dataset]. http://doi.org/10.3886/ICPSR39526.v1
Explore at:
Unique identifier
https://doi.org/10.3886/ICPSR39526.v1
Dataset updated
Oct 27, 2025
Dataset provided by
Inter-university Consortium for Political and Social Researchhttps://www.icpsr.umich.edu/web/pages/
Authors
Long, Qi
License
https://www.icpsr.umich.edu/web/ICPSR/studies/39526/termshttps://www.icpsr.umich.edu/web/ICPSR/studies/39526/terms
Time period covered
2013 - 2018
Area covered
Georgia, United States
Description
Health registries record data about patients with a specific health problem. These data may include age, weight, blood pressure, health problems, medical test results, and treatments received. But data in some patient records may be missing. For example, some patients may not report their weight or all of their health problems. Research studies can use data from health registries to learn how well treatments work. But missing data can lead to incorrect results. To address the problem, researchers often exclude patient records with missing data from their studies. But doing this can also lead to incorrect results. The fewer records that researchers use, the greater the chance for incorrect results. Missing data also lead to another problem: it is harder for researchers to find patient traits that could affect diagnosis and treatment. For example, patients who are overweight may get heart disease. But if data are missing, it is hard for researchers to be sure that trait could affect diagnosis and treatment of heart disease. In this study, the research team developed new statistical methods to fill in missing data in large studies. The team also developed methods to use when data are missing to help find patient traits that could affect diagnosis and treatment. To access the methods, software, and R package, please visit the Long Research Group website.
Data from: Evaluating Supplemental Samples in Longitudinal Research:...
tandf.figshare.com
txt
Updated Feb 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Laura K. Taylor; Xin Tong; Scott E. Maxwell (2024). Evaluating Supplemental Samples in Longitudinal Research: Replacement and Refreshment Approaches [Dataset]. http://doi.org/10.6084/m9.figshare.12162072.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12162072.v1
Dataset updated
Feb 9, 2024
Dataset provided by
Taylor & Francishttps://taylorandfrancis.com/
Authors
Laura K. Taylor; Xin Tong; Scott E. Maxwell
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Despite the wide application of longitudinal studies, they are often plagued by missing data and attrition. The majority of methodological approaches focus on participant retention or modern missing data analysis procedures. This paper, however, takes a new approach by examining how researchers may supplement the sample with additional participants. First, refreshment samples use the same selection criteria as the initial study. Second, replacement samples identify auxiliary variables that may help explain patterns of missingness and select new participants based on those characteristics. A simulation study compares these two strategies for a linear growth model with five measurement occasions. Overall, the results suggest that refreshment samples lead to less relative bias, greater relative efficiency, and more acceptable coverage rates than replacement samples or not supplementing the missing participants in any way. Refreshment samples also have high statistical power. The comparative strengths of the refreshment approach are further illustrated through a real data example. These findings have implications for assessing change over time when researching at-risk samples with high levels of permanent attrition.
f
Comparison of missing values, ‘don’t know’ values and inconsistent values...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated May 21, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Van Hal, Guido; Van der Heyden, Johan; Braekman, Elise; Charafeddine, Rana; Demarest, Stefaan; Gisle, Lydia; Tafforeau, Jean; Berete, Finaba; Molenberghs, Geert; Drieskens, Sabine (2018). Comparison of missing values, ‘don’t know’ values and inconsistent values between the paper-and-pencil and web-based mode and number of data entry mistakes in the paper-and-pencil mode (n = 149). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000729296
Explore at:
Dataset updated
May 21, 2018
Authors
Van Hal, Guido; Van der Heyden, Johan; Braekman, Elise; Charafeddine, Rana; Demarest, Stefaan; Gisle, Lydia; Tafforeau, Jean; Berete, Finaba; Molenberghs, Geert; Drieskens, Sabine
Description
Comparison of missing values, ‘don’t know’ values and inconsistent values between the paper-and-pencil and web-based mode and number of data entry mistakes in the paper-and-pencil mode (n = 149).
Datasheet3_Assessing disparities through missing race and ethnicity data:...
frontiersin.figshare.com
datasetcatalog.nlm.nih.gov
pdf
Updated Jul 24, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Katelyn M. Banschbach; Jade Singleton; Xing Wang; Sheetal S. Vora; Julia G. Harris; Ashley Lytch; Nancy Pan; Julia Klauss; Danielle Fair; Erin Hammelev; Mileka Gilbert; Connor Kreese; Ashley Machado; Peter Tarczy-Hornoch; Esi M. Morgan (2024). Datasheet3_Assessing disparities through missing race and ethnicity data: results from a juvenile arthritis registry.pdf [Dataset]. http://doi.org/10.3389/fped.2024.1430981.s003
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.3389/fped.2024.1430981.s003
Dataset updated
Jul 24, 2024
Dataset provided by
Frontiers Mediahttp://www.frontiersin.org/
Authors
Katelyn M. Banschbach; Jade Singleton; Xing Wang; Sheetal S. Vora; Julia G. Harris; Ashley Lytch; Nancy Pan; Julia Klauss; Danielle Fair; Erin Hammelev; Mileka Gilbert; Connor Kreese; Ashley Machado; Peter Tarczy-Hornoch; Esi M. Morgan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IntroductionEnsuring high-quality race and ethnicity data within the electronic health record (EHR) and across linked systems, such as patient registries, is necessary to achieving the goal of inclusion of racial and ethnic minorities in scientific research and detecting disparities associated with race and ethnicity. The project goal was to improve race and ethnicity data completion within the Pediatric Rheumatology Care Outcomes Improvement Network and assess impact of improved data completion on conclusions drawn from the registry.MethodsThis is a mixed-methods quality improvement study that consisted of five parts, as follows: (1) Identifying baseline missing race and ethnicity data, (2) Surveying current collection and entry, (3) Completing data through audit and feedback cycles, (4) Assessing the impact on outcome measures, and (5) Conducting participant interviews and thematic analysis.ResultsAcross six participating centers, 29% of the patients were missing data on race and 31% were missing data on ethnicity. Of patients missing data, most patients were missing both race and ethnicity. Rates of missingness varied by data entry method (electronic vs. manual). Recovered data had a higher percentage of patients with Other race or Hispanic/Latino ethnicity compared with patients with non-missing race and ethnicity data at baseline. Black patients had a significantly higher odds ratio of having a clinical juvenile arthritis disease activity score (cJADAS10) of ≥5 at first follow-up compared with White patients. There was no significant change in odds ratio of cJADAS10 ≥5 for race and ethnicity after data completion. Patients missing race and ethnicity were more likely to be missing cJADAS values, which may affect the ability to detect changes in odds ratio of cJADAS ≥5 after completion.ConclusionsAbout one-third of the patients in a pediatric rheumatology registry were missing race and ethnicity data. After three audit and feedback cycles, centers decreased missing data by 94%, primarily via data recovery from the EHR. In this sample, completion of missing data did not change the findings related to differential outcomes by race. Recovered data were not uniformly distributed compared with those with non-missing race and ethnicity data at baseline, suggesting that differences in outcomes after completing race and ethnicity data may be seen with larger sample sizes.
Sensitivity Analysis Tools for Clinical Trials with Missing Data [Methods...
icpsr.umich.edu
Updated Sep 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Scharfstein, Daniel O. (2025). Sensitivity Analysis Tools for Clinical Trials with Missing Data [Methods Study], 2013-2018 [Dataset]. http://doi.org/10.3886/ICPSR39492.v1
Explore at:
Unique identifier
https://doi.org/10.3886/ICPSR39492.v1
Dataset updated
Sep 15, 2025
Dataset provided by
Inter-university Consortium for Political and Social Researchhttps://www.icpsr.umich.edu/web/pages/
Authors
Scharfstein, Daniel O.
License
https://www.icpsr.umich.edu/web/ICPSR/studies/39492/termshttps://www.icpsr.umich.edu/web/ICPSR/studies/39492/terms
Time period covered
2013 - 2018
Area covered
United States
Description
Clinical trials study the effects of medical treatments, like how safe they are and how well they work. But most clinical trials don't get all the data they need from patients. Patients may not answer all questions on a survey, or they may drop out of a study after it has started. The missing data can affect researchers' ability to detect the effects of treatments. To address the problem of missing data, researchers can make different guesses based on why and how data are missing. Then they can look at results for each guess. If results based on different guesses are similar, researchers can have more confidence that the study results are accurate. In this study, the research team created new methods to do these tests and developed software that runs these tests. To access the sensitivity analysis methods and software, please visit the MissingDataMatters website.
Imputation missing values in the nominal datasets
kaggle.com
zip
Updated Jan 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Awsan thabet salem (2023). Imputation missing values in the nominal datasets [Dataset]. https://www.kaggle.com/datasets/awsanthabetsalem/imputation-in-arabic-dataset/data
Explore at:
zip(16588335 bytes)Available download formats
Dataset updated
Jan 29, 2023
Authors
Awsan thabet salem
Description
The folder contains three datasets: Zomato restaurants, Restaurants on Yellow Pages, and Arabic poetry. Where all datasets have been taken from Kaggle and made some modifications by adding missing values, where the missing values are referred to as symbol (?). The experiment has been done to experiment with the processes of imputation missing values on nominal values. The missing values in the three datasets are in the range of 10%-80%.

The Arabic dataset has several modifications as follows: 1. Delete the columns that contain English values such as Id, poem_link, poet link. The reason is the need to evaluate the ERAR method on the Arabic data set. 2. Add diacritical marks to some records to check the effect of diacritical marks during frequent itemset generation. note: the results of the experiment on the Arabic dataset will be find in the paper under the title "Missing values imputation in Arabic datasets using enhanced robust association rules"
Sales Dataset v2 for Marketing Analytics
kaggle.com
zip
Updated Jun 26, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Emmanuel DJEGOU (2022). Sales Dataset v2 for Marketing Analytics [Dataset]. https://www.kaggle.com/datasets/emmanueldjegou/sales-dataset-enlarged
Explore at:
zip(1114 bytes)Available download formats
Dataset updated
Jun 26, 2022
Authors
Emmanuel DJEGOU
Description
Looking painstakingly at the dataset, it's noticeable that some inconsistencies are messing up our data. In fact, the columns Product and line should count for a sigle attribut. Then, the actual observation should be Camping Equipment. Similarily, columns such as Retailer and country, are undergoing the same issue. In addition, the values of the rows regarding the attributs order and method do not convey any relevant information. Consequently, some supplemental work need to be done in the analysis.
d
Morpho missing data? 1
dune.com
Updated Nov 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
vegarsti (2025). Morpho missing data? 1 [Dataset]. https://dune.com/discover/content/relevant?resource-type=queries&q=code%3A%22morpho_blue_multichain.morphoblue_evt_supply%22
Explore at:
Dataset updated
Nov 4, 2025
Authors
vegarsti
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Blockchain data query: Morpho missing data? 1
d
Morpho missing data? 2
dune.com
Updated Nov 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
vegarsti (2025). Morpho missing data? 2 [Dataset]. https://dune.com/discover/content/relevant?resource-type=queries&q=code%3A%22morpho_blue_multichain.morphoblue_evt_supply%22
Explore at:
Dataset updated
Nov 4, 2025
Authors
vegarsti
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Blockchain data query: Morpho missing data? 2
Used car dataset for data cleaning practice
kaggle.com
zip
Updated Feb 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Peachji (2024). Used car dataset for data cleaning practice [Dataset]. https://www.kaggle.com/datasets/peachji/car-dataset-for-data-cleaning-practice/code
Explore at:
zip(245562 bytes)Available download formats
Dataset updated
Feb 7, 2024
Authors
Peachji
License
https://cdla.io/permissive-1-0/https://cdla.io/permissive-1-0/
Description
Used car dataset 🚗

Due to the expanding used car market, sellers need to be aware of the variables affecting vehicle values. It is essential to comprehend these effects, given the plethora of factors. This information can be examined to gain insights by looking through this used car pricing dataset. Business question : To investigate potential factors influencing used car prices

Task

Before gaining insights from the data, it's crucial to carefully identify and address missing values, employing the most effective methods for imputation.
D
Data from: Using decision trees to understand structure in missing data
datasetcatalog.nlm.nih.gov
search.dataone.org
+2more
Updated Jun 2, 2015
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mengersen, Kerrie L.; Tierney, Nicholas J.; Harden, Fiona A.; Harden, Maurice J. (2015). Using decision trees to understand structure in missing data [Dataset]. http://doi.org/10.5061/dryad.j4f19
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.j4f19
Dataset updated
Jun 2, 2015
Authors
Mengersen, Kerrie L.; Tierney, Nicholas J.; Harden, Fiona A.; Harden, Maurice J.
Description
Objectives: Demonstrate the application of decision trees—classification and regression trees (CARTs), and their cousins, boosted regression trees (BRTs)—to understand structure in missing data. Setting: Data taken from employees at 3 different industrial sites in Australia. Participants: 7915 observations were included. Materials and methods: The approach was evaluated using an occupational health data set comprising results of questionnaires, medical tests and environmental monitoring. Statistical methods included standard statistical tests and the ‘rpart’ and ‘gbm’ packages for CART and BRT analyses, respectively, from the statistical software ‘R’. A simulation study was conducted to explore the capability of decision tree models in describing data with missingness artificially introduced. Results: CART and BRT models were effective in highlighting a missingness structure in the data, related to the type of data (medical or environmental), the site in which it was collected, the number of visits, and the presence of extreme values. The simulation study revealed that CART models were able to identify variables and values responsible for inducing missingness. There was greater variation in variable importance for unstructured as compared to structured missingness. Discussion: Both CART and BRT models were effective in describing structural missingness in data. CART models may be preferred over BRT models for exploratory analysis of missing data, and selecting variables important for predicting missingness. BRT models can show how values of other variables influence missingness, which may prove useful for researchers. Conclusions: Researchers are encouraged to use CART and BRT models to explore and understand missing data.
f
Effect of missing data on topological inference using a total evidence...
figshare.com
bin
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thomas Guillerme; Natalie Cooper (2023). Effect of missing data on topological inference using a total evidence approach [Dataset]. http://doi.org/10.6084/m9.figshare.1306861.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.1306861.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Thomas Guillerme; Natalie Cooper
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
To fully understand macroevolutionary patterns and processes, we need to include both extant and extinct species in our models. This requires phylogenetic trees with both living and fossil taxa at the tips. One way to infer such phylogenies is the Total Evidence approach which uses molecular data from living taxa and morphological data from living and fossil taxa.Although the Total Evidence approach is very promising, it requires a great deal of data that can be hard to collect. Therefore this method is likely to suffer from missing data issues that may affect its ability to infer correct phylogenies.Here we use simulations to assess the effects of missing data on tree topologies inferred from Total Evidence matrices. We investigate three major factors that directly affect the completeness and the size of the morphological part of the matrix: the proportion of living taxa with no morphological data, the amount of missing data in the fossil record, and the overall number of morphological characters in the matrix. We infer phylogenies from complete matrices and from matrices with various amounts of missing data, and then compare missing data topologies to the "best" tree topology inferred using the complete matrix.We find that the number of living taxa with morphological characters and the overall number of morphological characters in the matrix, are more important than the amount of missing data in the fossil record for recovering the "best" tree topology. Therefore, we suggest that sampling effort should be focused on morphological data collection for living species to increase the accuracy of topological inference in a Total Evidence framework. Additionally, we find that Bayesian methods consistently outperform other tree inference methods. We therefore recommend using Bayesian consensus trees to fix the tree topology prior to further analyses.
d
Fantom.traces Missing Data Example
dune.com
Updated Jun 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
cryptuschrist (2023). Fantom.traces Missing Data Example [Dataset]. https://dune.com/discover/content/relevant?resource-type=queries&q=code%3A%22fantom.traces%22
Explore at:
Dataset updated
Jun 4, 2023
Authors
cryptuschrist
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Blockchain data query: Fantom.traces Missing Data Example
f
Data from: Sparse Functional Boxplots for Multivariate Curves
tandf.figshare.com
datasetcatalog.nlm.nih.gov
bin
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zhuo Qu; Marc G. Genton (2023). Sparse Functional Boxplots for Multivariate Curves [Dataset]. http://doi.org/10.6084/m9.figshare.19617397.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.19617397.v1
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Zhuo Qu; Marc G. Genton
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This paper introduces the sparse functional boxplot and the intensity sparse functional boxplot as practical exploratory tools. Besides being available for complete functional data, they can be used in sparse univariate and multivariate functional data. The sparse functional boxplot, based on the functional boxplot, displays sparseness proportions within the 50% central region. The intensity sparse functional boxplot indicates the relative intensity of fitted sparse point patterns in the central region. The two-stage functional boxplot, which derives from the functional boxplot to detect outliers, is furthermore extended to its sparse form. We also contribute to sparse data fitting improvement and sparse multivariate functional data depth. In a simulation study, we evaluate the goodness of data fitting, several depth proposals for sparse multivariate functional data, and compare the results of outlier detection between the sparse functional boxplot and its two-stage version. The practical applications of the sparse functional boxplot and intensity sparse functional boxplot are illustrated with two public health datasets. Supplementary materials and codes are available for readers to apply our visualization tools and replicate the analysis.
d
Morpho missing data?
dune.com
Updated Nov 4, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
vegarsti (2025). Morpho missing data? [Dataset]. https://dune.com/discover/content/relevant?resource-type=queries&q=code%3A%22morpho_blue_multichain.morphoblue_evt_supply%22
Explore at:
Dataset updated
Nov 4, 2025
Authors
vegarsti
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Blockchain data query: Morpho missing data?
Fix the Gaps: Data Hospital Simulation
kaggle.com
zip
Updated Nov 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rajarajeswari P (2025). Fix the Gaps: Data Hospital Simulation [Dataset]. https://www.kaggle.com/datasets/rajarajeswariprr/fix-the-gaps-data-hospital-simulation
Explore at:
zip(24673 bytes)Available download formats
Dataset updated
Nov 25, 2025
Authors
Rajarajeswari P
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Activity Title: "Fix the Gaps: Data Hospital Simulation" (This activity is created for students to practice techniques to handle missing data)

Description: Provide each team with a “broken patient record” dataset (incomplete entries with NaNs or blanks). Teams act as data doctors: • Diagnose the type of missingness (MCAR, MAR, MNAR) • Choose suitable imputation techniques (mean, median, KNN, regression) • Compare outcomes from different methods

Tools: Jupyter notebook / Pandas

Outcome: Group presentation on the impact of imputation and justification of the method used.

Facebook

Twitter

Click to copy link

Link copied

Cite

Krisztián Boros; Zoltán Kmetty (2023). Identifying Missing Data Handling Methods with Text Mining [Dataset]. http://doi.org/10.3886/E185961V1

Identifying Missing Data Handling Methods with Text Mining

Explore at:

delimitedAvailable download formats

Unique identifier

https://doi.org/10.3886/E185961V1

Dataset updated

Mar 8, 2023

Dataset provided by

Hungarian Academy of Sciences

Authors

Krisztián Boros; Zoltán Kmetty

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Jan 1, 1999 - Dec 31, 2016

Description

Missing data is an inevitable aspect of every empirical research. Researchers developed several techniques to handle missing data to avoid information loss and biases. Over the past 50 years, these methods have become more and more efficient and also more complex. Building on previous review studies, this paper aims to analyze what kind of missing data handling methods are used among various scientific disciplines. For the analysis, we used nearly 50.000 scientific articles that were published between 1999 and 2016. JSTOR provided the data in text format. Furthermore, we utilized a text-mining approach to extract the necessary information from our corpus. Our results show that the usage of advanced missing data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation is steadily growing in the examination period. Additionally, simpler methods, like listwise and pairwise deletion, are still in widespread use.

Clear search

Close search

Google apps

Main menu

Identifying Missing Data Handling Methods with Text Mining

Finding_And_Visualizing_Missing_Data_Python

Data from: Missing Data in the Uniform Crime Reports (UCR), 1977-2000...

Retail Product Dataset with Missing Values

Statistical Methods for Missing Data in Large Observational Studies [Methods...

Data from: Evaluating Supplemental Samples in Longitudinal Research:...

Comparison of missing values, ‘don’t know’ values and inconsistent values...

Datasheet3_Assessing disparities through missing race and ethnicity data:...

Sensitivity Analysis Tools for Clinical Trials with Missing Data [Methods...

Imputation missing values in the nominal datasets

Sales Dataset v2 for Marketing Analytics

Morpho missing data? 1

Morpho missing data? 2

Used car dataset for data cleaning practice

Used car dataset 🚗

Task

Data from: Using decision trees to understand structure in missing data

Effect of missing data on topological inference using a total evidence...

Fantom.traces Missing Data Example

Data from: Sparse Functional Boxplots for Multivariate Curves

Morpho missing data?

Fix the Gaps: Data Hospital Simulation

Identifying Missing Data Handling Methods with Text Mining