100+ datasets found
  1. Identifying Missing Data Handling Methods with Text Mining

    • openicpsr.org
    delimited
    Updated Mar 8, 2023
    Cite
    Krisztián Boros; Zoltán Kmetty (2023). Identifying Missing Data Handling Methods with Text Mining [Dataset]. http://doi.org/10.3886/E185961V1
    Explore at:
    Available download formats: delimited
    Dataset updated
    Mar 8, 2023
    Dataset provided by
    Hungarian Academy of Sciences
    Authors
    Krisztián Boros; Zoltán Kmetty
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1999 - Dec 31, 2016
    Description

    Missing data is an inevitable aspect of all empirical research. Researchers have developed several techniques to handle missing data to avoid information loss and biases. Over the past 50 years, these methods have become more efficient and also more complex. Building on previous review studies, this paper aims to analyze what kinds of missing data handling methods are used across various scientific disciplines. For the analysis, we used nearly 50,000 scientific articles that were published between 1999 and 2016. JSTOR provided the data in text format. Furthermore, we utilized a text-mining approach to extract the necessary information from our corpus. Our results show that the usage of advanced missing data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation grew steadily over the examination period. Additionally, simpler methods, like listwise and pairwise deletion, are still in widespread use.
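The description names listwise and pairwise deletion without showing how they differ. The sketch below is a minimal illustration in plain Python, not the authors' code: the toy records and the helper names `is_missing`, `listwise`, and `pairwise` are assumptions made for this example.

```python
import math

# Hypothetical toy records; None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 2.0, "y": None, "z": 5.0},
    {"x": 3.0, "y": 6.0, "z": None},
    {"x": 4.0, "y": 8.0, "z": 9.0},
]


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def listwise(rows, columns):
    """Listwise deletion: keep only rows complete on every listed column."""
    return [r for r in rows if not any(is_missing(r[c]) for c in columns)]


def pairwise(rows, a, b):
    """Pairwise deletion: keep rows complete on just the two columns analyzed."""
    return [(r[a], r[b]) for r in rows
            if not is_missing(r[a]) and not is_missing(r[b])]


complete_cases = listwise(rows, ["x", "y", "z"])
xy_pairs = pairwise(rows, "x", "y")
xz_pairs = pairwise(rows, "x", "z")
print(len(complete_cases), len(xy_pairs), len(xz_pairs))
```

Listwise deletion leaves fewer rows here than either pairwise selection, which is the information loss the description refers to. Pairwise deletion keeps more data, but each statistic is then computed on a different subset of rows.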

  2. Finding_And_Visualizing_Missing_Data_Python

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). Finding_And_Visualizing_Missing_Data_Python [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/finding-and-visualizing-missing-data-python
    Explore at:
    Available download formats: zip (371581 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    • This dataset is designed for learning how to identify missing data in Python.
    • It focuses on techniques to detect null, NaN, and incomplete values.
    • It includes examples of visualizing missing data patterns using Python libraries.
    • Useful for beginners practicing data preprocessing and data cleaning.
    • Helps users understand missing data handling methods for machine learning workflows.
    • Supports practical exploration of datasets before model training.
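As a rough illustration of the detection and visualization steps listed above, the following sketch uses only the Python standard library. It is not taken from the dataset; the sample table and the names `is_missing` and `missingness_map` are assumptions for this example.

```python
import math


def is_missing(value):
    """Treat None, float NaN, and blank strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


# Hypothetical sample table (column name -> list of values).
table = {
    "age": [34, None, 51, float("nan")],
    "city": ["Oslo", "", "Lima", "Pune"],
    "score": [0.7, 0.4, 0.9, 0.1],
}

# Number of missing entries per column.
missing_counts = {
    name: sum(is_missing(v) for v in values) for name, values in table.items()
}


def missingness_map(table):
    """One text line per row: '#' where a value is present, '.' where missing."""
    n_rows = len(next(iter(table.values())))
    return [
        "".join("." if is_missing(table[c][i]) else "#" for c in table)
        for i in range(n_rows)
    ]


pattern = missingness_map(table)
print(missing_counts)
print("\n".join(pattern))
```

The text map is a stand-in for the graphical missingness plots that plotting libraries provide. It makes it easy to see which rows have gaps in several columns at once.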

  3. Data from: Missing Data in the Uniform Crime Reports (UCR), 1977-2000...

    • catalog.data.gov
    • icpsr.umich.edu
    Updated Mar 12, 2025
    Cite
    National Institute of Justice (2025). Missing Data in the Uniform Crime Reports (UCR), 1977-2000 [United States] [Dataset]. https://catalog.data.gov/dataset/missing-data-in-the-uniform-crime-reports-ucr-1977-2000-united-states-4b340
    Explore at:
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    National Institute of Justice: http://nij.ojp.gov/
    Area covered
    United States
    Description

    This study reexamined and recoded missing data in the Uniform Crime Reports (UCR) for the years 1977 to 2000 for all police agencies in the United States. The principal investigator conducted a data cleaning of 20,067 Originating Agency Identifiers (ORIs) contained within the Offenses-Known UCR data from 1977 to 2000. Data cleaning involved performing agency name checks and creating new numerical codes for different types of missing data including missing data codes that identify whether a record was aggregated to a particular month, whether no data were reported (true missing), if more than one index crime was missing, if a particular index crime (motor vehicle theft, larceny, burglary, assault, robbery, rape, murder) was missing, researcher assigned missing value codes according to the "rule of 20", outlier values, whether an ORI was covered by another agency, and whether an agency did not exist during a particular time period.

  4. Retail Product Dataset with Missing Values

    • kaggle.com
    zip
    Updated Feb 17, 2025
    Cite
    Himel Sarder (2025). Retail Product Dataset with Missing Values [Dataset]. https://www.kaggle.com/datasets/himelsarder/retail-product-dataset-with-missing-values
    Explore at:
    Available download formats: zip (47826 bytes)
    Dataset updated
    Feb 17, 2025
    Authors
    Himel Sarder
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).

    The dataset includes:
    - Category (Categorical): Product category (A, B, C, D)
    - Price (Numerical): Randomized product prices
    - Rating (Numerical): Ratings from 1 to 5
    - Stock (Categorical): Availability status (In Stock, Out of Stock)
    - Discount (Numerical): Discount percentage

    This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
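One common way to practice imputation on a table that mixes numerical and categorical columns is to fill numerical gaps with the median and categorical gaps with the mode. The sketch below assumes that approach. The column names `Price` and `Stock` come from the description, but the sample rows and the `impute` helper are hypothetical.

```python
from statistics import median, mode

# Hypothetical rows using column names from the description;
# None marks a missing value.
rows = [
    {"Price": 10.0, "Stock": "In Stock"},
    {"Price": None, "Stock": "In Stock"},
    {"Price": 30.0, "Stock": None},
    {"Price": 80.0, "Stock": "Out of Stock"},
]


def impute(rows, numeric, categorical):
    """Return copies of rows with numeric gaps set to the column median
    and categorical gaps set to the column mode."""
    fill = {}
    for col in numeric:
        fill[col] = median(r[col] for r in rows if r[col] is not None)
    for col in categorical:
        fill[col] = mode(r[col] for r in rows if r[col] is not None)
    return [
        {c: (fill[c] if v is None and c in fill else v) for c, v in r.items()}
        for r in rows
    ]


cleaned = impute(rows, numeric=["Price"], categorical=["Stock"])
print(cleaned)
```

With a column that is 63% missing, as the description reports, a single fill value would dominate the column, so comparing the distribution before and after imputation is a sensible check.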

  5. Statistical Methods for Missing Data in Large Observational Studies [Methods...

    • icpsr.umich.edu
    Updated Oct 27, 2025
    Cite
    Long, Qi (2025). Statistical Methods for Missing Data in Large Observational Studies [Methods Study], Georgia, 2013-2018 [Dataset]. http://doi.org/10.3886/ICPSR39526.v1
    Explore at:
    Dataset updated
    Oct 27, 2025
    Dataset provided by
    Inter-university Consortium for Political and Social Research: https://www.icpsr.umich.edu/web/pages/
    Authors
    Long, Qi
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/39526/terms

    Time period covered
    2013 - 2018
    Area covered
    Georgia, United States
    Description

    Health registries record data about patients with a specific health problem. These data may include age, weight, blood pressure, health problems, medical test results, and treatments received. But data in some patient records may be missing. For example, some patients may not report their weight or all of their health problems. Research studies can use data from health registries to learn how well treatments work. But missing data can lead to incorrect results. To address the problem, researchers often exclude patient records with missing data from their studies. But doing this can also lead to incorrect results. The fewer records that researchers use, the greater the chance for incorrect results. Missing data also lead to another problem: it is harder for researchers to find patient traits that could affect diagnosis and treatment. For example, patients who are overweight may get heart disease. But if data are missing, it is hard for researchers to be sure that trait could affect diagnosis and treatment of heart disease. In this study, the research team developed new statistical methods to fill in missing data in large studies. The team also developed methods to use when data are missing to help find patient traits that could affect diagnosis and treatment. To access the methods, software, and R package, please visit the Long Research Group website.

  6. Data from: Evaluating Supplemental Samples in Longitudinal Research:...

    • tandf.figshare.com
    txt
    Updated Feb 9, 2024
    Cite
    Laura K. Taylor; Xin Tong; Scott E. Maxwell (2024). Evaluating Supplemental Samples in Longitudinal Research: Replacement and Refreshment Approaches [Dataset]. http://doi.org/10.6084/m9.figshare.12162072.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 9, 2024
    Dataset provided by
    Taylor & Francis: https://taylorandfrancis.com/
    Authors
    Laura K. Taylor; Xin Tong; Scott E. Maxwell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Despite the wide application of longitudinal studies, they are often plagued by missing data and attrition. The majority of methodological approaches focus on participant retention or modern missing data analysis procedures. This paper, however, takes a new approach by examining how researchers may supplement the sample with additional participants. First, refreshment samples use the same selection criteria as the initial study. Second, replacement samples identify auxiliary variables that may help explain patterns of missingness and select new participants based on those characteristics. A simulation study compares these two strategies for a linear growth model with five measurement occasions. Overall, the results suggest that refreshment samples lead to less relative bias, greater relative efficiency, and more acceptable coverage rates than replacement samples or not supplementing the missing participants in any way. Refreshment samples also have high statistical power. The comparative strengths of the refreshment approach are further illustrated through a real data example. These findings have implications for assessing change over time when researching at-risk samples with high levels of permanent attrition.

  7. Comparison of missing values, ‘don’t know’ values and inconsistent values...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 21, 2018
    Cite
    Van Hal, Guido; Van der Heyden, Johan; Braekman, Elise; Charafeddine, Rana; Demarest, Stefaan; Gisle, Lydia; Tafforeau, Jean; Berete, Finaba; Molenberghs, Geert; Drieskens, Sabine (2018). Comparison of missing values, ‘don’t know’ values and inconsistent values between the paper-and-pencil and web-based mode and number of data entry mistakes in the paper-and-pencil mode (n = 149). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000729296
    Explore at:
    Dataset updated
    May 21, 2018
    Authors
    Van Hal, Guido; Van der Heyden, Johan; Braekman, Elise; Charafeddine, Rana; Demarest, Stefaan; Gisle, Lydia; Tafforeau, Jean; Berete, Finaba; Molenberghs, Geert; Drieskens, Sabine
    Description

    Comparison of missing values, ‘don’t know’ values and inconsistent values between the paper-and-pencil and web-based mode and number of data entry mistakes in the paper-and-pencil mode (n = 149).

  8. Datasheet3_Assessing disparities through missing race and ethnicity data:...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Jul 24, 2024
    Cite
    Katelyn M. Banschbach; Jade Singleton; Xing Wang; Sheetal S. Vora; Julia G. Harris; Ashley Lytch; Nancy Pan; Julia Klauss; Danielle Fair; Erin Hammelev; Mileka Gilbert; Connor Kreese; Ashley Machado; Peter Tarczy-Hornoch; Esi M. Morgan (2024). Datasheet3_Assessing disparities through missing race and ethnicity data: results from a juvenile arthritis registry.pdf [Dataset]. http://doi.org/10.3389/fped.2024.1430981.s003
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 24, 2024
    Dataset provided by
    Frontiers Media: http://www.frontiersin.org/
    Authors
    Katelyn M. Banschbach; Jade Singleton; Xing Wang; Sheetal S. Vora; Julia G. Harris; Ashley Lytch; Nancy Pan; Julia Klauss; Danielle Fair; Erin Hammelev; Mileka Gilbert; Connor Kreese; Ashley Machado; Peter Tarczy-Hornoch; Esi M. Morgan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Ensuring high-quality race and ethnicity data within the electronic health record (EHR) and across linked systems, such as patient registries, is necessary to achieving the goal of inclusion of racial and ethnic minorities in scientific research and detecting disparities associated with race and ethnicity. The project goal was to improve race and ethnicity data completion within the Pediatric Rheumatology Care Outcomes Improvement Network and assess impact of improved data completion on conclusions drawn from the registry.

    Methods: This is a mixed-methods quality improvement study that consisted of five parts, as follows: (1) Identifying baseline missing race and ethnicity data, (2) Surveying current collection and entry, (3) Completing data through audit and feedback cycles, (4) Assessing the impact on outcome measures, and (5) Conducting participant interviews and thematic analysis.

    Results: Across six participating centers, 29% of the patients were missing data on race and 31% were missing data on ethnicity. Of patients missing data, most patients were missing both race and ethnicity. Rates of missingness varied by data entry method (electronic vs. manual). Recovered data had a higher percentage of patients with Other race or Hispanic/Latino ethnicity compared with patients with non-missing race and ethnicity data at baseline. Black patients had a significantly higher odds ratio of having a clinical juvenile arthritis disease activity score (cJADAS10) of ≥5 at first follow-up compared with White patients. There was no significant change in odds ratio of cJADAS10 ≥5 for race and ethnicity after data completion. Patients missing race and ethnicity were more likely to be missing cJADAS values, which may affect the ability to detect changes in odds ratio of cJADAS ≥5 after completion.

    Conclusions: About one-third of the patients in a pediatric rheumatology registry were missing race and ethnicity data. After three audit and feedback cycles, centers decreased missing data by 94%, primarily via data recovery from the EHR. In this sample, completion of missing data did not change the findings related to differential outcomes by race. Recovered data were not uniformly distributed compared with those with non-missing race and ethnicity data at baseline, suggesting that differences in outcomes after completing race and ethnicity data may be seen with larger sample sizes.

  9. Sensitivity Analysis Tools for Clinical Trials with Missing Data [Methods...

    • icpsr.umich.edu
    Updated Sep 15, 2025
    Cite
    Scharfstein, Daniel O. (2025). Sensitivity Analysis Tools for Clinical Trials with Missing Data [Methods Study], 2013-2018 [Dataset]. http://doi.org/10.3886/ICPSR39492.v1
    Explore at:
    Dataset updated
    Sep 15, 2025
    Dataset provided by
    Inter-university Consortium for Political and Social Research: https://www.icpsr.umich.edu/web/pages/
    Authors
    Scharfstein, Daniel O.
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/39492/terms

    Time period covered
    2013 - 2018
    Area covered
    United States
    Description

    Clinical trials study the effects of medical treatments, like how safe they are and how well they work. But most clinical trials don't get all the data they need from patients. Patients may not answer all questions on a survey, or they may drop out of a study after it has started. The missing data can affect researchers' ability to detect the effects of treatments. To address the problem of missing data, researchers can make different guesses based on why and how data are missing. Then they can look at results for each guess. If results based on different guesses are similar, researchers can have more confidence that the study results are accurate. In this study, the research team created new methods to do these tests and developed software that runs these tests. To access the sensitivity analysis methods and software, please visit the MissingDataMatters website.

  10. Imputation missing values in the nominal datasets

    • kaggle.com
    zip
    Updated Jan 29, 2023
    Cite
    Awsan thabet salem (2023). Imputation missing values in the nominal datasets [Dataset]. https://www.kaggle.com/datasets/awsanthabetsalem/imputation-in-arabic-dataset/data
    Explore at:
    Available download formats: zip (16588335 bytes)
    Dataset updated
    Jan 29, 2023
    Authors
    Awsan thabet salem
    Description

    The folder contains three datasets: Zomato restaurants, Restaurants on Yellow Pages, and Arabic poetry. All three were taken from Kaggle and modified by adding missing values, which are marked with the symbol (?). The experiment evaluates the imputation of missing values in nominal data. The share of missing values in the three datasets ranges from 10% to 80%.

    The Arabic dataset has several modifications, as follows: 1. Columns that contain English values, such as Id, poem_link, and poet link, were deleted, because the ERAR method needed to be evaluated on Arabic data. 2. Diacritical marks were added to some records to check their effect during frequent itemset generation. Note: the results of the experiment on the Arabic dataset can be found in the paper titled "Missing values imputation in Arabic datasets using enhanced robust association rules".
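Because these files mark missing values with a placeholder symbol rather than leaving cells empty, a loader has to convert the marker before any imputation. The sketch below shows one way to do this with the standard `csv` module. The sample rows and column names are hypothetical and do not come from the actual files.

```python
import csv
import io

# Hypothetical sample; the real files and their column names may differ.
raw = io.StringIO(
    "name,cuisine,city\n"
    "Al Noor,?,Sanaa\n"
    "Blue Fig,Lebanese,?\n"
    "?,Indian,Aden\n"
    "Spice Hut,?,Aden\n"
)

# Convert the '?' placeholder into None so later steps treat it as missing.
records = [
    {key: (None if value == "?" else value) for key, value in row.items()}
    for row in csv.DictReader(raw)
]

# Fraction of missing values per column.
missing_rate = {
    column: sum(r[column] is None for r in records) / len(records)
    for column in records[0]
}
print(missing_rate)
```

Checking the per-column rate first is useful here, since the description says the proportion of missing values varies widely across the datasets.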

  11. Sales Dataset v2 for Marketing Analytics

    • kaggle.com
    zip
    Updated Jun 26, 2022
    Cite
    Emmanuel DJEGOU (2022). Sales Dataset v2 for Marketing Analytics [Dataset]. https://www.kaggle.com/datasets/emmanueldjegou/sales-dataset-enlarged
    Explore at:
    Available download formats: zip (1114 bytes)
    Dataset updated
    Jun 26, 2022
    Authors
    Emmanuel DJEGOU
    Description

    Looking closely at the dataset, it is noticeable that some inconsistencies are messing up the data. In fact, the columns Product and line should count as a single attribute. Then, the actual observation should be Camping Equipment. Similarly, columns such as Retailer and country have the same issue. In addition, the values of the rows for the attributes order and method do not convey any relevant information. Consequently, some supplemental work needs to be done in the analysis.

  12. Morpho missing data? 1

    • dune.com
    Updated Nov 4, 2025
    Cite
    vegarsti (2025). Morpho missing data? 1 [Dataset]. https://dune.com/discover/content/relevant?resource-type=queries&q=code%3A%22morpho_blue_multichain.morphoblue_evt_supply%22
    Explore at:
    Dataset updated
    Nov 4, 2025
    Authors
    vegarsti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Blockchain data query: Morpho missing data? 1

  13. Morpho missing data? 2

    • dune.com
    Updated Nov 4, 2025
    Cite
    vegarsti (2025). Morpho missing data? 2 [Dataset]. https://dune.com/discover/content/relevant?resource-type=queries&q=code%3A%22morpho_blue_multichain.morphoblue_evt_supply%22
    Explore at:
    Dataset updated
    Nov 4, 2025
    Authors
    vegarsti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Blockchain data query: Morpho missing data? 2

  14. Used car dataset for data cleaning practice

    • kaggle.com
    zip
    Updated Feb 7, 2024
    Cite
    Peachji (2024). Used car dataset for data cleaning practice [Dataset]. https://www.kaggle.com/datasets/peachji/car-dataset-for-data-cleaning-practice/code
    Explore at:
    Available download formats: zip (245562 bytes)
    Dataset updated
    Feb 7, 2024
    Authors
    Peachji
    License

    https://cdla.io/permissive-1-0/

    Description

    Used car dataset 🚗

    Due to the expanding used car market, sellers need to be aware of the variables affecting vehicle values. It is essential to comprehend these effects, given the plethora of factors. This information can be examined to gain insights by looking through this used car pricing dataset. Business question : To investigate potential factors influencing used car prices

    Task

    Before gaining insights from the data, it's crucial to carefully identify and address missing values, employing the most effective methods for imputation.

  15. Data from: Using decision trees to understand structure in missing data

    • datasetcatalog.nlm.nih.gov
    • search.dataone.org
    Updated Jun 2, 2015
    Cite
    Mengersen, Kerrie L.; Tierney, Nicholas J.; Harden, Fiona A.; Harden, Maurice J. (2015). Using decision trees to understand structure in missing data [Dataset]. http://doi.org/10.5061/dryad.j4f19
    Explore at:
    Dataset updated
    Jun 2, 2015
    Authors
    Mengersen, Kerrie L.; Tierney, Nicholas J.; Harden, Fiona A.; Harden, Maurice J.
    Description

    Objectives: Demonstrate the application of decision trees—classification and regression trees (CARTs), and their cousins, boosted regression trees (BRTs)—to understand structure in missing data. Setting: Data taken from employees at 3 different industrial sites in Australia. Participants: 7915 observations were included. Materials and methods: The approach was evaluated using an occupational health data set comprising results of questionnaires, medical tests and environmental monitoring. Statistical methods included standard statistical tests and the ‘rpart’ and ‘gbm’ packages for CART and BRT analyses, respectively, from the statistical software ‘R’. A simulation study was conducted to explore the capability of decision tree models in describing data with missingness artificially introduced. Results: CART and BRT models were effective in highlighting a missingness structure in the data, related to the type of data (medical or environmental), the site in which it was collected, the number of visits, and the presence of extreme values. The simulation study revealed that CART models were able to identify variables and values responsible for inducing missingness. There was greater variation in variable importance for unstructured as compared to structured missingness. Discussion: Both CART and BRT models were effective in describing structural missingness in data. CART models may be preferred over BRT models for exploratory analysis of missing data, and selecting variables important for predicting missingness. BRT models can show how values of other variables influence missingness, which may prove useful for researchers. Conclusions: Researchers are encouraged to use CART and BRT models to explore and understand missing data.

  16. Effect of missing data on topological inference using a total evidence...

    • figshare.com
    bin
    Updated May 31, 2023
    Cite
    Thomas Guillerme; Natalie Cooper (2023). Effect of missing data on topological inference using a total evidence approach [Dataset]. http://doi.org/10.6084/m9.figshare.1306861.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Thomas Guillerme; Natalie Cooper
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To fully understand macroevolutionary patterns and processes, we need to include both extant and extinct species in our models. This requires phylogenetic trees with both living and fossil taxa at the tips. One way to infer such phylogenies is the Total Evidence approach which uses molecular data from living taxa and morphological data from living and fossil taxa. Although the Total Evidence approach is very promising, it requires a great deal of data that can be hard to collect. Therefore this method is likely to suffer from missing data issues that may affect its ability to infer correct phylogenies. Here we use simulations to assess the effects of missing data on tree topologies inferred from Total Evidence matrices. We investigate three major factors that directly affect the completeness and the size of the morphological part of the matrix: the proportion of living taxa with no morphological data, the amount of missing data in the fossil record, and the overall number of morphological characters in the matrix. We infer phylogenies from complete matrices and from matrices with various amounts of missing data, and then compare missing data topologies to the "best" tree topology inferred using the complete matrix. We find that the number of living taxa with morphological characters and the overall number of morphological characters in the matrix are more important than the amount of missing data in the fossil record for recovering the "best" tree topology. Therefore, we suggest that sampling effort should be focused on morphological data collection for living species to increase the accuracy of topological inference in a Total Evidence framework. Additionally, we find that Bayesian methods consistently outperform other tree inference methods. We therefore recommend using Bayesian consensus trees to fix the tree topology prior to further analyses.

  17. Fantom.traces Missing Data Example

    • dune.com
    Updated Jun 4, 2023
    Cite
    cryptuschrist (2023). Fantom.traces Missing Data Example [Dataset]. https://dune.com/discover/content/relevant?resource-type=queries&q=code%3A%22fantom.traces%22
    Explore at:
    Dataset updated
    Jun 4, 2023
    Authors
    cryptuschrist
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Blockchain data query: Fantom.traces Missing Data Example

  18. Data from: Sparse Functional Boxplots for Multivariate Curves

    • tandf.figshare.com
    • datasetcatalog.nlm.nih.gov
    bin
    Updated May 31, 2023
    Cite
    Zhuo Qu; Marc G. Genton (2023). Sparse Functional Boxplots for Multivariate Curves [Dataset]. http://doi.org/10.6084/m9.figshare.19617397.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Zhuo Qu; Marc G. Genton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper introduces the sparse functional boxplot and the intensity sparse functional boxplot as practical exploratory tools. Besides being available for complete functional data, they can be used in sparse univariate and multivariate functional data. The sparse functional boxplot, based on the functional boxplot, displays sparseness proportions within the 50% central region. The intensity sparse functional boxplot indicates the relative intensity of fitted sparse point patterns in the central region. The two-stage functional boxplot, which derives from the functional boxplot to detect outliers, is furthermore extended to its sparse form. We also contribute to sparse data fitting improvement and sparse multivariate functional data depth. In a simulation study, we evaluate the goodness of data fitting, several depth proposals for sparse multivariate functional data, and compare the results of outlier detection between the sparse functional boxplot and its two-stage version. The practical applications of the sparse functional boxplot and intensity sparse functional boxplot are illustrated with two public health datasets. Supplementary materials and codes are available for readers to apply our visualization tools and replicate the analysis.

  19. Morpho missing data?

    • dune.com
    Updated Nov 4, 2025
    Cite
    vegarsti (2025). Morpho missing data? [Dataset]. https://dune.com/discover/content/relevant?resource-type=queries&q=code%3A%22morpho_blue_multichain.morphoblue_evt_supply%22
    Explore at:
    Dataset updated
    Nov 4, 2025
    Authors
    vegarsti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Blockchain data query: Morpho missing data?

  20. Fix the Gaps: Data Hospital Simulation

    • kaggle.com
    zip
    Updated Nov 25, 2025
    Cite
    Rajarajeswari P (2025). Fix the Gaps: Data Hospital Simulation [Dataset]. https://www.kaggle.com/datasets/rajarajeswariprr/fix-the-gaps-data-hospital-simulation
    Explore at:
    Available download formats: zip (24673 bytes)
    Dataset updated
    Nov 25, 2025
    Authors
    Rajarajeswari P
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Activity Title: "Fix the Gaps: Data Hospital Simulation" (This activity is created for students to practice techniques to handle missing data)

    Description: Provide each team with a “broken patient record” dataset (incomplete entries with NaNs or blanks). Teams act as data doctors:
    • Diagnose the type of missingness (MCAR, MAR, MNAR)
    • Choose suitable imputation techniques (mean, median, KNN, regression)
    • Compare outcomes from different methods

    Tools: Jupyter notebook / Pandas

    Outcome: Group presentation on the impact of imputation and justification of the method used.
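To make the "compare outcomes from different methods" step concrete, the sketch below fills one gap with three of the techniques named in the activity: mean, median, and a simple nearest-neighbour (KNN) estimate. It is a minimal plain-Python illustration under stated assumptions, not part of the activity materials. The patient records and the `knn_impute` helper are hypothetical, and regression imputation is not shown.

```python
from statistics import mean, median

# Hypothetical patient records: (age, systolic blood pressure).
# None marks the gap to be filled.
records = [
    (25, 110.0),
    (30, 115.0),
    (35, None),
    (60, 140.0),
    (65, 145.0),
    (70, 200.0),
]

observed = [bp for _, bp in records if bp is not None]


def knn_impute(age, records, k=2):
    """Average the target over the k complete records closest in age."""
    complete = [(a, bp) for a, bp in records if bp is not None]
    nearest = sorted(complete, key=lambda rec: abs(rec[0] - age))[:k]
    return mean(bp for _, bp in nearest)


# Candidate fill values for the 35-year-old patient's missing reading.
estimates = {
    "mean": mean(observed),
    "median": median(observed),
    "knn": knn_impute(35, records, k=2),
}
print(estimates)
```

The three estimates differ because the mean is pulled up by the one high reading, the median ignores it, and the KNN estimate uses only patients of a similar age. Which one is defensible depends on the missingness mechanism the team diagnoses (MCAR, MAR, or MNAR).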
