Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
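The core idea the abstract describes, treating missing entries as additional corrupted data and training a denoising autoencoder to minimize reconstruction error on the originally observed entries, can be illustrated with a minimal NumPy sketch. This is a toy single-hidden-layer network for intuition only, not the authors' MIDAS implementation; see their open-source software for the real method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two correlated columns, with ~20% of values missing.
n = 500
x0 = rng.normal(size=n)
X = np.column_stack([x0, 2 * x0 + 0.1 * rng.normal(size=n)])
obs = rng.random(X.shape) > 0.2          # True where a value was observed
X_filled = np.where(obs, X, 0.0)         # initialize missing entries at 0

# One-hidden-layer denoising autoencoder trained by plain gradient descent.
d, h = X.shape[1], 8
W1 = 0.1 * rng.normal(size=(d, h)); b1 = np.zeros(h)
W2 = 0.1 * rng.normal(size=(h, d)); b2 = np.zeros(d)
lr = 0.01
losses = []
for step in range(2000):
    # Corrupt a random subset of entries each step (the "denoising" part).
    drop = rng.random(X.shape) < 0.3
    Z = np.where(drop, 0.0, X_filled)
    H = np.maximum(Z @ W1 + b1, 0.0)     # ReLU hidden layer
    out = H @ W2 + b2
    # Reconstruction loss computed only on originally observed entries.
    err = (out - X_filled) * obs
    losses.append((err ** 2).sum() / obs.sum())
    # Backpropagation.
    dout = 2 * err / obs.sum()
    dW2 = H.T @ dout; db2 = dout.sum(0)
    dH = dout @ W2.T
    dH[H <= 0] = 0.0                     # ReLU gradient
    dW1 = Z.T @ dH; db1 = dH.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# "Impute": run the trained network on the data with missing entries zeroed;
# its outputs at the missing positions are the imputed values.
H = np.maximum(X_filled @ W1 + b1, 0.0)
imputed = np.where(obs, X, H @ W2 + b2)
```

Because the loss is only measured on observed entries, the network learns cross-column structure (here, the strong correlation between the two columns) and uses it to fill the missing positions.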
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Restaurant Sales Dataset with Dirt contains data for 17,534 transactions. The data introduces realistic inconsistencies ("dirt") to simulate real-world scenarios where data may have missing or incomplete information. The dataset includes sales details across multiple categories, such as starters, main dishes, desserts, drinks, and side dishes.
This dataset is suitable for: - Practicing data cleaning tasks, such as handling missing values and deducing missing information. - Conducting exploratory data analysis (EDA) to study restaurant sales patterns. - Feature engineering to create new variables for machine learning tasks.
| Column Name | Description | Example Values |
|---|---|---|
| Order ID | A unique identifier for each order. | ORD_123456 |
| Customer ID | A unique identifier for each customer. | CUST_001 |
| Category | The category of the purchased item. | Main Dishes, Drinks |
| Item | The name of the purchased item. May contain missing values due to data dirt. | Grilled Chicken, None |
| Price | The static price of the item. May contain missing values. | 15.0, None |
| Quantity | The quantity of the purchased item. May contain missing values. | 1, None |
| Order Total | The total price for the order (Price * Quantity). May contain missing values. | 45.0, None |
| Order Date | The date when the order was placed. Always present. | 2022-01-15 |
| Payment Method | The payment method used for the transaction. May contain missing values due to data dirt. | Cash, None |
Data Dirtiness:
- Missing values in key fields (Item, Price, Quantity, Order Total, Payment Method) simulate real-world challenges.
- Some missing values can be deduced from the fields that are present: if Price or Quantity is missing, the other is used together with Order Total to deduce the missing value (e.g., Price = Order Total / Quantity).

Menu Categories and Items:
- Starters: e.g., Chicken Melt, French Fries.
- Main Dishes: e.g., Grilled Chicken, Steak.
- Desserts: e.g., Chocolate Cake, Ice Cream.
- Drinks: e.g., Coca Cola, Water.
- Side Dishes: e.g., Mashed Potatoes, Garlic Bread.

Time Range:
- Orders span from January 1, 2022, to December 31, 2023.
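The Price/Quantity deduction rule described under Data Dirtiness can be sketched with pandas. The small DataFrame below is a stand-in using the schema's column names; in practice `df` would be the loaded dataset.

```python
import pandas as pd
import numpy as np

# Stand-in rows with one missing value each (same column names as the schema).
df = pd.DataFrame({
    "Price":       [15.0, np.nan, 2.5, 6.0],
    "Quantity":    [3, 2, np.nan, 1],
    "Order Total": [np.nan, 40.0, 7.5, 6.0],
})

# If Price or Quantity is missing, deduce it from the other two fields;
# a missing Order Total is recomputed as Price * Quantity.
df["Price"] = df["Price"].fillna(df["Order Total"] / df["Quantity"])
df["Quantity"] = df["Quantity"].fillna(df["Order Total"] / df["Price"])
df["Order Total"] = df["Order Total"].fillna(df["Price"] * df["Quantity"])
```

Each `fillna` only touches rows where that column is missing, so rows with complete values pass through unchanged.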
Handle Missing Values:
- Fill a missing Order Total or Quantity using the formula Order Total = Price * Quantity.
- Deduce a missing Price from Order Total / Quantity if both are available.

Validate Data Consistency:
- Check that recorded order totals match the formula (Order Total = Price * Quantity).

Analyze Missing Patterns:
- Examine which columns contain missing values and how often.
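The consistency check and missing-pattern analysis might look like this in pandas (illustrative rows, not real dataset values):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Item":        ["Steak", None, "Coca Cola"],
    "Price":       [20.0, 5.0, 2.5],
    "Quantity":    [2, 1, 4],
    "Order Total": [40.0, 5.0, 11.0],  # last row is inconsistent (should be 10.0)
})

# Validate consistency: rows where the recorded total disagrees with Price * Quantity.
expected = df["Price"] * df["Quantity"]
inconsistent = df[~np.isclose(df["Order Total"], expected)]

# Analyze missing patterns: count of missing values per column.
missing_per_column = df.isna().sum()
```

Using `np.isclose` rather than `==` avoids flagging rows that differ only by floating-point rounding.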
Menu Price List:

| Category | Item | Price |
|---|---|---|
| Starters | Chicken Melt | 8.0 |
| Starters | French Fries | 4.0 |
| Starters | Cheese Fries | 5.0 |
| Starters | Sweet Potato Fries | 5.0 |
| Starters | Beef Chili | 7.0 |
| Starters | Nachos Grande | 10.0 |
| Main Dishes | Grilled Chicken | 15.0 |
| Main Dishes | Steak | 20.0 |
| Main Dishes | Pasta Alfredo | 12.0 |
| Main Dishes | Salmon | 18.0 |
| Main Dishes | Vegetarian Platter | 14.0 |
| Desserts | Chocolate Cake | 6.0 |
| Desserts | Ice Cream | 5.0 |
| Desserts | Fruit Salad | 4.0 |
| Desserts | Cheesecake | 7.0 |
| Desserts | Brownie | 6.0 |
| Drinks | Coca Cola | 2.5 |
| Drinks | Orange Juice | 3.0 |
| Drinks | ... | ... |
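Because item prices are static, the menu price list can serve as a lookup table for filling a missing Price from the Item name. A sketch using a few of the listed prices:

```python
import pandas as pd
import numpy as np

# Price lookup built from the menu table (subset of the listed items).
menu_prices = {
    "Chicken Melt": 8.0, "French Fries": 4.0, "Grilled Chicken": 15.0,
    "Steak": 20.0, "Chocolate Cake": 6.0, "Coca Cola": 2.5,
}

# Stand-in rows with some missing prices.
df = pd.DataFrame({
    "Item":  ["Steak", "Coca Cola", "French Fries"],
    "Price": [20.0, np.nan, np.nan],
})

# Fill missing prices by mapping Item through the menu lookup.
df["Price"] = df["Price"].fillna(df["Item"].map(menu_prices))
```

The same mapping could be inverted to suggest candidate Item names for rows where only the Price survived.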
The Autoscout dataset, available on Kaggle, provides comprehensive information about vehicles listed for sale. This dataset includes a variety of attributes detailing each vehicle, which is essential for conducting detailed analyses of the automotive market.
In Part 2: Handling Missing Values, the dataset underwent rigorous cleaning to address and resolve missing values across several columns, ensuring that the data is accurate, complete, and ready for reliable analysis.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
The "Vehicle Dataset 2024" provides a comprehensive look at new vehicles available in the market, including SUVs, cars, trucks, and vans. This dataset contains detailed information on various attributes such as make, model, year, price, mileage, and more. With 1002 entries and 18 columns, this dataset is ideal for data science enthusiasts and professionals looking to practice data cleaning, exploratory data analysis (EDA), and predictive modeling.
Given the richness of the data, this dataset can be used for a variety of data science applications, including but not limited to: - Price Prediction: Build models to predict vehicle prices based on features such as make, model, year, and mileage. - Market Analysis: Perform market segmentation and identify trends in vehicle types, brands, and pricing. - Descriptive Statistics: Conduct comprehensive descriptive statistical analyses to summarize and describe the main features of the dataset. - Visualization: Create visualizations to illustrate the distribution of prices, mileage, and other features across different vehicle types. - Data Cleaning: Practice data cleaning techniques, handling missing values, and transforming data for further analysis. - Feature Engineering: Develop new features to improve model performance, such as price per year or mileage per year.
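The feature-engineering idea mentioned above (price per year, mileage per year) might be sketched like this. The column names here are illustrative; the actual dataset's columns may differ.

```python
import pandas as pd

# Stand-in rows with assumed column names (year, price, mileage).
df = pd.DataFrame({
    "year":    [2020, 2018, 2024],
    "price":   [30000.0, 22000.0, 45000.0],
    "mileage": [40000.0, 90000.0, 1500.0],
})

current_year = 2024
# Vehicle age in years; clipped to a minimum of 1 so current-year
# vehicles do not cause division by zero.
age = (current_year - df["year"]).clip(lower=1)
df["price_per_year"] = df["price"] / age
df["mileage_per_year"] = df["mileage"] / age
```

Normalizing price and mileage by age makes vehicles of different model years more directly comparable in a regression model.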
This dataset was ethically mined from cars.com using an API provided by Apify. All data collection practices adhered to the terms of service and privacy policies of the source website, ensuring the ethical use of data.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Unravel the mysteries of Permanent Neonatal Diabetes Mellitus (PNDM) and help doctors diagnose this rare but life-threatening condition earlier with our simulated PNDM prediction dataset. Inspired by real-world medical data and cutting-edge research, this comprehensive dataset includes six features that could help predict PNDM: age at diagnosis, HbA1c levels, genetic information, family history, clinical features, and laboratory data. But beware! Preprocessing the data presents many challenges, including handling missing values, outliers, class imbalance, and scaling and normalization issues. To tackle these challenges, we recommend using the latest data science tools and techniques, including feature selection, imputation, outlier detection, and scaling and normalization methods. Help advance medical research and save lives by exploring the complex world of PNDM with our challenging and exciting dataset.
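A preprocessing pipeline along the lines suggested (imputation, scaling/normalization, and class-imbalance handling) might look like this with scikit-learn. The data below is a synthetic stand-in, and the six features are placeholders, not the dataset's actual columns.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 patients, 6 numeric features, rare positive class.
X = rng.normal(size=(200, 6))
y = (rng.random(200) < 0.1).astype(int)   # ~10% positives (class imbalance)
X[rng.random(X.shape) < 0.05] = np.nan    # ~5% missing values

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),           # missing values
    ("scale", StandardScaler()),                            # scaling/normalization
    ("clf", LogisticRegression(class_weight="balanced")),   # imbalance handling
])
pipeline.fit(X, y)
probs = pipeline.predict_proba(X)[:, 1]
```

Wrapping the steps in a `Pipeline` ensures the imputer and scaler are fitted only on training folds during cross-validation, avoiding leakage.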