Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
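The core idea the abstract describes, treating missing entries as additional corrupted data and training a denoising autoencoder to minimize reconstruction error on the originally observed entries, can be illustrated with a minimal NumPy sketch. This is a toy single-hidden-layer network for intuition only, not the authors' MIDAS implementation; see their open-source software for the real method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two correlated columns, with ~20% of values missing.
n = 500
x0 = rng.normal(size=n)
X = np.column_stack([x0, 2 * x0 + 0.1 * rng.normal(size=n)])
obs = rng.random(X.shape) > 0.2          # True where a value was observed
X_filled = np.where(obs, X, 0.0)         # initialize missing entries at 0

# One-hidden-layer denoising autoencoder trained by plain gradient descent.
d, h = X.shape[1], 8
W1 = 0.1 * rng.normal(size=(d, h)); b1 = np.zeros(h)
W2 = 0.1 * rng.normal(size=(h, d)); b2 = np.zeros(d)
lr = 0.01
losses = []
for step in range(2000):
    # Corrupt a random subset of entries each step (the "denoising" part).
    drop = rng.random(X.shape) < 0.3
    Z = np.where(drop, 0.0, X_filled)
    H = np.maximum(Z @ W1 + b1, 0.0)     # ReLU hidden layer
    out = H @ W2 + b2
    # Reconstruction loss computed only on originally observed entries.
    err = (out - X_filled) * obs
    losses.append((err ** 2).sum() / obs.sum())
    # Backpropagation.
    dout = 2 * err / obs.sum()
    dW2 = H.T @ dout; db2 = dout.sum(0)
    dH = dout @ W2.T
    dH[H <= 0] = 0.0                     # ReLU gradient
    dW1 = Z.T @ dH; db1 = dH.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# "Impute": run the trained network on the data with missing entries zeroed;
# its outputs at the missing positions are the imputed values.
H = np.maximum(X_filled @ W1 + b1, 0.0)
imputed = np.where(obs, X, H @ W2 + b2)
```

Because the loss is only measured on observed entries, the network learns cross-column structure (here, the strong correlation between the two columns) and uses it to fill the missing positions.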
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Restaurant Sales Dataset with Dirt contains data for 17,534 transactions. The data introduces realistic inconsistencies ("dirt") to simulate real-world scenarios where data may have missing or incomplete information. The dataset includes sales details across multiple categories, such as starters, main dishes, desserts, drinks, and side dishes.
This dataset is suitable for: - Practicing data cleaning tasks, such as handling missing values and deducing missing information. - Conducting exploratory data analysis (EDA) to study restaurant sales patterns. - Feature engineering to create new variables for machine learning tasks.
| Column Name | Description | Example Values |
|---|---|---|
| Order ID | A unique identifier for each order. | ORD_123456 |
| Customer ID | A unique identifier for each customer. | CUST_001 |
| Category | The category of the purchased item. | Main Dishes, Drinks |
| Item | The name of the purchased item. May contain missing values due to data dirt. | Grilled Chicken, None |
| Price | The static price of the item. May contain missing values. | 15.0, None |
| Quantity | The quantity of the purchased item. May contain missing values. | 1, None |
| Order Total | The total price for the order (Price * Quantity). May contain missing values. | 45.0, None |
| Order Date | The date when the order was placed. Always present. | 2022-01-15 |
| Payment Method | The payment method used for the transaction. May contain missing values due to data dirt. | Cash, None |
Data Dirtiness:
- Missing values in key fields (Item, Price, Quantity, Order Total, Payment Method) simulate real-world challenges.
- Some missing values can be deduced from the fields that are present: if Price or Quantity is missing, the other is used together with Order Total to deduce the missing value (e.g., Price = Order Total / Quantity).

Menu Categories and Items:
- Starters: e.g., Chicken Melt, French Fries.
- Main Dishes: e.g., Grilled Chicken, Steak.
- Desserts: e.g., Chocolate Cake, Ice Cream.
- Drinks: e.g., Coca Cola, Water.
- Side Dishes: e.g., Mashed Potatoes, Garlic Bread.

Time Range:
- Orders span from January 1, 2022, to December 31, 2023.
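The Price/Quantity deduction rule described under Data Dirtiness can be sketched with pandas. The small DataFrame below is a stand-in using the schema's column names; in practice `df` would be the loaded dataset.

```python
import pandas as pd
import numpy as np

# Stand-in rows with one missing value each (same column names as the schema).
df = pd.DataFrame({
    "Price":       [15.0, np.nan, 2.5, 6.0],
    "Quantity":    [3, 2, np.nan, 1],
    "Order Total": [np.nan, 40.0, 7.5, 6.0],
})

# If Price or Quantity is missing, deduce it from the other two fields;
# a missing Order Total is recomputed as Price * Quantity.
df["Price"] = df["Price"].fillna(df["Order Total"] / df["Quantity"])
df["Quantity"] = df["Quantity"].fillna(df["Order Total"] / df["Price"])
df["Order Total"] = df["Order Total"].fillna(df["Price"] * df["Quantity"])
```

Each `fillna` only touches rows where that column is missing, so rows with complete values pass through unchanged.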
Handle Missing Values:
- Fill a missing Order Total or Quantity using the formula Order Total = Price * Quantity.
- Deduce a missing Price from Order Total / Quantity if both are available.

Validate Data Consistency:
- Check that recorded order totals match the formula (Order Total = Price * Quantity).

Analyze Missing Patterns:
- Examine which columns contain missing values and how often.
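The consistency check and missing-pattern analysis might look like this in pandas (illustrative rows, not real dataset values):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Item":        ["Steak", None, "Coca Cola"],
    "Price":       [20.0, 5.0, 2.5],
    "Quantity":    [2, 1, 4],
    "Order Total": [40.0, 5.0, 11.0],  # last row is inconsistent (should be 10.0)
})

# Validate consistency: rows where the recorded total disagrees with Price * Quantity.
expected = df["Price"] * df["Quantity"]
inconsistent = df[~np.isclose(df["Order Total"], expected)]

# Analyze missing patterns: count of missing values per column.
missing_per_column = df.isna().sum()
```

Using `np.isclose` rather than `==` avoids flagging rows that differ only by floating-point rounding.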
Menu Price List:

| Category | Item | Price |
|---|---|---|
| Starters | Chicken Melt | 8.0 |
| Starters | French Fries | 4.0 |
| Starters | Cheese Fries | 5.0 |
| Starters | Sweet Potato Fries | 5.0 |
| Starters | Beef Chili | 7.0 |
| Starters | Nachos Grande | 10.0 |
| Main Dishes | Grilled Chicken | 15.0 |
| Main Dishes | Steak | 20.0 |
| Main Dishes | Pasta Alfredo | 12.0 |
| Main Dishes | Salmon | 18.0 |
| Main Dishes | Vegetarian Platter | 14.0 |
| Desserts | Chocolate Cake | 6.0 |
| Desserts | Ice Cream | 5.0 |
| Desserts | Fruit Salad | 4.0 |
| Desserts | Cheesecake | 7.0 |
| Desserts | Brownie | 6.0 |
| Drinks | Coca Cola | 2.5 |
| Drinks | Orange Juice | 3.0 |
| Drinks | ... | ... |
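Because item prices are static, the menu price list can serve as a lookup table for filling a missing Price from the Item name. A sketch using a few of the listed prices:

```python
import pandas as pd
import numpy as np

# Price lookup built from the menu table (subset of the listed items).
menu_prices = {
    "Chicken Melt": 8.0, "French Fries": 4.0, "Grilled Chicken": 15.0,
    "Steak": 20.0, "Chocolate Cake": 6.0, "Coca Cola": 2.5,
}

# Stand-in rows with some missing prices.
df = pd.DataFrame({
    "Item":  ["Steak", "Coca Cola", "French Fries"],
    "Price": [20.0, np.nan, np.nan],
})

# Fill missing prices by mapping Item through the menu lookup.
df["Price"] = df["Price"].fillna(df["Item"].map(menu_prices))
```

The same mapping could be inverted to suggest candidate Item names for rows where only the Price survived.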
The Autoscout dataset, available on Kaggle, provides comprehensive information about vehicles listed for sale. This dataset includes a variety of attributes detailing each vehicle, which is essential for conducting detailed analyses of the automotive market.
In Part 2: Handling Missing Values, the dataset underwent rigorous cleaning to address and resolve missing values across several columns, ensuring that the data is accurate, complete, and ready for reliable analysis.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
The "Vehicle Dataset 2024" provides a comprehensive look at new vehicles available in the market, including SUVs, cars, trucks, and vans. This dataset contains detailed information on various attributes such as make, model, year, price, mileage, and more. With 1002 entries and 18 columns, this dataset is ideal for data science enthusiasts and professionals looking to practice data cleaning, exploratory data analysis (EDA), and predictive modeling.
Given the richness of the data, this dataset can be used for a variety of data science applications, including but not limited to: - Price Prediction: Build models to predict vehicle prices based on features such as make, model, year, and mileage. - Market Analysis: Perform market segmentation and identify trends in vehicle types, brands, and pricing. - Descriptive Statistics: Conduct comprehensive descriptive statistical analyses to summarize and describe the main features of the dataset. - Visualization: Create visualizations to illustrate the distribution of prices, mileage, and other features across different vehicle types. - Data Cleaning: Practice data cleaning techniques, handling missing values, and transforming data for further analysis. - Feature Engineering: Develop new features to improve model performance, such as price per year or mileage per year.
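The feature-engineering idea mentioned above (price per year, mileage per year) might be sketched like this. The column names here are illustrative; the actual dataset's columns may differ.

```python
import pandas as pd

# Stand-in rows with assumed column names (year, price, mileage).
df = pd.DataFrame({
    "year":    [2020, 2018, 2024],
    "price":   [30000.0, 22000.0, 45000.0],
    "mileage": [40000.0, 90000.0, 1500.0],
})

current_year = 2024
# Vehicle age in years; clipped to a minimum of 1 so current-year
# vehicles do not cause division by zero.
age = (current_year - df["year"]).clip(lower=1)
df["price_per_year"] = df["price"] / age
df["mileage_per_year"] = df["mileage"] / age
```

Normalizing price and mileage by age makes vehicles of different model years more directly comparable in a regression model.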
This dataset was ethically mined from cars.com using an API provided by Apify. All data collection practices adhered to the terms of service and privacy policies of the source website, ensuring the ethical use of data.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Unravel the mysteries of Permanent Neonatal Diabetes Mellitus (PNDM) and help doctors diagnose this rare but life-threatening condition earlier with our simulated PNDM prediction dataset. Inspired by real-world medical data and cutting-edge research, this comprehensive dataset includes six features that could help predict PNDM: age at diagnosis, HbA1c levels, genetic information, family history, clinical features, and laboratory data. But beware! Preprocessing the data presents many challenges, including handling missing values, outliers, class imbalance, and scaling and normalization issues. To tackle these challenges, we recommend using the latest data science tools and techniques, including feature selection, imputation, outlier detection, and scaling and normalization methods. Help advance medical research and save lives by exploring the complex world of PNDM with our challenging and exciting dataset.
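A preprocessing pipeline along the lines suggested (imputation, scaling/normalization, and class-imbalance handling) might look like this with scikit-learn. The data below is a synthetic stand-in, and the six features are placeholders, not the dataset's actual columns.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 patients, 6 numeric features, rare positive class.
X = rng.normal(size=(200, 6))
y = (rng.random(200) < 0.1).astype(int)   # ~10% positives (class imbalance)
X[rng.random(X.shape) < 0.05] = np.nan    # ~5% missing values

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),           # missing values
    ("scale", StandardScaler()),                            # scaling/normalization
    ("clf", LogisticRegression(class_weight="balanced")),   # imbalance handling
])
pipeline.fit(X, y)
probs = pipeline.predict_proba(X)[:, 1]
```

Wrapping the steps in a `Pipeline` ensures the imputer and scaler are fitted only on training folds during cross-validation, avoiding leakage.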