5 datasets found
  1. Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 23, 2023
    Cite
    Lall, Ranjit; Robinson, Thomas (2023). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
    Dataset updated
    Nov 23, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Lall, Ranjit; Robinson, Thomas
    Description

    Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
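The idea in the abstract (corrupt the inputs, reconstruct them, score the reconstruction only on originally observed cells, then draw several completed datasets) can be illustrated with a small NumPy sketch. This is not the authors' MIDAS software: the function name `fit_dae_imputer`, the single tanh hidden layer, plain gradient descent, and the choice to reuse random input corruption when drawing imputations are all assumptions made for illustration, and the data are synthetic.

```python
import numpy as np


def fit_dae_imputer(X, hidden=8, corrupt_rate=0.2, epochs=300, lr=0.05, seed=0):
    """Train a one-hidden-layer denoising autoencoder on data containing NaNs.

    Returns a function that produces one completed copy of X per call.
    Illustrative only; hyperparameters are arbitrary assumptions.
    """
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(X)              # True where a value was originally observed
    X0 = np.where(observed, X, 0.0)      # missing cells enter the network as zeros
    n_obs = observed.sum()
    d = X.shape[1]

    W1 = rng.normal(0.0, 0.1, (d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d))
    b2 = np.zeros(d)

    def forward(X_in):
        H = np.tanh(X_in @ W1 + b1)
        return H, H @ W2 + b2

    for _ in range(epochs):
        # Corrupt a random subset of the inputs by zeroing them out.
        keep = rng.random(X.shape) > corrupt_rate
        X_in = X0 * keep
        H, X_hat = forward(X_in)
        # Gradient of the mean squared error over originally observed cells only.
        g_out = 2.0 * (X_hat - X0) * observed / n_obs
        g_hidden = (g_out @ W2.T) * (1.0 - H ** 2)
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X_in.T @ g_hidden)
        b1 -= lr * g_hidden.sum(axis=0)

    def draw():
        # A fresh random corruption makes each call return a different imputation.
        keep = rng.random(X.shape) > corrupt_rate
        _, X_hat = forward(X0 * keep)
        # Observed values are kept; only missing cells take the model's output.
        return np.where(observed, X, X_hat)

    return draw


# Synthetic example: three correlated columns with about 20% of cells missing.
rng = np.random.default_rng(1)
z = rng.normal(size=(200, 1))
X = np.hstack([z, 2 * z, -z]) + 0.1 * rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.2] = np.nan

draw = fit_dae_imputer(X)
imputations = [draw() for _ in range(5)]
```

Each element of `imputations` is a complete copy of `X`; in a multiple-imputation workflow the analysis would be run on every copy and the results combined.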

  2. Restaurant Sales-Dirty Data for Cleaning Training

    • kaggle.com
    Updated Jan 25, 2025
    Cite
    Ahmed Mohamed (2025). Restaurant Sales-Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/restaurant-sales-dirty-data-for-cleaning-training
    Dataset updated
    Jan 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Restaurant Sales Dataset with Dirt Documentation

    Overview

    The Restaurant Sales Dataset with Dirt contains data for 17,534 transactions. The data introduces realistic inconsistencies ("dirt") to simulate real-world scenarios where data may have missing or incomplete information. The dataset includes sales details across multiple categories, such as starters, main dishes, desserts, drinks, and side dishes.

    Dataset Use Cases

    This dataset is suitable for:

      • Practicing data cleaning tasks, such as handling missing values and deducing missing information.
      • Conducting exploratory data analysis (EDA) to study restaurant sales patterns.
      • Feature engineering to create new variables for machine learning tasks.

    Columns Description

    | Column Name | Description | Example Values |
    |---|---|---|
    | Order ID | A unique identifier for each order. | ORD_123456 |
    | Customer ID | A unique identifier for each customer. | CUST_001 |
    | Category | The category of the purchased item. | Main Dishes, Drinks |
    | Item | The name of the purchased item. May contain missing values due to data dirt. | Grilled Chicken, None |
    | Price | The static price of the item. May contain missing values. | 15.0, None |
    | Quantity | The quantity of the purchased item. May contain missing values. | 1, None |
    | Order Total | The total price for the order (Price * Quantity). May contain missing values. | 45.0, None |
    | Order Date | The date when the order was placed. Always present. | 2022-01-15 |
    | Payment Method | The payment method used for the transaction. May contain missing values due to data dirt. | Cash, None |

    Key Characteristics

    1. Data Dirtiness:

      • Missing values in key columns (Item, Price, Quantity, Order Total, Payment Method) simulate real-world challenges.
      • At least one of the following conditions is ensured for each record to identify an item:
        • Item is present.
        • Price is present.
        • Both Quantity and Order Total are present.
      • If Price or Quantity is missing, the other is used to deduce the missing value (e.g., Order Total / Quantity).
    2. Menu Categories and Items:

      • Items are divided into five categories:
        • Starters: E.g., Chicken Melt, French Fries.
        • Main Dishes: E.g., Grilled Chicken, Steak.
        • Desserts: E.g., Chocolate Cake, Ice Cream.
        • Drinks: E.g., Coca Cola, Water.
        • Side Dishes: E.g., Mashed Potatoes, Garlic Bread.

    3. Time Range:

      • Orders span from January 1, 2022, to December 31, 2023.

    Cleaning Suggestions

    1. Handle Missing Values:

      • Fill missing Order Total or Quantity using the formula: Order Total = Price * Quantity.
      • Deduce missing Price from Order Total / Quantity if both are available.
    2. Validate Data Consistency:

      • Ensure that calculated values (Order Total = Price * Quantity) match.
    3. Analyze Missing Patterns:

      • Study the distribution of missing values across categories and payment methods.
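The first two cleaning suggestions can be sketched in pandas as follows. The column names come from the Columns Description above; the function names `fill_order_fields` and `inconsistent_rows` and the four demo rows are made up for illustration, and the sketch assumes Price and Quantity are never zero.

```python
import numpy as np
import pandas as pd


def fill_order_fields(df: pd.DataFrame) -> pd.DataFrame:
    """Fill Price, Quantity, and Order Total from one another where two of three exist."""
    out = df.copy()
    price, qty, total = df["Price"], df["Quantity"], df["Order Total"]
    out["Order Total"] = total.fillna(price * qty)   # Order Total = Price * Quantity
    out["Quantity"] = qty.fillna(total / price)      # Quantity = Order Total / Price
    out["Price"] = price.fillna(total / qty)         # Price = Order Total / Quantity
    return out


def inconsistent_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return complete rows where Order Total does not equal Price * Quantity."""
    complete = df[["Price", "Quantity", "Order Total"]].notna().all(axis=1)
    mismatch = ~np.isclose(df["Price"] * df["Quantity"], df["Order Total"])
    return df[complete & mismatch]


# Hypothetical demo rows, not taken from the dataset.
demo = pd.DataFrame(
    {
        "Price": [15.0, np.nan, 4.0, 20.0],
        "Quantity": [3.0, 2.0, np.nan, 1.0],
        "Order Total": [np.nan, 16.0, 12.0, 25.0],
    }
)
cleaned = fill_order_fields(demo)
suspect = inconsistent_rows(cleaned)
```

Rows with two or more of the three fields missing stay incomplete after `fill_order_fields`; those need another source of information, such as the menu map.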

    Menu Map with Prices and Categories

    | Category | Item | Price |
    |---|---|---|
    | Starters | Chicken Melt | 8.0 |
    | Starters | French Fries | 4.0 |
    | Starters | Cheese Fries | 5.0 |
    | Starters | Sweet Potato Fries | 5.0 |
    | Starters | Beef Chili | 7.0 |
    | Starters | Nachos Grande | 10.0 |
    | Main Dishes | Grilled Chicken | 15.0 |
    | Main Dishes | Steak | 20.0 |
    | Main Dishes | Pasta Alfredo | 12.0 |
    | Main Dishes | Salmon | 18.0 |
    | Main Dishes | Vegetarian Platter | 14.0 |
    | Desserts | Chocolate Cake | 6.0 |
    | Desserts | Ice Cream | 5.0 |
    | Desserts | Fruit Salad | 4.0 |
    | Desserts | Cheesecake | 7.0 |
    | Desserts | Brownie | 6.0 |
    | Drinks | Coca Cola | 2.5 |
    | Drinks | Orange Juice | 3.0 |
    Drinks ...
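Because every record is said to carry an Item, a Price, or both Quantity and Order Total, a missing Item can sometimes be recovered from Category and Price using the menu map. The sketch below uses only the rows visible in this listing, which is cut off partway through Drinks, so the lookup is partial and any Drinks match should be treated as provisional. The names `MENU`, `build_price_lookup`, and `deduce_item` are hypothetical.

```python
from collections import defaultdict

# Rows copied from the visible part of the menu map (the listing is truncated).
MENU = [
    ("Starters", "Chicken Melt", 8.0),
    ("Starters", "French Fries", 4.0),
    ("Starters", "Cheese Fries", 5.0),
    ("Starters", "Sweet Potato Fries", 5.0),
    ("Starters", "Beef Chili", 7.0),
    ("Starters", "Nachos Grande", 10.0),
    ("Main Dishes", "Grilled Chicken", 15.0),
    ("Main Dishes", "Steak", 20.0),
    ("Main Dishes", "Pasta Alfredo", 12.0),
    ("Main Dishes", "Salmon", 18.0),
    ("Main Dishes", "Vegetarian Platter", 14.0),
    ("Desserts", "Chocolate Cake", 6.0),
    ("Desserts", "Ice Cream", 5.0),
    ("Desserts", "Fruit Salad", 4.0),
    ("Desserts", "Cheesecake", 7.0),
    ("Desserts", "Brownie", 6.0),
    ("Drinks", "Coca Cola", 2.5),
    ("Drinks", "Orange Juice", 3.0),
]


def build_price_lookup(menu):
    """Map (category, price) to an item name, keeping only unambiguous pairs."""
    candidates = defaultdict(list)
    for category, item, price in menu:
        candidates[(category, price)].append(item)
    return {key: items[0] for key, items in candidates.items() if len(items) == 1}


PRICE_TO_ITEM = build_price_lookup(MENU)


def deduce_item(category, price):
    """Return the item for a (category, price) pair, or None if unknown or ambiguous."""
    return PRICE_TO_ITEM.get((category, price))
```

For example, `deduce_item("Main Dishes", 15.0)` resolves to Grilled Chicken, while `deduce_item("Starters", 5.0)` returns `None` because Cheese Fries and Sweet Potato Fries share that price.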
  3. Autoscout Auto Listings: Complete Market Data - 3

    • kaggle.com
    Updated Jun 26, 2023
    Cite
    Huseyin Cenik (2023). Autoscout Auto Listings: Complete Market Data - 3 [Dataset]. https://www.kaggle.com/datasets/huseyincenik/capstone-part-2-finalcsv
    Dataset updated
    Jun 26, 2023
    Dataset provided by
    Kaggle
    Authors
    Huseyin Cenik
    Description


    🚗 About Autoscout Dataset and Handling Missing Values Section 🧹

    The Autoscout dataset, available on Kaggle, provides comprehensive information about vehicles listed for sale. This dataset includes a variety of attributes detailing each vehicle, which is essential for conducting detailed analyses of the automotive market.

    Part 2: Handling Missing Values

    In Part 2: Handling Missing Values, the dataset has undergone rigorous cleaning to address and resolve missing values across several columns. This cleaning process ensures that the data is accurate, complete, and ready for analysis.

    Data Fields:

    • make_model: Brand and model of the vehicle.
    • short_description: Brief description of the vehicle.
    • make: Brand or manufacturer of the vehicle.
    • model: Model name of the vehicle.
    • location: Geographical location of the vehicle.
    • price: Price of the vehicle.
    • body_type: Body type or style of the vehicle.
    • type: Type of the vehicle.
    • doors: Number of doors in the vehicle.
    • country_version: Country version of the vehicle.
    • offer_number: Offer number associated with the listing.
    • warranty: Warranty status of the vehicle.
    • mileage: Mileage or distance traveled by the vehicle.
    • first_registration: Date of the vehicle's first registration.
    • gearbox: Type of gearbox or transmission.
    • fuel_type: Fuel type used by the vehicle.
    • colour: Color of the vehicle.
    • paint: Type of paint used on the vehicle.
    • desc: Detailed description of the vehicle.
    • seller: Seller of the vehicle.
    • seats: Number of seats in the vehicle.
    • power: Engine power of the vehicle.
    • engine_size: Engine size of the vehicle.
    • gears: Number of gears in the vehicle.
    • co_emissions: CO₂ emissions of the vehicle.
    • manufacturer_colour: Manufacturer's designated color for the vehicle.
    • drivetrain: Type of drivetrain in the vehicle.
    • cylinders: Number of cylinders in the engine.
    • fuel_consumption: Fuel consumption of the vehicle.
    • comfort_&_convenience: Comfort and convenience features.
    • entertainment_&_media: Entertainment and media features.
    • safety_&_security: Safety and security features.
    • extras: Additional or extra features.
    • empty_weight: Empty weight of the vehicle.
    • model_code: Model code of the vehicle.
    • general_inspection: General inspection status.
    • last_service: Date of the last service.
    • full_service_history: Full service history status.
    • non_smoker_vehicle: Non-smoker vehicle status.
    • emission_class: Emission class of the vehicle.
    • emissions_sticker: Emissions sticker status.
    • upholstery_colour: Upholstery color.
    • upholstery: Type of upholstery.
    • production_date: Production date of the vehicle.
    • previous_owner: Previous owner information.
    • other_fuel_types: Other compatible fuel types.
    • power_consumption: Power consumption of the vehicle.
    • energy_efficiency_class: Energy efficiency class.
    • co_efficiency: CO₂ efficiency.
    • fuel_consumption_wltp: WLTP fuel consumption.
    • co_emissions_wltp: WLTP CO₂ emissions.
    • available_from: Availability date of the vehicle.
    • taxi_or_rental_car: Whether the vehicle was used as a taxi or rental car.
    • availability: Availability status.
    • last_timing_belt_change: Date of the last timing belt change.
    • electric_range_wltp: WLTP electric range.
    • power_consumption_wltp: WLTP power consumption.
    • battery_ownership: Battery ownership status in electric vehicles.

    This cleaning process is crucial for ensuring the dataset's quality and reliability, facilitating accurate analysis and insights.

  4. Vehicle Dataset 2024

    • kaggle.com
    Updated May 29, 2024
    Cite
    Kanchana1990 (2024). Vehicle Dataset 2024 [Dataset]. http://doi.org/10.34740/kaggle/dsv/8553155
    Dataset updated
    May 29, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kanchana1990
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Dataset Overview

    The "Vehicle Dataset 2024" provides a comprehensive look at new vehicles available in the market, including SUVs, cars, trucks, and vans. This dataset contains detailed information on various attributes such as make, model, year, price, mileage, and more. With 1002 entries and 18 columns, this dataset is ideal for data science enthusiasts and professionals looking to practice data cleaning, exploratory data analysis (EDA), and predictive modeling.

    Data Science Applications

    Given the richness of the data, this dataset can be used for a variety of data science applications, including but not limited to:

    • Price Prediction: Build models to predict vehicle prices based on features such as make, model, year, and mileage.
    • Market Analysis: Perform market segmentation and identify trends in vehicle types, brands, and pricing.
    • Descriptive Statistics: Conduct comprehensive descriptive statistical analyses to summarize and describe the main features of the dataset.
    • Visualization: Create visualizations to illustrate the distribution of prices, mileage, and other features across different vehicle types.
    • Data Cleaning: Practice data cleaning techniques, handling missing values, and transforming data for further analysis.
    • Feature Engineering: Develop new features to improve model performance, such as price per year or mileage per year.

    Column Descriptors

    1. name: The full name of the vehicle, including make, model, and trim.
    2. description: A brief description of the vehicle, often including key features and selling points.
    3. make: The manufacturer of the vehicle (e.g., Ford, Toyota, BMW).
    4. model: The model name of the vehicle.
    5. type: The type of the vehicle, which is "New" for all entries in this dataset.
    6. year: The year the vehicle was manufactured.
    7. price: The price of the vehicle in USD.
    8. engine: Details about the engine, including type and specifications.
    9. cylinders: The number of cylinders in the vehicle's engine.
    10. fuel: The type of fuel used by the vehicle (e.g., Gasoline, Diesel, Electric).
    11. mileage: The mileage of the vehicle, typically in miles.
    12. transmission: The type of transmission (e.g., Automatic, Manual).
    13. trim: The trim level of the vehicle, indicating different feature sets or packages.
    14. body: The body style of the vehicle (e.g., SUV, Sedan, Pickup Truck).
    15. doors: The number of doors on the vehicle.
    16. exterior_color: The exterior color of the vehicle.
    17. interior_color: The interior color of the vehicle.
    18. drivetrain: The drivetrain of the vehicle (e.g., All-wheel Drive, Front-wheel Drive).

    Ethically Mined Data

    This dataset was ethically mined from cars.com using an API provided by Apify. All data collection practices adhered to the terms of service and privacy policies of the source website, ensuring the ethical use of data.

    Acknowledgements

    • Apify: For providing the API used to scrape the data from cars.com.
    • Cars.com: For being the source of the vehicle data.
    • DALL-E 3: For generating the thumbnail image for this dataset.
  5. PNDM Prediction Dataset

    • kaggle.com
    Updated Apr 8, 2023
    Cite
    Salem S. (2023). PNDM Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/slmsshk/pndm-prediction-dataset
    Dataset updated
    Apr 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Salem S.
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Unravel the mysteries of Permanent Neonatal Diabetes Mellitus (PNDM) and help doctors diagnose this rare but life-threatening condition earlier with our simulated PNDM prediction dataset. Inspired by real-world medical data and cutting-edge research, this comprehensive dataset includes six features that could help predict PNDM: age at diagnosis, HbA1c levels, genetic information, family history, clinical features, and laboratory data. But beware! Preprocessing the data presents many challenges, including handling missing values, outliers, class imbalance, and scaling and normalization issues. To tackle these challenges, we recommend using the latest data science tools and techniques, including feature selection, imputation, outlier detection, and scaling and normalization methods. Help advance medical research and save lives by exploring the complex world of PNDM with our challenging and exciting dataset.
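Three of the preprocessing steps named above (imputation, outlier handling, and scaling) can be sketched in pandas for a numeric feature. The dataset's actual column names are not given in this listing, so `hba1c` and `age_at_diagnosis` below are hypothetical, as are the demo values and the function name `preprocess_numeric`. Median imputation, clipping at 1.5 × IQR, and z-score standardization are one common set of choices, not the dataset author's prescribed method; class imbalance is not addressed here.

```python
import numpy as np
import pandas as pd


def preprocess_numeric(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Impute, clip outliers, and standardize the given numeric columns."""
    out = df.copy()
    for col in columns:
        s = out[col].astype(float)
        # 1. Imputation: replace missing values with the column median.
        s = s.fillna(s.median())
        # 2. Outliers: clip values outside the 1.5 * IQR fences.
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        s = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
        # 3. Scaling: standardize to zero mean and unit variance.
        std = s.std(ddof=0)
        out[col] = (s - s.mean()) / std if std > 0 else 0.0
    return out


# Hypothetical demo values, not taken from the dataset.
demo = pd.DataFrame(
    {
        "hba1c": [6.5, 7.0, np.nan, 7.5, 30.0, 6.8],
        "age_at_diagnosis": [2.0, 3.0, 1.0, np.nan, 4.0, 2.5],
    }
)
prepared = preprocess_numeric(demo, ["hba1c", "age_at_diagnosis"])
```

In a modeling workflow the median, fences, mean, and standard deviation would be estimated on the training split only and then applied to the test split.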
