20 datasets found
  1. Restaurant Sales-Dirty Data for Cleaning Training

    • kaggle.com
    Updated Jan 25, 2025
    Cite
    Ahmed Mohamed (2025). Restaurant Sales-Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/restaurant-sales-dirty-data-for-cleaning-training
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Restaurant Sales Dataset with Dirt Documentation

    Overview

    The Restaurant Sales Dataset with Dirt contains data for 17,534 transactions. The data introduces realistic inconsistencies ("dirt") to simulate real-world scenarios where data may have missing or incomplete information. The dataset includes sales details across multiple categories, such as starters, main dishes, desserts, drinks, and side dishes.

    Dataset Use Cases

    This dataset is suitable for:

      • Practicing data cleaning tasks, such as handling missing values and deducing missing information.
      • Conducting exploratory data analysis (EDA) to study restaurant sales patterns.
      • Feature engineering to create new variables for machine learning tasks.

    Columns Description

    Column Name | Description | Example Values
    Order ID | A unique identifier for each order. | ORD_123456
    Customer ID | A unique identifier for each customer. | CUST_001
    Category | The category of the purchased item. | Main Dishes, Drinks
    Item | The name of the purchased item. May contain missing values due to data dirt. | Grilled Chicken, None
    Price | The static price of the item. May contain missing values. | 15.0, None
    Quantity | The quantity of the purchased item. May contain missing values. | 1, None
    Order Total | The total price for the order (Price * Quantity). May contain missing values. | 45.0, None
    Order Date | The date when the order was placed. Always present. | 2022-01-15
    Payment Method | The payment method used for the transaction. May contain missing values due to data dirt. | Cash, None

    Key Characteristics

    1. Data Dirtiness:

      • Missing values in key columns (Item, Price, Quantity, Order Total, Payment Method) simulate real-world challenges.
      • Each record is guaranteed to satisfy at least one of the following conditions, so that the item can always be identified:
        • Item is present.
        • Price is present.
        • Both Quantity and Order Total are present.
      • If Price or Quantity is missing, the missing value can be deduced from the remaining fields (e.g., Price = Order Total / Quantity).
    2. Menu Categories and Items:

      • Items are divided into five categories:
        • Starters: E.g., Chicken Melt, French Fries.
        • Main Dishes: E.g., Grilled Chicken, Steak.
        • Desserts: E.g., Chocolate Cake, Ice Cream.
        • Drinks: E.g., Coca Cola, Water.
        • Side Dishes: E.g., Mashed Potatoes, Garlic Bread.

    3. Time Range:

      • Orders span from January 1, 2022, to December 31, 2023.

    Cleaning Suggestions

    1. Handle Missing Values:

      • Fill missing Order Total or Quantity using the formula: Order Total = Price * Quantity.
      • Deduce missing Price from Order Total / Quantity if both are available.
    2. Validate Data Consistency:

      • Ensure that recorded Order Total values are consistent with Price * Quantity (see the R sketch after this list).
    3. Analyze Missing Patterns:

      • Study the distribution of missing values across categories and payment methods.
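
    A minimal R sketch of the deduction and validation steps above, assuming the download is a single file named restaurant_sales.csv with the column names from the Columns Description table, and that missing values appear as empty cells or the string "None" (all of these are assumptions; adjust to the actual file):

    library(dplyr)

    # Read the data, treating "None" and empty cells as missing (assumption)
    sales <- read.csv("restaurant_sales.csv", check.names = FALSE,
                      na.strings = c("", "None"))

    sales <- sales %>%
      mutate(
        # Deduce a missing Price from Order Total / Quantity
        Price = ifelse(is.na(Price) & !is.na(`Order Total`) & !is.na(Quantity),
                       `Order Total` / Quantity, Price),
        # Deduce a missing Quantity from Order Total / Price
        Quantity = ifelse(is.na(Quantity) & !is.na(`Order Total`) & !is.na(Price),
                          `Order Total` / Price, Quantity),
        # Fill a missing Order Total as Price * Quantity
        `Order Total` = ifelse(is.na(`Order Total`) & !is.na(Price) & !is.na(Quantity),
                               Price * Quantity, `Order Total`)
      )

    # Consistency check: recorded totals should equal Price * Quantity
    inconsistent <- filter(sales, !is.na(Price), !is.na(Quantity), !is.na(`Order Total`),
                           abs(`Order Total` - Price * Quantity) > 1e-6)
    nrow(inconsistent)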

    Menu Map with Prices and Categories

    Category | Item | Price
    Starters | Chicken Melt | 8.0
    Starters | French Fries | 4.0
    Starters | Cheese Fries | 5.0
    Starters | Sweet Potato Fries | 5.0
    Starters | Beef Chili | 7.0
    Starters | Nachos Grande | 10.0
    Main Dishes | Grilled Chicken | 15.0
    Main Dishes | Steak | 20.0
    Main Dishes | Pasta Alfredo | 12.0
    Main Dishes | Salmon | 18.0
    Main Dishes | Vegetarian Platter | 14.0
    Desserts | Chocolate Cake | 6.0
    Desserts | Ice Cream | 5.0
    Desserts | Fruit Salad | 4.0
    Desserts | Cheesecake | 7.0
    Desserts | Brownie | 6.0
    Drinks | Coca Cola | 2.5
    Drinks | Orange Juice | 3.0
    Drinks | ...
  2. Initial data analysis checklist for data screening in longitudinal studies.

    • plos.figshare.com
    xls
    Updated May 29, 2024
    Cite
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Initial data analysis checklist for data screening in longitudinal studies. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t001
    Explore at:
    xls (available download formats)
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Initial data analysis checklist for data screening in longitudinal studies.

  3. Mobile Location Data | Asia | +300M Unique Devices | +100M Daily Users |...

    • datarade.ai
    .json, .csv, .xls
    Updated Mar 21, 2025
    Cite
    Quadrant (2025). Mobile Location Data | Asia | +300M Unique Devices | +100M Daily Users | +200B Events / Month [Dataset]. https://datarade.ai/data-products/mobile-location-data-asia-300m-unique-devices-100m-da-quadrant
    Explore at:
    .json, .csv, .xls (available download formats)
    Dataset updated
    Mar 21, 2025
    Dataset authored and provided by
    Quadrant
    Area covered
    Asia, Oman, Israel, Palestine, Iran (Islamic Republic of), Korea (Democratic People's Republic of), Georgia, Armenia, Bahrain, Kyrgyzstan, Philippines
    Description

    Quadrant provides insightful, accurate, and reliable mobile location data.

    Our privacy-first mobile location data unveils hidden patterns and opportunities, provides actionable insights, and fuels data-driven decision-making at the world's biggest companies.

    These companies rely on our privacy-first Mobile Location and Points-of-Interest Data to unveil hidden patterns and opportunities, provide actionable insights, and fuel data-driven decision-making. They build better AI models, uncover business insights, and enable location-based services using our robust and reliable real-world data.

    We conduct stringent evaluations of data providers to ensure authenticity and quality. Our proprietary algorithms detect and cleanse corrupted and duplicated data points – allowing you to leverage our datasets rapidly with minimal processing or cleaning. During the ingestion process, our proprietary Data Filtering Algorithms remove events based on a number of qualitative factors as well as latency and other integrity variables, providing more efficient data delivery. The deduplicating algorithm focuses on a combination of four important attributes: Device ID, Latitude, Longitude, and Timestamp. It scours our data and identifies rows that contain the same combination of these four attributes; post-identification, it retains a single copy and eliminates the duplicates to ensure our customers only receive complete and unique datasets (see the R sketch below).
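
    The deduplication described above amounts to keeping one row per (Device ID, Latitude, Longitude, Timestamp) combination. A minimal R sketch, assuming a data frame events with columns device_id, latitude, longitude, and timestamp (the column names are assumptions, not Quadrant's documented schema):

    library(dplyr)

    # Retain a single copy of each four-attribute combination; drop the rest
    events_unique <- distinct(events, device_id, latitude, longitude, timestamp,
                              .keep_all = TRUE)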

    We actively identify overlapping values at the provider level to determine the value each offers. Our data science team has developed a sophisticated overlap analysis model that helps us maintain a high-quality data feed by qualifying providers based on unique data values rather than volumes alone – measures that provide significant benefit to our end-use partners.

    Quadrant mobility data contains all standard attributes such as Device ID, Latitude, Longitude, Timestamp, Horizontal Accuracy, and IP Address, and non-standard attributes such as Geohash and H3. In addition, we have historical data available back through 2022.

    Through our in-house data science team, we offer sophisticated technical documentation, location data algorithms, and queries that help data buyers get a head start on their analyses. Our goal is to provide you with data that is “fit for purpose”.

  4. 3. Original EPIC-1 Data Source

    • dataverse.harvard.edu
    csv, pdf, tsv, xlsx
    Updated Dec 22, 2016
    Cite
    Harvard Dataverse (2016). 3. Original EPIC-1 Data Source [Dataset]. http://doi.org/10.7910/DVN/A4HJUR
    Explore at:
    tsv(3475), csv(4821), xlsx(50187), pdf(66481) (available download formats)
    Dataset updated
    Dec 22, 2016
    Dataset provided by
    Harvard Dataverse
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Original EPIC-1 data source and documented intermediate data manipulation. These files are provided in order to ensure a complete audit trail and documentation. They include the original source data as well as files created in the process of cleaning and preparing the datasets found in section I of the dataverse (1. Pooled and Adjusted EPIC Data). These intermediary files document any adjustments to assumptions, currency conversions, and data cleaning processes. Ordinarily, analysis would be done using the datasets in section I; researchers would not need the files in this section except to trace variables back to their original source.

    "Adjustments for the EPIC-2 data were conducted with advice and input from the data collection team (EPIC-1). The magnitude of these adjustments is documented in the attached table. These documented adjustments explain the lion's share of the discrepancies, leaving only minor unaccounted differences in the data (Δ range 0% - 1.1%)."

    "In addition to using the sampling weights, any extrapolation to achieve nationwide cost estimates for Benin, Ghana, Zambia, and Honduras uses a scale-up factor to take into account facilities that are outside of the sampling frame. For example, after taking into account the sampling weights, the total facility-level delivery cost in the Benin sampling frame (343 facilities) is $2,094,031. To estimate the total facility-level delivery cost in the entire country of Benin (695 facilities), the sample-frame cost estimate is multiplied by 695/343 (≈ $4.24 million)."

    "Additional adjustments for the EPIC-2 analysis include the series of decisions for weighting, methods, and data sources. For EPIC-2 analyses, average costs per dose and DTP3 were calculated as total costs divided by total outputs, representing a funder's perspective. We also report results as a simple average of the site-level cost per output. All estimates were adjusted for survey weighting. In particular, the analyses in EPIC-2 relied exclusively on information from the sample, whereas in some instances EPIC-1 teams were able to strategically leverage other available data sources."

  5. Data from: The International Transport Energy Modeling (iTEM) Open Data &...

    • zenodo.org
    • explore.openaire.eu
    pdf
    Updated Sep 11, 2024
    Cite
    Humberto Linero; Sonia Yeh; Paul Kishimoto; Pierpaolo Cazzola; Lewis Fulton; David McCollum; Joshua Miller; Page Kyle; Manuel Pérez Bravo (2024). The International Transport Energy Modeling (iTEM) Open Data & Harmonized Transport Database [Dataset]. http://doi.org/10.5281/zenodo.13749361
    Explore at:
    pdf (available download formats)
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Humberto Linero; Sonia Yeh; Paul Kishimoto; Pierpaolo Cazzola; Lewis Fulton; David McCollum; Joshua Miller; Page Kyle; Manuel Pérez Bravo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset and documentation contain detailed information about the iTEM Open Database, a harmonized transport dataset of historical values from 1970 to the present. It aims to create transparency through two key features:

    • Open-Data: Assembling a comprehensive collection of publicly-available transportation data
    • Open-Code: All code and documentation will be publicly accessible and open for modification and extension. https://github.com/transportenergy

    The iTEM Open Database comprises individual datasets collected from public sources. Each dataset is downloaded, cleaned, and harmonised to the common region and technology definitions defined by the iTEM consortium (https://transportenergy.org). For each dataset, we give the dataset's name, a web link to the original source, a web link to the cleaning script (in Python), the variables, and a plain-English explanation of the data cleaning steps.

    Should you find any problems with the dataset, please report them at https://github.com/transportenergy/database/issues.

  6. Cleaned NHANES 1988-2018

    • figshare.com
    txt
    Updated Feb 18, 2025
    Cite
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
    Explore at:
    txt (available download formats)
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    figshare
    Authors
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, these data must be processed before new insights can be derived through large-scale analyses. We therefore developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables; e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

    csv Data Record: The curated NHANES datasets and data dictionaries comprise 23 .csv files and 1 Excel file. The curated datasets involve 20 .csv files, two per module: an uncleaned version and a cleaned version. The modules are labeled: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. "dictionary_nhanes.csv" lists the variable name, description, module, category, units, CAS number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES. "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables. "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes. "nhanes_inconsistencies_documentation.xlsx" is an Excel file containing the cleaning documentation, which records all inconsistencies for all affected variables to help curate each of the NHANES modules.

    R Data Record: For researchers who want to conduct their analysis in the R programming language, the cleaned NHANES modules and data dictionaries can be downloaded as a .zip file that includes an .RData file and an .R file. "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects, along with all R scripts for the customized functions written to curate the data. "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

    Example starter code: The set of starter code to help users conduct exposome analyses consists of four R Markdown files (.Rmd); we recommend going through the tutorials in order. "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together (see the sketch below). "example_1 - account_for_nhanes_design.Rmd" demonstrates how to fit a linear regression model, a survey-weighted regression model, a Cox proportional hazards model, and a survey-weighted Cox proportional hazards model. "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one or more variables, with and without accounting for the NHANES sampling design. "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models, with and without adjusting for the sampling design.
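
    As a flavor of what example_0 covers, a minimal sketch of merging two curated modules in R, assuming the cleaned files share NHANES's respondent sequence number SEQN as the participant key (the file names and key name are assumptions; consult dictionary_nhanes.csv and the repository's own scripts):

    demographics <- read.csv("demographics_clean.csv")  # hypothetical file names
    chemicals    <- read.csv("chemicals_clean.csv")

    # Left-join the chemicals module onto demographics by participant id
    merged <- merge(demographics, chemicals, by = "SEQN", all.x = TRUE)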

  7. COVID-19 High Frequency Phone Survey of Households 2020 - Viet Nam

    • microdata.worldbank.org
    • datacatalog.ihsn.org
    Updated Oct 26, 2023
    Cite
    World Bank (2023). COVID-19 High Frequency Phone Survey of Households 2020 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/3813
    Explore at:
    Dataset updated
    Oct 26, 2023
    Dataset authored and provided by
    World Bank (http://worldbank.org/)
    Time period covered
    2020
    Area covered
    Vietnam
    Description

    Abstract

    The main objective of this project is to collect household data for the ongoing assessment and monitoring of the socio-economic impacts of COVID-19 on households and family businesses in Vietnam. The estimated field work and sample size of households in each round is as follows:

    Round 1 (June fieldwork): approximately 6,300 households (at least 1,300 minority households)
    Round 2 (August fieldwork): approximately 4,000 households (at least 1,000 minority households)
    Round 3 (September fieldwork): approximately 4,000 households (at least 1,000 minority households)
    Round 4 (December fieldwork): approximately 4,000 households (at least 1,000 minority households)
    Round 5: pending discussion

    Geographic coverage

    National, regional

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of all communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and 15 households are then randomly selected in each EA for interview. Of the 15 households, 3 have information collected on both income and expenditure (the large module) as well as many other aspects; the remaining 12 have information collected on income but not expenditure (the small module). Estimation with the large module therefore covers 9,396 households and is representative at the regional and national levels, while the whole sample is representative at the provincial level.

    We use the large module to select the households for official interview in the VHFPS survey, with the small-module households held in reserve for replacement. The large module has a sample size of 9,396 households, of which 7,951 have a phone number (cell phone or landline).

    After data processing, the final sample size is 6,213 households.

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    The questionnaire for Round 1 consisted of the following sections:

    Section 2. Behavior
    Section 3. Health
    Section 4. Education & Child caring
    Section 5A. Employment (main respondent)
    Section 5B. Employment (other household member)
    Section 6. Coping
    Section 7. Safety Nets
    Section 8. FIES

    Cleaning operations

    Data cleaning began during the data collection process. Inputs for the cleaning process included interviewers' notes following each question item, interviewers' notes at the end of the tablet form, and supervisors' notes made during monitoring. The data cleaning process was conducted in the following steps:

    • Append households interviewed in ethnic minority languages to the main dataset interviewed in Vietnamese.
    • Remove unnecessary variables that were automatically calculated by SurveyCTO.
    • Remove household duplicates where the same form was submitted more than once.
    • Remove observations of households that were not supposed to be interviewed under the identified replacement procedure.
    • Format variables according to their object type (string, integer, decimal, etc.).
    • Read through interviewers' notes and make adjustments accordingly. During interviews, whenever interviewers found it difficult to choose a correct code, they were advised to choose the most appropriate one and write down the respondent's answer in detail, so that the survey management team could judge which code best suited that answer.
    • Correct data based on supervisors' notes where enumerators entered a wrong code.
    • Recode the answer option "Other, please specify". This option is usually followed by a blank line allowing enumerators to type or write text specifying the answer. The data cleaning team checked these answers thoroughly to decide whether each needed recoding into one of the available categories or should be kept as originally recorded. In some cases an answer was assigned a completely new code if it appeared many times in the survey dataset.
    • Examine the accuracy of outlier values, defined as values lying outside the 5th and 95th percentiles, by listening to interview recordings (see the R sketch below).
    • Final check on matching the main dataset with the different sections; where information is asked at the individual level, it is kept in separate data files in long form.
    • Label variables using the full question text.
    • Label variable values where necessary.
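
    The percentile-based outlier screen described above is straightforward to express; a sketch in R, where x stands for any numeric survey variable (the example column name is illustrative):

    # Flag values outside the 5th and 95th percentiles for manual review
    flag_outliers <- function(x) {
      bounds <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE)
      !is.na(x) & (x < bounds[1] | x > bounds[2])
    }

    # e.g., review flagged income values against the interview recordings
    # flagged <- survey_data[flag_outliers(survey_data$income), ]  # hypothetical column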

    Response rate

    The target for Round 1 was to complete interviews with 6,300 households, of which 1,888 households are located in urban areas and 4,475 in rural areas; in addition, at least 1,300 ethnic minority households were to be interviewed. A random selection of 6,300 households was made out of the 7,951 for official interview, with the rest held for replacement. However, the refusal rate of the survey was about 27 percent, so randomly selected households from the small module in the same EAs were contacted as replacements.

  8. Cleaned spouse and marriage data - Malawi

    • kpsmw.lshtm.ac.uk
    Updated Oct 25, 2022
    Cite
    Professor Amelia (Mia) Crampin (2022). Cleaned spouse and marriage data - Malawi [Dataset]. https://kpsmw.lshtm.ac.uk/nada/index.php/catalog/12
    Explore at:
    Dataset updated
    Oct 25, 2022
    Dataset authored and provided by
    Professor Amelia (Mia) Crampin
    Area covered
    Malawi
    Description

    Abstract

    The do-file marital_spouselinks.do combines all data on people's marital statuses and reported spouses to create the following datasets:

    1. all_marital_reports: a listing of all the times an individual has reported their current marital status, with the id numbers of the reported spouse(s); this listing is as reported, so it may include discrepancies (e.g., a 'Never married' status following a 'Married' one).
    2. all_spouse_pairs_full: a listing of each time each spouse pair has been reported, plus summary information on co-residency for each pair.
    3. all_spouse_pairs_clean_summarised: summarises the data from all_spouse_pairs_full to give the start and end dates of unions.
    4. marital_status_episodes: combines data from all the sources to create episodes of marital status; each episode has a start date, an end date, and a marital status, plus, if currently married, the spouse ids of the current spouse(s) where reported. Several variables indicate where each piece of information comes from.

    The first 2 datasets are made available in case people need the 'raw' data for any reason (e.g., if they only want data from one study) or wish to summarise the data differently from what is done for the last 2 datasets.

    The do-file is quite complicated, with many sources of data going through multiple processes to create the variables in the datasets, so it is not always straightforward to document where each variable comes from. The 4 datasets build on each other, and the do-file is documented throughout, so anyone wanting to understand it in great detail may be better off examining the do-file itself. However, below is a brief description of how the datasets are created:

    Marital status data are stored in the tables of the study in which they were collected:

    AHS: Adult Health Study [ahs_ahs1]
    CEN: Census (initial CRS census) [cen_individ]
    CENM: In-migration (CRS migration form) [crs_cenm]
    GP: General form (filled in for various reasons) [gp_gpform]
    SEI: Socio-economic individual (annual survey from 2007 onwards) [css_sei]
    TBH: TB household (study of household contacts of TB patients) [tb_tbh]
    TBO: TB controls (matched controls for TB patients) [tb_tbo & tb_tboto2007]
    TBX: TB cases (TB patients) [tb_tbx & tb_tbxto2007]

    In many of the above surveys, people were asked to report their current and past spouses (in addition to their current marital status), along with, in some cases, information about the marriage (start/end year, etc.). These data are stored together in the table gen_spouse, with variables indicating which study each record came from. Further evidence of spousal relationships is taken from gen_identity (if a couple appear as co-parents of a CRS member) and from crs_residency_episodes_clean_poly, a combined dataset (if they were living in the same household at the same time). Note that co-parent couples who are not reported in gen_spouse are retained in the datasets only if they have co-resident episodes.

    The marital status data are appended together and the spouse id data merged in; minimal data editing/cleaning is carried out. As the spouse data are in long format, this dataset is reshaped wide to have one line per marital status report (polygamy in the area allows men to have multiple spouses at one time). This dataset is saved as all_marital_reports.

    The list of reported spouses in gen_spouse is appended to a list of co-parents (from gen_identity), and this list is cleaned to try to identify and remove obvious id errors (incestuous links, same-sex links [these are not reported in this culture], and large age differences). Data reported by men and women are compared, and variables are created to show whether one or both members of the couple report the union. Many records have information on the start and end year of the marriage, and all have the date the union was reported. This listing is compared to data from residency episodes to add the dates that couples were living together (not all records have start/end dates, so this supplements them); in addition, the dates each member of the couple was last known to be alive or first known to be dead are added (also from the residency data). This dataset, with all the records available for each spouse pair, is saved as all_spouse_pairs_full.

    The date data from all_spouse_pairs_full are then summarised to get one line per couple, with the earliest and latest known married dates for all couples and, where available, marriage and separation dates. For each date, variables are also created to indicate the source of the data. As the culture only allows women one spouse at a time, records for women with 'overlapping' husbands are cleaned. This dataset is then saved as all_spouse_pairs_clean_summarised.

    Both the cleaned spouse pairs and the cleaned marital status datasets are converted into episodes: for the spouse listing, the marriage date or first known married date is used as the beginning, and the last known married date plus a year, or the separation date, as the end; the marital status records are collapsed into periods over which the same status was reported (after some cleaning to remove impossible reports), with the start date being the first of these reports and the end date the last of the reports plus a year. These episodes are appended together, and a series of processes is run several times to remove overlapping episodes. To be able to assign specific spouse ids to each married episode, some episodes need to be split into more than one. For example, if a man is married to one woman from 2005 to 2017 and marries another woman in 2008, remaining married to her until 2017, his initial married episode would run from 2005 to 2017; this would need to be split into one episode from 2005 to 2008 with one idspouse attached and another from 2008 to 2017 with two idspouse attached (see the sketch below). After this splitting process the spouse ids are merged in.
    The final episode dataset is saved as marital_status_episodes.
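
    The splitting logic can be illustrated with a small sketch in R (the actual processing is done in the Stata do-file; the data frame below just reproduces the example of the man married from 2005 to 2017 who takes a second wife in 2008):

    # Each row is one union with start and end years
    unions <- data.frame(idspouse = c("A", "B"),
                         start    = c(2005, 2008),
                         end      = c(2017, 2017))

    # Every distinct start/end year is a potential episode boundary
    cuts <- sort(unique(c(unions$start, unions$end)))

    # Build episodes between consecutive boundaries, attaching the spouses
    # whose unions cover each episode
    episodes <- data.frame(start = head(cuts, -1), end = tail(cuts, -1))
    episodes$spouses <- sapply(seq_len(nrow(episodes)), function(i) {
      paste(unions$idspouse[unions$start <= episodes$start[i] &
                            unions$end   >= episodes$end[i]], collapse = ",")
    })
    episodes  # 2005-2008 has spouse A only; 2008-2017 has spouses A and B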

    Analysis unit

    Individual

    Mode of data collection

    Face-to-face [f2f]

  9. ETH Travel Data Archive

    • datadiscoverystudio.org
    resource url
    Updated 2001
    Cite
    (2001). ETH Travel Data Archive [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/adb1aec7d89e45caa553f99c92ae2ab4/html
    Explore at:
    resource url (available download formats)
    Dataset updated
    2001
    Description

    Link Function: information

  10. Mobile Location Data | United States | +300M Unique Devices | +150M Daily...

    • datarade.ai
    .json, .xml, .csv
    Updated Jul 7, 2020
    Cite
    Quadrant (2020). Mobile Location Data | United States | +300M Unique Devices | +150M Daily Users | +200B Events / Month [Dataset]. https://datarade.ai/data-products/mobile-location-data-us
    Explore at:
    .json, .xml, .csv (available download formats)
    Dataset updated
    Jul 7, 2020
    Dataset authored and provided by
    Quadrant
    Area covered
    United States
    Description

    Quadrant provides insightful, accurate, and reliable mobile location data.

    Our privacy-first mobile location data unveils hidden patterns and opportunities, provides actionable insights, and fuels data-driven decision-making at the world's biggest companies.

    These companies rely on our privacy-first Mobile Location and Points-of-Interest Data to unveil hidden patterns and opportunities, provide actionable insights, and fuel data-driven decision-making. They build better AI models, uncover business insights, and enable location-based services using our robust and reliable real-world data.

    We conduct stringent evaluations of data providers to ensure authenticity and quality. Our proprietary algorithms detect and cleanse corrupted and duplicated data points – allowing you to leverage our datasets rapidly with minimal processing or cleaning. During the ingestion process, our proprietary Data Filtering Algorithms remove events based on a number of qualitative factors as well as latency and other integrity variables, providing more efficient data delivery. The deduplicating algorithm focuses on a combination of four important attributes: Device ID, Latitude, Longitude, and Timestamp. It scours our data and identifies rows that contain the same combination of these four attributes; post-identification, it retains a single copy and eliminates the duplicates to ensure our customers only receive complete and unique datasets (the deduplication sketch under dataset 3 applies here as well).

    We actively identify overlapping values at the provider level to determine the value each offers. Our data science team has developed a sophisticated overlap analysis model that helps us maintain a high-quality data feed by qualifying providers based on unique data values rather than volumes alone – measures that provide significant benefit to our end-use partners.

    Quadrant mobility data contains all standard attributes such as Device ID, Latitude, Longitude, Timestamp, Horizontal Accuracy, and IP Address, and non-standard attributes such as Geohash and H3. In addition, we have historical data available back through 2022.

    Through our in-house data science team, we offer sophisticated technical documentation, location data algorithms, and queries that help data buyers get a head start on their analyses. Our goal is to provide you with data that is “fit for purpose”.

  11. Survey of the assessment and citizen awareness on urban cleaning in the city...

    • data.europa.eu
    unknown
    Cite
    Ayuntamiento de Madrid, Survey of the assessment and citizen awareness on urban cleaning in the city of Madrid (series) [Dataset]. https://data.europa.eu/data/datasets/https-datos-madrid-es-egob-catalogo-300256-0-encuesta-ciudadana-limpieza
    Explore at:
    unknown(22528), unknown(48128), unknown(256000), unknown(247808) (available download formats)
    Dataset provided by
    Madrid City Council (http://www.madrid.es/)
    Authors
    Ayuntamiento de Madrid
    License

    https://datos.madrid.es/egob/catalogo/aviso-legal

    Description

    The City Council of Madrid aims to promote quality of life in the city, with urban cleaning being one of its main aspects. This survey is carried out in order to incorporate citizens' opinions on how dirty the streets are. The 'Associated documentation' section includes the data structure file (record layout, values, and field structure of the results file), the technical data sheet, and the questionnaire.

  12. Cyclistic-Data-202011-202110

    • kaggle.com
    Updated Nov 15, 2021
    Cite
    Yaad Nahshon (2021). Cyclistic-Data-202011-202110 [Dataset]. https://www.kaggle.com/yaadnahshon/cyclisticdata202011202110/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 15, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yaad Nahshon
    Description

    Capstone case study from Google Data Analytics Professional Certificate program.

    This dataset was collected by Motivate International Inc. I've included only the last 12 months, from November 2020 to October 2021.

    Introduction

    Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

    Scenario

    You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

    Moreno, the director of marketing and your manager, has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

    Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently? You will produce a report with the following deliverables:

    1. A clear statement of the business task
    2. A description of all data sources used
    3. Documentation of any cleaning or manipulation of data
    4. A summary of your analysis
    5. Supporting visualizations and key findings
    6. Your top three recommendations based on your analysis

  13. Bellabeat - Capstone

    • kaggle.com
    Updated Jan 10, 2022
    Cite
    VMB2021 (2022). Bellabeat - Capstone [Dataset]. https://www.kaggle.com/vmb2021/bellabeat-capstone
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 10, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    VMB2021
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Business task: Provide a high-level recommendation to help guide Bellabeat's marketing strategy and unlock new growth opportunities.

    Key stakeholders: Urška Sršen, cofounder and Chief Creative Officer of Bellabeat; Sando Mur, mathematician and Bellabeat cofounder.

    Data sources used: FitBit Fitness Tracker Data, https://www.kaggle.com/arashnic/fitbit (CC0: Public Domain, dataset made available through Mobius).

    Documentation of any cleaning or manipulation of data

    RStudio Cloud is the best tool for this project due to the data size. Packages used:

    install.packages("lubridate")
    library(lubridate)  # date parsing
    install.packages("ggplot2")
    library(ggplot2)    # visualization
    install.packages("dplyr")
    library(dplyr)      # data manipulation
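
    A hedged sketch of how these packages might be applied to the daily-activity file in the FitBit dataset (the file and column names follow the common Kaggle version of that dataset but should be verified against the actual download):

    # Uses the packages loaded above
    daily <- read.csv("dailyActivity_merged.csv")

    # lubridate: parse the month/day/year date strings
    daily$ActivityDate <- mdy(daily$ActivityDate)

    # dplyr: average daily steps per user
    daily %>%
      group_by(Id) %>%
      summarise(avg_steps = mean(TotalSteps, na.rm = TRUE))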


  14. Smart Clean-In-Place Skid Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jun 28, 2025
    Cite
    Dataintelo (2025). Smart Clean-In-Place Skid Market Research Report 2033 [Dataset]. https://dataintelo.com/report/smart-clean-in-place-skid-market
    Explore at:
    pptx, pdf, csv (available download formats)
    Dataset updated
    Jun 28, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Smart Clean-In-Place Skid Market Outlook



    According to our latest research, the global smart clean-in-place (CIP) skid market size was valued at USD 1.54 billion in 2024, with a robust growth trajectory anticipated over the coming years. The market is projected to reach USD 3.47 billion by 2033, expanding at a compelling CAGR of 9.4% from 2025 to 2033. This significant growth is primarily driven by the increasing demand for automation and efficiency in cleaning processes across industries such as food & beverage, pharmaceuticals, and chemicals. As per the latest research, the integration of advanced sensors and controllers, coupled with stringent hygiene regulations, continues to propel the adoption of smart CIP skids globally.
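
    As a quick plausibility check on these figures: compounding the 2024 base over the nine years from 2025 to 2033 at the stated CAGR gives USD 1.54 billion × 1.094^9 ≈ USD 1.54 billion × 2.25 ≈ USD 3.46 billion, in line with the projected USD 3.47 billion (the small gap reflects rounding in the quoted CAGR).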




    A primary growth factor for the smart clean-in-place skid market is the escalating emphasis on food safety and regulatory compliance. Industries such as food & beverage and pharmaceuticals are under constant scrutiny to maintain high standards of cleanliness and sanitation in their production environments. The implementation of smart CIP skids helps companies adhere to these regulations by delivering precise, repeatable, and validated cleaning cycles, thereby minimizing the risk of contamination and product recalls. Additionally, the ability of these systems to automate cleaning protocols reduces the need for manual intervention, further enhancing operational efficiency and ensuring that hygiene standards are consistently met. The growing consumer awareness regarding food safety and the increasing stringency of global health regulations are compelling manufacturers to invest in advanced CIP technologies, fueling market growth.




    Another significant driver is the rising trend of process optimization and resource efficiency within industrial operations. Smart CIP skids are equipped with advanced components such as sensors, controllers, and automated valves, which enable real-time monitoring and control of cleaning parameters. This technological advancement leads to substantial savings in water, energy, and cleaning agents, aligning with the sustainability goals of modern enterprises. Moreover, the integration of data analytics and IoT connectivity allows for predictive maintenance and performance optimization, reducing downtime and operational costs. As industries continue to prioritize sustainable practices and cost reduction, the adoption of intelligent CIP solutions is expected to accelerate, further bolstering market expansion over the forecast period.




    The rapid pace of digital transformation and Industry 4.0 initiatives is also playing a pivotal role in shaping the smart CIP skid market landscape. Manufacturers are increasingly leveraging automation and digitalization to enhance production flexibility, traceability, and quality assurance. Smart CIP skids, with their ability to seamlessly integrate into existing manufacturing execution systems (MES) and supervisory control and data acquisition (SCADA) platforms, offer unparalleled benefits in terms of process transparency and control. This integration not only improves cleaning validation and documentation but also supports remote monitoring and diagnostics, enabling swift response to process deviations. The ongoing adoption of smart factory concepts and the proliferation of connected devices are expected to create new avenues for market growth, especially in regions with advanced industrial infrastructure.




    From a regional perspective, Asia Pacific is emerging as a key growth engine for the smart clean-in-place skid market, driven by rapid industrialization and expanding manufacturing sectors in countries such as China, India, and Southeast Asia. North America and Europe continue to lead in terms of technological innovation and regulatory compliance, with established players investing heavily in automation and digitalization. Meanwhile, Latin America and the Middle East & Africa are witnessing steady growth, supported by increasing investments in food processing and pharmaceutical manufacturing. The diverse regional dynamics and varying adoption rates underscore the global nature of the smart CIP skid market, with each region presenting unique opportunities and challenges for stakeholders.



    Sensors Analysis



    Sensors play a foundational role in the smart clean-in-place skid market, serving as the primary means of collecting real-time data on critical process parameters such as temperature, pressure, flow rate, and chemical concentration. The i

  15. Raw Data.

    • plos.figshare.com
    • figshare.com
    xlsx
    Updated Jul 1, 2025
    Cite
    Juan Zhou; Wei Guo; Dongling Liu; Jianrong Li; Caixia Yang; Ying Wang; Xiaoyi Huang (2025). Raw Data. [Dataset]. http://doi.org/10.1371/journal.pone.0326380.s001
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Jul 1, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Juan Zhou; Wei Guo; Dongling Liu; Jianrong Li; Caixia Yang; Ying Wang; Xiaoyi Huang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cleaning indicators are widely used to evaluate the efficacy of cleaning processes in automated washer-disinfectors (AWDs) in healthcare settings. In this study, we systematically analyzed the performance of commercial indicators across multiple simulated cleaning protocols to guide the selection of suitable cleaning indicators in Central Sterile Supply Departments (CSSD). Eleven commercially available cleaning indicators were tested in five cleaning simulations, P0 to P4, where P1 represented the standard cleaning process in CSSD, while P2-P4 incorporated induced-error cleaning processes to mimic real-world errors. All indicators were uniformly positioned at the top level of the cleaning rack to ensure comparable exposure. Key parameters, including indicator response dynamics (e.g., wash-off sequence) and final residue results, were documented throughout the cleaning cycles. The final wash-off results given by the indicators under P0, in which no detergent was injected, were much worse than those of the other four processes. Under the different simulations, the indicators' final results and wash-off sequences changed substantially. In conclusion, an effective indicator must be selected experimentally: the optimal indicator for monitoring cleaning processes is the one that is washed off last during the normal cleaning process while still clearly showing the presence of dirt residue under induced-error conditions.

  16. Number of interviews per participant.

    • plos.figshare.com
    xls
    Updated May 29, 2024
    Cite
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Number of interviews per participant. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t002
    Explore at:
    xls (available download formats)
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Initial data analysis (IDA) is the part of the data pipeline that takes place between the end of data retrieval and the beginning of data analysis that addresses the research question. Systematic IDA and clear reporting of the IDA findings are important steps towards reproducible research. A general framework of IDA for observational studies includes data cleaning, data screening, and possible updates of pre-planned statistical analyses. Longitudinal studies, where participants are observed repeatedly over time, pose additional challenges, as they have special features that should be taken into account in the IDA steps before addressing the research question. We propose a systematic approach in longitudinal studies to examine data properties prior to conducting planned statistical analyses. In this paper we focus on the data screening element of IDA, assuming that the research aims are accompanied by an analysis plan, meta-data are well documented, and data cleaning has already been performed. IDA data screening comprises five types of explorations, covering the analysis of participation profiles over time, evaluation of missing data, presentation of univariate and multivariate descriptions, and the depiction of longitudinal aspects. Executing the IDA plan will result in an IDA report to inform data analysts about data properties and possible implications for the analysis plan—another element of the IDA framework. Our framework is illustrated focusing on hand grip strength outcome data from a data collection across several waves in a complex survey. We provide reproducible R code on a public repository, presenting a detailed data screening plan for the investigation of the average rate of age-associated decline of grip strength. With our checklist and reproducible R code we provide data analysts with a framework to work with longitudinal data in an informed way, enhancing the reproducibility and validity of their work.

  17. Percentage (%) and number (n) of missing values in the explanatory variables...

    • plos.figshare.com
    xls
    Updated May 29, 2024
    Cite
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Percentage (%) and number (n) of missing values in the explanatory variables and outcome by measurement occasion and sex. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t004
    Explore at:
    xls (available download formats)
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PA: physical activity. Here we show only the first interview data for variables used as time-fixed in the model (height, education and smoking—following the change suggested by IDA) and remove the observations missing by design.

  18. Correlations (above diagonal), standard deviations (diagonal) and...

    • plos.figshare.com
    xls
    Updated May 29, 2024
    Cite
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Correlations (above diagonal), standard deviations (diagonal) and covariances (below diagonal) of grip strength across waves for males. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t006
    Explore at:
    xls (available download formats)
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Correlations (above diagonal), standard deviations (diagonal) and covariances (below diagonal) of grip strength across waves for males.

  19. Percentage (%) and number (n) of missing values in the outcome (maximum grip...

    • plos.figshare.com
    xls
    Updated May 29, 2024
    Cite
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Percentage (%) and number (n) of missing values in the outcome (maximum grip strength) among participants that were interviewed, by age group and sex using all available data. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t003
    Explore at:
    xls (available download formats)
    Dataset updated
    May 29, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Percentage (%) and number (n) of missing values in the outcome (maximum grip strength) among participants that were interviewed, by age group and sex using all available data.

  20. NHANES 1988-2018

    • figshare.com
    application/gzip
    Updated Feb 18, 2025
    Cite
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v3
    Explore at:
    application/gzip (available download formats)
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    figshare
    Authors
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data on the health and environmental exposure of the non-institutionalized US population. Such data have considerable potential for understanding how the environment and behaviors impact human health, and they are currently leveraged to answer public health questions such as the prevalence of disease. However, these data need to be processed before new insights can be derived through large-scale analyses. NHANES data are stored across hundreds of files with multiple inconsistencies. Correcting such inconsistencies takes systematic cross-examination and considerable effort, but is required for accurately and reproducibly characterizing the associations between the exposome and diseases (e.g., cancer mortality outcomes). Thus, we developed a set of curated and unified datasets and accompanying code by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 134,310 participants and 4,740 variables. The variables convey: 1) demographic information, 2) dietary consumption, 3) physical examination results, 4) occupation, 5) questionnaire items (e.g., physical activity, general health status, medical conditions), 6) medications, 7) mortality status linked from the National Death Index, 8) survey weights, 9) environmental exposure biomarker measurements, and 10) chemical comments that indicate which measurements are below or above the lower limit of detection. We also provide a data dictionary listing the variables and their descriptions to help researchers browse the data, as well as R Markdown files with example code for calculating summary statistics and running regression models, to help accelerate high-throughput analysis of the exposome and secular trends in cancer mortality.

    csv Data Record: The curated NHANES datasets and data dictionaries comprise 13 .csv files and 1 Excel file. The curated datasets involve 10 .csv files, one per module, labeled: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. The 11th file is a dictionary listing the variable name, description, module, category, units, CAS number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 4,740 variables in NHANES ("dictionary_nhanes.csv"). The 12th file contains the harmonized categories for the categorical variables ("dictionary_harmonized_categories.csv"). The 13th file contains the dictionary of descriptors for the drug codes ("dictionary_drug_codes.csv"). The 14th file is an Excel file with the cleaning documentation, which records all inconsistencies for all affected variables to help curate each of the NHANES datasets ("nhanes_inconsistencies_documentation.xlsx").

    R Data Record: For researchers who want to conduct their analysis in the R programming language, the curated NHANES datasets and data dictionaries can be downloaded as a .zip file that includes an .RData file and an .R file. The .RData file contains all the aforementioned datasets as R data objects ("w - nhanes_1988_2018.RData"), along with all R scripts for the customized functions written to curate the data. The .R file shows how we used the customized functions (i.e., our pipeline) to curate the data ("m - nhanes_1988_2018.R").
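
    The example code mentioned above covers survey-weighted regression; a minimal sketch of that idea in R using the survey package (the weight and analysis variables below are placeholders, and a full NHANES analysis would also declare strata and PSUs):

    library(survey)

    # Declare the design using a survey-weight column from the weights module
    # (column and data frame names are assumptions; see the data dictionary)
    design <- svydesign(ids = ~1, weights = ~survey_weight, data = nhanes_merged)

    # Survey-weighted regression of an illustrative biomarker on age and sex
    fit <- svyglm(biomarker ~ age + sex, design = design)
    summary(fit)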
