100+ datasets found
  1. Dirty Dataset to practice Data Cleaning

    • kaggle.com
    zip
    Updated Nov 3, 2023
    Cite
    Amrutha yenikonda (2023). Dirty Dataset to practice Data Cleaning [Dataset]. https://www.kaggle.com/datasets/amruthayenikonda/dirty-dataset-to-practice-data-cleaning
    Explore at:
Available download formats: zip (1241 bytes)
    Dataset updated
    Nov 3, 2023
    Authors
    Amrutha yenikonda
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

The dataset, a list of the highest-grossing music tours by women, was obtained by web scraping a Wikipedia page; the scraping code is linked below: https://www.kaggle.com/amruthayenikonda/simple-web-scraping-using-pandas

This dataset can be used to practice data cleaning and manipulation, for example dropping unwanted columns, handling null values, and removing symbols (see the sketch below).
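As a hedged illustration of that workflow: the sketch below scrapes a Wikipedia table with pandas and applies the cleaning steps named in the description. The URL and the "Gross" column are assumptions for demonstration, not the dataset's actual source table or schema.

```python
import pandas as pd

# URL and the "Gross" column are illustrative assumptions only;
# pd.read_html needs the lxml package installed.
url = "https://en.wikipedia.org/wiki/List_of_highest-grossing_concert_tours"
df = pd.read_html(url)[0]             # first table on the page

df = df.dropna()                      # drop rows with null values
df = df.drop(columns=df.columns[-1])  # drop an unwanted trailing column (example)

# strip symbols such as "$" and "," before casting a money column to a number
if "Gross" in df.columns:
    df["Gross"] = pd.to_numeric(
        df["Gross"].astype(str).str.replace(r"[^0-9.]", "", regex=True),
        errors="coerce",
    )
print(df.head())
```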

  2. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  3. Dirty Excel Data

    • kaggle.com
    zip
    Updated Feb 23, 2022
    Cite
    Shiva Vashishtha (2022). Dirty Excel Data [Dataset]. https://www.kaggle.com/datasets/shivavashishtha/dirty-excel-data
    Explore at:
Available download formats: zip (13123 bytes)
    Dataset updated
    Feb 23, 2022
    Authors
    Shiva Vashishtha
    Description

    Dataset

    This dataset was created by Shiva Vashishtha

    Contents

  4. Dirty Dataset to practice Data Cleaning

    • kaggle.com
    zip
    Updated May 20, 2024
    Cite
    Martin Kanju (2024). Dirty Dataset to practice Data Cleaning [Dataset]. https://www.kaggle.com/datasets/martinkanju/dirty-dataset-to-practice-data-cleaning
    Explore at:
Available download formats: zip (1235 bytes)
    Dataset updated
    May 20, 2024
    Authors
    Martin Kanju
    Description

    Dataset

    This dataset was created by Martin Kanju

    Released under Other (specified in description)

    Contents

  5. Data Cleaning Project

    • kaggle.com
    zip
    Updated Aug 19, 2024
    Cite
    Mohanad Hazem Qabil (2024). Data Cleaning Project [Dataset]. https://www.kaggle.com/datasets/muhannadhazemqabil/data-cleaning-project
    Explore at:
Available download formats: zip (79166 bytes)
    Dataset updated
    Aug 19, 2024
    Authors
    Mohanad Hazem Qabil
    Description

    Dataset

    This dataset was created by Mohanad Hazem Qabil

    Contents

  6. Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio

    • qubeshub.org
    Updated Jul 16, 2020
    Cite
    Shelly Gaynor (2020). Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio [Dataset]. http://doi.org/10.25334/DRGD-F069
    Explore at:
    Dataset updated
    Jul 16, 2020
    Dataset provided by
    QUBES
    Authors
    Shelly Gaynor
    Description

    Access and clean an open source herbarium dataset using Excel or RStudio.

  7. Dirty data to clean What's wrong with this dataset

    • kaggle.com
    zip
    Updated Mar 29, 2025
    Cite
    JoannaN_PL_KRK (2025). Dirty data to clean What's wrong with this dataset [Dataset]. https://www.kaggle.com/datasets/joannanplkrk/dirty-data-to-clean-whats-wrong-with-this-dataset
    Explore at:
Available download formats: zip (1835899 bytes)
    Dataset updated
    Mar 29, 2025
    Authors
    JoannaN_PL_KRK
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

This dataset contains ~1000 lines of data about animals spotted in Central/Eastern Europe in 2024 (animal type, country, geolocation as latitude/longitude, gender, estimated height, and body length).

The data was artificially generated.

The primary purpose of this dataset is data-cleaning practice; it can also be used for data visualization and geospatial analysis (e.g. with folium). The dataset has multiple issues, including duplicates, missing data, errors, and wrong formats.

📌 The messy data to clean is saved in animal_data_dirty1.csv. The cleaned version is saved in animal_data_reworked.csv.

    📌 An example of data cleaning was saved in the notebook below:

    https://www.kaggle.com/code/joannanplkrk/cleaning-messy-data

📌 A notebook with geospatial analysis with folium (incl. heatmaps and choropleths) is here:

    https://www.kaggle.com/code/joannanplkrk/folium-geospatial-analysis-of-animal-observations

The result of the cleaning is saved in the file animal_data_reworked.csv.

The most important information:

    • The animal observations cover the timespan between March and June 2024.
    • The place of observation was Central/Eastern Europe.
    • The data was edited by our colleagues Anne Anthony, Bob Bobson, John Johnson and James Johnson. The data they prepared previously contained errors, including dates in the wrong format (we agreed on the European format, but in some cases the US date format was used). A cleaning sketch follows below.
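A minimal pandas pass over the issues listed above might look like the following. The two CSV names come from the description, while the column names are assumptions for illustration.

```python
import pandas as pd

# Column names ("animal_type", "observation_date") are assumptions.
df = pd.read_csv("animal_data_dirty1.csv")

df = df.drop_duplicates()               # duplicates
df = df.dropna(subset=["animal_type"])  # missing data

# Mixed European/US date formats: parse day-first, retry month-first on failures.
dates = pd.to_datetime(df["observation_date"], dayfirst=True, errors="coerce")
fallback = pd.to_datetime(df["observation_date"], errors="coerce")
df["observation_date"] = dates.fillna(fallback)

df.to_csv("animal_data_reworked.csv", index=False)
```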

  8. Restaurant Sales-Dirty Data for Cleaning Training

    • kaggle.com
    zip
    Updated Jan 25, 2025
    Cite
    Ahmed Mohamed (2025). Restaurant Sales-Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/restaurant-sales-dirty-data-for-cleaning-training
    Explore at:
Available download formats: zip (236390 bytes)
    Dataset updated
    Jan 25, 2025
    Authors
    Ahmed Mohamed
    License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Restaurant Sales Dataset with Dirt Documentation

    Overview

    The Restaurant Sales Dataset with Dirt contains data for 17,534 transactions. The data introduces realistic inconsistencies ("dirt") to simulate real-world scenarios where data may have missing or incomplete information. The dataset includes sales details across multiple categories, such as starters, main dishes, desserts, drinks, and side dishes.

    Dataset Use Cases

This dataset is suitable for:

    • Practicing data cleaning tasks, such as handling missing values and deducing missing information.
    • Conducting exploratory data analysis (EDA) to study restaurant sales patterns.
    • Feature engineering to create new variables for machine learning tasks.

    Columns Description

Column Name | Description | Example Values
Order ID | A unique identifier for each order. | ORD_123456
Customer ID | A unique identifier for each customer. | CUST_001
Category | The category of the purchased item. | Main Dishes, Drinks
Item | The name of the purchased item. May contain missing values due to data dirt. | Grilled Chicken, None
Price | The static price of the item. May contain missing values. | 15.0, None
Quantity | The quantity of the purchased item. May contain missing values. | 1, None
Order Total | The total price for the order (Price * Quantity). May contain missing values. | 45.0, None
Order Date | The date when the order was placed. Always present. | 2022-01-15
Payment Method | The payment method used for the transaction. May contain missing values due to data dirt. | Cash, None

    Key Characteristics

    1. Data Dirtiness:

      • Missing values in key columns (Item, Price, Quantity, Order Total, Payment Method) simulate real-world challenges.
      • At least one of the following conditions is ensured for each record to identify an item:
        • Item is present.
        • Price is present.
        • Both Quantity and Order Total are present.
      • If Price or Quantity is missing, the other is used to deduce the missing value (e.g., Order Total / Quantity).
    2. Menu Categories and Items:

      • Items are divided into five categories:
        • Starters: E.g., Chicken Melt, French Fries.
        • Main Dishes: E.g., Grilled Chicken, Steak.
        • Desserts: E.g., Chocolate Cake, Ice Cream.
        • Drinks: E.g., Coca Cola, Water.
        • Side Dishes: E.g., Mashed Potatoes, Garlic Bread.

3. Time Range:

      • Orders span from January 1, 2022, to December 31, 2023.

    Cleaning Suggestions

    1. Handle Missing Values:

      • Fill missing Order Total or Quantity using the formula: Order Total = Price * Quantity.
      • Deduce missing Price from Order Total / Quantity if both are available.
    2. Validate Data Consistency:

      • Ensure that calculated values (Order Total = Price * Quantity) match.
    3. Analyze Missing Patterns:

      • Study the distribution of missing values across categories and payment methods.
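A pandas sketch of these three suggestions; only the CSV file name is an assumption, the column names come from the schema above.

```python
import pandas as pd

df = pd.read_csv("restaurant_sales.csv")  # file name assumed

# 1. Handle missing values via the identity Order Total = Price * Quantity.
df["Order Total"] = df["Order Total"].fillna(df["Price"] * df["Quantity"])
df["Price"] = df["Price"].fillna(df["Order Total"] / df["Quantity"])
df["Quantity"] = df["Quantity"].fillna(df["Order Total"] / df["Price"])

# 2. Validate consistency where all three fields are present.
complete = df[["Price", "Quantity", "Order Total"]].notna().all(axis=1)
inconsistent = complete & ((df["Price"] * df["Quantity"] - df["Order Total"]).abs() > 0.01)
print(f"{inconsistent.sum()} inconsistent rows")

# 3. Analyze missing patterns, e.g. share of missing Items per Category.
print(df.groupby("Category")["Item"].apply(lambda s: s.isna().mean()))
```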

    Menu Map with Prices and Categories

Category | Item | Price
Starters | Chicken Melt | 8.0
Starters | French Fries | 4.0
Starters | Cheese Fries | 5.0
Starters | Sweet Potato Fries | 5.0
Starters | Beef Chili | 7.0
Starters | Nachos Grande | 10.0
Main Dishes | Grilled Chicken | 15.0
Main Dishes | Steak | 20.0
Main Dishes | Pasta Alfredo | 12.0
Main Dishes | Salmon | 18.0
Main Dishes | Vegetarian Platter | 14.0
Desserts | Chocolate Cake | 6.0
Desserts | Ice Cream | 5.0
Desserts | Fruit Salad | 4.0
Desserts | Cheesecake | 7.0
Desserts | Brownie | 6.0
Drinks | Coca Cola | 2.5
Drinks | Orange Juice | 3.0
Drinks | ...
  9. Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data

    • plos.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements (2023). Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data [Dataset]. http://doi.org/10.1371/journal.pone.0228154
    Explore at:
Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
PLOS (http://plos.org/)
    Authors
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

All data are prone to error and require data cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values, and many existing methods have limitations that restrict their usage across different domains.

A decision-making algorithm that modified or deleted growth measurements based on a combination of pre-defined cut-offs and logic rules was designed. Five data cleaning methods for growth were tested with and without the addition of the algorithm and applied to five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors.

Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective in all datasets and had on average a minimum of 26.00% higher sensitivity and 0.12% higher specificity than other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard: returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42% respectively.

Using non-linear mixed effects models combined with the algorithm to clean data allows individual growth trajectories to vary from the population by using repeated longitudinal measurements, identifies consecutive errors or those within the first data entry, avoids the requirement for a minimum number of data entries, preserves data where possible by correcting errors rather than deleting them, and removes duplications intelligently. This algorithm is broadly applicable to the cleaning of anthropometric data in different mammalian species and could be adapted for use in a range of other domains.
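As a toy illustration only (not the authors' algorithm, whose cut-offs and logic rules are defined in the paper), cut-off-plus-logic-rule cleaning of a single weight trajectory could look like this; the cut-off values and the decimal-shift rule are invented for the example.

```python
import pandas as pd

def clean_weights(traj: pd.Series, lo: float = 0.5, hi: float = 90.0) -> pd.Series:
    """Modify or delete implausible weights in one longitudinal trajectory."""
    out = traj.copy()
    for t, w in traj.items():
        if not lo <= w <= hi:       # pre-defined cut-off violated
            shifted = w / 10        # logic rule: a misplaced decimal point?
            out[t] = shifted if lo <= shifted <= hi else None  # correct or delete
    return out

# e.g. a dog recorded at 320 kg is corrected to 32 kg; -1 kg is deleted (NaN)
print(clean_weights(pd.Series([31.5, 320.0, 33.0, -1.0])))
```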

  10. Full Dataset prior to Cleaning

    • figshare.com
    zip
    Updated Mar 31, 2023
    Cite
    Paige Chesshire (2023). Full Dataset prior to Cleaning [Dataset]. http://doi.org/10.6084/m9.figshare.22455616.v1
    Explore at:
Available download formats: zip
    Dataset updated
    Mar 31, 2023
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Paige Chesshire
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset includes all of the data downloaded from GBIF (DOIs provided in README.md as well as below; downloaded Feb 2021) as well as data downloaded from SCAN. This dataset has 2,808,432 records and can be used as a reference to the verbatim data before it underwent the cleaning process. The only modifications made to this dataset after direct download from the data portals are the following:

1) For GBIF records, I renamed the countryCode column to "country" so that the column title is consistent across both GBIF and SCAN.
2) A source column was added to specify whether the record came from GBIF or SCAN.
3) Duplicate records across SCAN and GBIF were removed by identifying identical instances of "catalogNumber" and "institutionCode".
4) Only the Darwin Core (DwC) columns shared across the downloaded datasets were retained. GBIF contained ~249 DwC variables and SCAN contained fewer, so this combined dataset only includes the ~80 columns shared between the two datasets. (A pandas sketch of these steps follows below.)
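A pandas sketch of those four modifications; the file names and read options are assumptions (GBIF occurrence downloads are typically tab-separated).

```python
import pandas as pd

gbif = pd.read_csv("gbif_occurrences.txt", sep="\t", low_memory=False)  # assumed
scan = pd.read_csv("scan_occurrences.csv", low_memory=False)            # assumed

gbif = gbif.rename(columns={"countryCode": "country"})  # 1) consistent column name
gbif["source"] = "GBIF"                                 # 2) record provenance
scan["source"] = "SCAN"

shared = [c for c in gbif.columns if c in scan.columns] # 4) shared DwC columns only
combined = pd.concat([gbif[shared], scan[shared]], ignore_index=True)

# 3) remove cross-portal duplicates sharing catalogNumber and institutionCode
combined = combined.drop_duplicates(subset=["catalogNumber", "institutionCode"])
```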

For GBIF, we downloaded the data in three separate chunks, so there are three DOIs. See below:

GBIF.org (3 February 2021) GBIF Occurrence Download: https://doi.org/10.15468/dl.6cxfsw
GBIF.org (3 February 2021) GBIF Occurrence Download: https://doi.org/10.15468/dl.b9rfa7
GBIF.org (3 February 2021) GBIF Occurrence Download: https://doi.org/10.15468/dl.w2nndm

  11. Data for A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Updated Jul 12, 2023
    Cite
    Nikolaus Parulian (2023). Data for A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning [Dataset]. http://doi.org/10.13012/B2IDB-6827044_V1
    Explore at:
    Dataset updated
    Jul 12, 2023
    Authors
    Nikolaus Parulian
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

The dissertation_demo.zip contains the base code and demonstration materials for the dissertation A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning. Each chapter has a demo folder demonstrating provenance queries or tools. The Airbnb dataset used for demonstration and simulation is not included in this demo but can be accessed directly from the reference website. Any updates to the demonstrations and examples can be found online at: https://github.com/nikolausn/dissertation_demo

  12. Practice Data Cleaning: Synthetic Customer

    • kaggle.com
    zip
    Updated Mar 5, 2024
    Cite
    Hassane Skikri (2024). Practice Data Cleaning: Synthetic Customer [Dataset]. https://www.kaggle.com/datasets/hassaneskikri/practice-data-cleaning-synthetic-customer
    Explore at:
Available download formats: zip (1115888 bytes)
    Dataset updated
    Mar 5, 2024
    Authors
    Hassane Skikri
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Hassane Skikri

    Released under CC0: Public Domain

    Contents

  13. Semi-supervised data cleaning

    • resodate.org
    Updated Dec 4, 2020
    Cite
    Mohammad Mahdavi Lahijani (2020). Semi-supervised data cleaning [Dataset]. http://doi.org/10.14279/depositonce-10928
    Explore at:
    Dataset updated
    Dec 4, 2020
    Dataset provided by
    Technische Universität Berlin
    DepositOnce
    Authors
    Mohammad Mahdavi Lahijani
    Description

Data cleaning is one of the most important but time-consuming tasks for data scientists. The data cleaning task consists of two major steps: (1) error detection and (2) error correction. The goal of error detection is to identify wrong data values. The goal of error correction is to fix these wrong values. Data cleaning is a challenging task due to the trade-off among correctness, completeness, and automation. In fact, detecting/correcting all data errors accurately without any user involvement is not possible for every dataset.

We propose a novel data cleaning approach that detects/corrects data errors with a novel two-step task formulation. The intuition is that, by collecting a set of base error detectors/correctors that can independently mark/fix data errors, we can learn to combine them into a final set of data errors/corrections using a few informative user labels. First, each base error detector/corrector generates an initial set of potential data errors/corrections. Then, the approach ensembles the output of these base error detectors/correctors into one final set of data errors/corrections in a semi-supervised manner. In fact, the approach iteratively asks the user to annotate a tuple, i.e., marking/fixing a few data errors. The approach learns to generalize the user-provided error detection/correction examples to the rest of the dataset, accordingly.

Our novel two-step formulation of the error detection/correction task has four benefits. First, the approach is configuration free and does not need any user-provided rules or parameters. In fact, the approach considers the base error detectors/correctors as black-box algorithms that are not necessarily correct or complete. Second, the approach is effective in the error detection/correction task as its first and second steps maximize recall and precision, respectively. Third, the approach also minimizes human involvement as it samples the most informative tuples of the dataset for user labeling. Fourth, the task formulation of our approach allows us to leverage previous data cleaning efforts to optimize the current data cleaning task.

We design an end-to-end data cleaning pipeline according to this approach that takes a dirty dataset as input and outputs a cleaned dataset. Our pipeline leverages user feedback, a set of data cleaning algorithms, and a set of previously cleaned datasets, if available. Internally, our pipeline consists of an error detection system (named Raha), an error correction system (named Baran), and a transfer learning engine. As our extensive experiments show, our data cleaning systems are effective and efficient, and involve the user minimally. Raha and Baran significantly outperform existing data cleaning approaches in terms of effectiveness and human involvement on multiple well-known datasets.
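A stripped-down sketch of the two-step idea (an illustrative reconstruction, not the actual Raha/Baran code): base detectors independently mark values, and a classifier trained on a few user-labeled values ensembles their votes and generalizes to the rest of the data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def detect_errors(values, base_detectors, labeled_idx, user_labels):
    # Step 1: each black-box detector marks every value (1 = suspected error).
    votes = np.column_stack([d(values) for d in base_detectors])
    # Step 2: learn to combine the votes from a few user labels,
    # then generalize to the whole dataset.
    clf = GradientBoostingClassifier().fit(votes[labeled_idx], user_labels)
    return clf.predict(votes)

# Toy base detectors: empty strings and non-numeric strings.
base = [
    lambda vals: np.array([v == "" for v in vals], dtype=int),
    lambda vals: np.array([not str(v).replace(".", "", 1).isdigit() for v in vals], dtype=int),
]
values = ["12", "", "abc", "3.5", "7", "??"]
print(detect_errors(values, base, labeled_idx=[0, 1, 2], user_labels=[0, 1, 1]))
```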

  14. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    • nada-demo.ihsn.org
    Updated Jul 7, 2023
    Cite
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Explore at:
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other one with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

Sample survey data [ssd]

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
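The official draw was scripted in R and ships with the dataset as an external resource; the following Python sketch merely mirrors the described two-stage procedure, with an assumed frame file and assumed column names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.read_csv("household_frame.csv")  # one row per household (assumed)
TOTAL_EAS = 8000 // 25                      # 8,000 households / 25 per EA = 320 EAs

# Stage 1: allocate EAs to strata proportionally to stratum size
# (stratification by geo_1 and urban/rural, as described above).
alloc = (frame.groupby(["geo_1", "urban_rural"]).size()
              .div(len(frame)).mul(TOTAL_EAS).round().astype(int))

parts = []
for (geo, area), n_eas in alloc.items():
    stratum = frame[(frame["geo_1"] == geo) & (frame["urban_rural"] == area)]
    eas = rng.choice(stratum["ea_id"].unique(), size=n_eas, replace=False)
    # Stage 2: 25 households at random within each selected enumeration area.
    for ea in eas:
        parts.append(stratum[stratum["ea_id"] == ea].sample(n=25, random_state=0))

sample = pd.concat(parts, ignore_index=True)
```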

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files. A sketch of the validator idea follows below.
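A hedged sketch of such a validator loop; the checks shown are invented examples, not the actual ones used for this dataset.

```python
import random

def passes_validators(rec: dict) -> bool:
    # Invented consistency checks for illustration only.
    return (
        0 <= rec["age"] <= 110
        and rec["household_size"] >= 1
        and not (rec["age"] < 5 and rec["years_of_schooling"] > 0)
    )

def generate_valid(draw_one, n: int) -> list:
    accepted = []
    while len(accepted) < n:
        rec = draw_one()              # draw one synthetic observation
        if passes_validators(rec):    # assess it ...
            accepted.append(rec)
        # ... otherwise reject it; the next draw replaces it
    return accepted

# toy generator standing in for the synthesizer
draw = lambda: {"age": random.randint(-2, 110),
                "household_size": random.randint(0, 10),
                "years_of_schooling": random.randint(0, 20)}
sample = generate_valid(draw, 100)
```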

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  15. Dirty Data Sample

    • kaggle.com
    zip
    Updated Feb 22, 2022
    Cite
    Shiva Vashishtha (2022). Dirty Data Sample [Dataset]. https://www.kaggle.com/datasets/shivavashishtha/dirty-data-sample
    Explore at:
Available download formats: zip (52182 bytes)
    Dataset updated
    Feb 22, 2022
    Authors
    Shiva Vashishtha
    Description

    Dataset

    This dataset was created by Shiva Vashishtha

    Contents

  16. Household Expenditure and Income Survey 2008, Economic Research Forum (ERF) Harmonization Data - Jordan

    • catalog.ihsn.org
    Updated Jan 12, 2022
    Cite
    Department of Statistics (2022). Household Expenditure and Income Survey 2008, Economic Research Forum (ERF) Harmonization Data - Jordan [Dataset]. https://catalog.ihsn.org/index.php/catalog/7661
    Explore at:
    Dataset updated
    Jan 12, 2022
    Dataset authored and provided by
    Department of Statistics
    Time period covered
    2008 - 2009
    Area covered
    Jordan
    Description

    Abstract

The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices, determine the characteristics of the poor, and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative at the sub-district level.

The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once a dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.

Data collected through the survey helped in achieving the following objectives:

    1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
    2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns
    3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as profession and educational level of the head of the household and other indicators
    4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
    5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
    6. Provide the necessary income data to serve in calculating poverty indices and identifying the characteristics of the poor, as well as drawing poverty maps
    7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty

    Geographic coverage

    National

    Analysis unit

    • Household/families
    • Individuals

    Universe

    The survey covered a national sample of households and all individuals permanently residing in surveyed households.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

The 2008 Household Expenditure and Income Survey sample was designed using a two-stage cluster stratified sampling method. In the first stage, the primary sampling units (PSUs), the blocks, were drawn using probability proportionate to size, taking the number of households in each block as the block size. The second stage included drawing the household sample (8 households from each PSU) using the systematic sampling method. Four substitute households from each PSU were also drawn, using the systematic sampling method, to be used on the first visit to the block in case any of the main sample households could not be visited for any reason.

To estimate the sample size, the coefficient of variation and the design effect in each subdistrict were calculated for the expenditure variable from the data of the 2006 Household Expenditure and Income Survey. These results were used to estimate the sample size at the sub-district level, provided that the coefficient of variation of the expenditure variable at the sub-district level did not exceed 10%, with a minimum of 6 clusters at the district level, to ensure good cluster representation in the administrative areas and to enable drawing poverty pockets.

It is worth mentioning that expected non-response, as well as the areas of the major cities where poor families are concentrated, was taken into consideration in designing the sample. Therefore, a larger sample was taken from these areas compared to other ones, in order to help reach the poverty pockets and cover them.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    List of survey questionnaires: (1) General Form (2) Expenditure on food commodities Form (3) Expenditure on non-food commodities Form

    Cleaning operations

Raw Data: The design and implementation of this survey included the following procedures:

    1. Sample design and selection
    2. Design of forms/questionnaires, guidelines to assist in filling out the questionnaires, and preparation of instruction manuals
    3. Design of the table templates to be used for the dissemination of the survey results
    4. Preparation of the fieldwork phase, including printing forms/questionnaires, instruction manuals, data collection instructions, data checking instructions and codebooks
    5. Selection and training of survey staff to collect data and run the required data checks
    6. Preparation and implementation of the pretest phase, designed to test and develop forms/questionnaires, instructions and the software programs required for data processing and production of survey results
    7. Data collection
    8. Data checking and coding
    9. Data entry
    10. Data cleaning using data validation programs
    11. Data accuracy and consistency checks
    12. Data tabulation and preliminary results
    13. Preparation of the final report and dissemination of final results

Harmonized Data:

    • The Statistical Package for Social Science (SPSS) was used to clean and harmonize the datasets.
    • The harmonization process started with cleaning all raw data files received from the Statistical Office.
    • Cleaned data files were then all merged to produce one data file on the individual level containing all variables subject to harmonization.
    • A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables.
    • A post-harmonization cleaning process was run on the data.
    • Harmonized data was saved on the household as well as the individual level, in SPSS and converted to STATA format.
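The harmonization itself was done in SPSS; for readers replicating similar steps outside SPSS, this hedged Python sketch shows analogous operations. File and variable names are invented for illustration, and pd.read_spss requires the pyreadstat package.

```python
import pandas as pd

raw = pd.read_spss("heis_2008_raw.sav")  # invented file name

# generate/recode/rename harmonized variables (invented examples)
harmonized = raw.rename(columns={"exp_total": "total_expenditure"})
harmonized["urban"] = raw["area_type"].map({1: "urban", 2: "rural"})

# save at the individual level in Stata format (the SPSS .sav already exists)
harmonized.to_stata("heis_2008_harmonized.dta", write_index=False)
```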

  17. Employment Of India CLeaned and Messy Data

    • kaggle.com
    zip
    Updated Apr 7, 2025
    Cite
    MANSI SHINDE (2025). Employment Of India CLeaned and Messy Data [Dataset]. https://www.kaggle.com/datasets/soniaaaaaaaa/employment-of-india-cleaned-and-messy-data
    Explore at:
Available download formats: zip (29791 bytes)
    Dataset updated
    Apr 7, 2025
    Authors
    MANSI SHINDE
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    India
    Description

    This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.

    🔹 Dataset Composition:

It includes two parallel datasets:

    1. Messy Dataset (Raw) – Represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
    2. Cleaned Dataset – Demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.

Each record captures multiple attributes related to individuals in the Indian job market, including:

    • Age Group
    • Employment Status (Employed/Unemployed)
    • Monthly Salary (INR)
    • Education Level
    • Industry Sector
    • Years of Experience
    • Location
    • Perceived AI Risk
    • Date of Data Recording

    Transformations & Cleaning Applied:

The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form (a pandas sketch follows the list):

    • Missing Values: Identified and handled using either row elimination (where critical data was missing) or imputation techniques.
    • Duplicate Records: Identified using row comparison and removed to prevent analytical skew.
    • Inconsistent Formatting: Unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
    • Incorrect Data Types: Converted columns like salary from string/object to float for numerical analysis.
    • Outliers: Detected and handled based on domain logic and distribution analysis.
    • Categorization: Converted numeric ages into grouped age categories for comparative analysis.
    • Standardization: Uniform labels for employment status, industry names, education, and AI risk levels were applied for visualization clarity.
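A pandas sketch of a few of these transformations. The file name, the raw "Age" column, and the age bins are assumptions based on the description; the column rename comes from the list above.

```python
import pandas as pd

df = pd.read_csv("employment_india_messy.csv")  # file name assumed

df = df.drop_duplicates()  # remove duplicate records
df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})

# string/object -> float for numerical analysis
df["Monthly Salary (INR)"] = pd.to_numeric(df["Monthly Salary (INR)"], errors="coerce")

# standardize labels (capitalization, stray spaces)
df["Employment Status"] = df["Employment Status"].str.strip().str.title()

# numeric ages -> grouped age categories (hypothetical bins)
df["Age Group"] = pd.cut(df["Age"], bins=[17, 25, 35, 50, 65],
                         labels=["18-25", "26-35", "36-50", "51-65"])
```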

    Purpose & Utility:

This dataset is ideal for learners and professionals who want to understand:

    • The impact of messy data on visualization and insights
    • How transformation steps can dramatically improve data interpretation
    • Practical examples of preprocessing techniques before feeding data into ML models or BI tools

It's also useful for:

    • Training ML models with clean inputs
    • Data storytelling with visual clarity
    • Demonstrating reproducibility in data cleaning pipelines

    By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.

  18. Fused dataset

    • zenodo.org
    • data.europa.eu
    Updated Feb 4, 2025
    Cite
    Ben Malin; Ben Malin; Tatiana Kalganova; Tatiana Kalganova (2025). Fused dataset [Dataset]. http://doi.org/10.5281/zenodo.14793207
    Explore at:
    Dataset updated
    Feb 4, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Ben Malin; Ben Malin; Tatiana Kalganova; Tatiana Kalganova
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The current version of the TOD (Task Oriented Dialogues) fused dataset contains samples from MultiWOZ2.2 (Zang et al., 2020), SpokenWOZ (Si et al., 2023), FRAMES (Asri et al., 2017), DSTC3 (Henderson et al., 2014a) and SGD (Rastogi et al., 2020) datasets. These datasets have been selected due to them all being high quality, with significant human validation and data cleaning. Additionally, this selection of datasets provides coverage across unique attributes, such as utterance-level audio files (Si et al., 2023).

The fused dataset covers several domains, as required by the scope of the ELOQUENCE project (https://eloquenceai.eu) and the individual pilots. These datasets are stored using the '.arrow' file extension so that the speed and efficiency of data loading are optimised, as well as being compliant with the popular HuggingFace datasets library (HuggingFace, 2024). The dataset is also available at https://huggingface.co/datasets/Brunel-AI/ELOQUENCE. Currently, several datasets have been implemented within this fused dataset. However, due to the flexibility with which the schema has been defined, there is scope for additional datasets to be implemented in later iterations as further needs are identified. The JSON schema, as well as further explanation of attributes across all domains, is provided in Appendix 10.1 of ELOQUENCE deliverable 1.1.
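Since the dataset is published on the HuggingFace Hub (URL above), it should be loadable with the standard datasets library; the Hub id comes from that URL, while the "train" split name is an assumption.

```python
from datasets import load_dataset

ds = load_dataset("Brunel-AI/ELOQUENCE")
print(ds)              # available splits, features, and row counts
print(ds["train"][0])  # inspect one task-oriented dialogue sample (split assumed)
```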

  19. Description of Demographic Characteristics of Samples After Data Cleaning.

    • datasetcatalog.nlm.nih.gov
    Updated Mar 3, 2015
    Cite
    MacDonald, Douglas A.; Cheong, Hye Wook; Mohan, K. Krishna; Holland, Daniel; Gubrij, Zuzana Ondriasova; Salagame, Kiran Kumar K.; Brewczynski, Jacek; Friedman, Harris L. (2015). Description of Demographic Characteristics of Samples After Data Cleaning. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001931232
    Explore at:
    Dataset updated
    Mar 3, 2015
    Authors
    MacDonald, Douglas A.; Cheong, Hye Wook; Mohan, K. Krishna; Holland, Daniel; Gubrij, Zuzana Ondriasova; Salagame, Kiran Kumar K.; Brewczynski, Jacek; Friedman, Harris L.
    Description

Note. Where frequencies do not add up to the total sample size, data were missing. For religious affiliation, C = Christianity, J = Judaism, I = Islam, H = Hinduism, B = Buddhism, O = Other Religion, N = No Religion. No Religion includes both participants who self-identified as not having any religious affiliation and those who did not report anything to the item on religious affiliation.

  20. Sephora Makeup Dataset – Free Beauty Product CSV

    • crawlfeeds.com
    csv, zip
    Updated Dec 2, 2025
    Cite
    Crawl Feeds (2025). Sephora Makeup Dataset – Free Beauty Product CSV [Dataset]. https://crawlfeeds.com/datasets/sephora-sample-dataset
    Explore at:
Available download formats: zip, csv
    Dataset updated
    Dec 2, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

https://crawlfeeds.com/privacy_policy

    Description

Looking for a free dataset of cosmetic products? The Sephora Makeup Products Sample Dataset provides a ready-to-use CSV of beauty product data containing 340 verified Sephora makeup product records. It includes details like product name, brand, price, ingredients, availability, user review count, and images, making it well suited for e-commerce research, market analysis, price tracking, or building machine-learning and recommendation systems for the beauty industry.

    Key Features

    • Complete Product Metadata: Each record includes URL, product name, brand, price, SKU, ingredients, product description, usage instructions, review count, image links, availability status, and more.
• CSV Format, Ready to Use: Download instantly; no scraping or data cleaning required.
    • Ideal for Beauty-Tech & ML Projects: Useful for price comparison tools, recommendation engines, product cataloging, trend analysis, and sentiment analysis based on reviews/ratings.
    • Free Sample Access: This sample comes at zero cost (USD $0.0) — an excellent starting point for analysts, developers, or researchers.

    This dataset is perfect for market research, price tracking, sentiment analysis, and AI-based recommendation systems. Whether you're an e-commerce retailer, a data analyst, or a machine learning professional, this dataset provides valuable insights into the beauty industry.

    Explore the Beauty and Cosmetics Data Collection and elevate your data-driven strategies today!
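As a quick start, a hedged pandas example; the file name and column names are assumptions based on the field list above, not the CSV's documented schema.

```python
import pandas as pd

df = pd.read_csv("sephora_makeup_sample.csv")  # file name assumed

print(df["brand"].value_counts().head(10))          # best-represented brands
print(df.groupby("brand")["price"].mean()
        .sort_values(ascending=False).head(10))     # priciest brands on average
```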

    Who Can Use This Dataset?

    • E-commerce analysts/retailers analyzing cosmetic product catalogs and pricing.
    • Data scientists / ML engineers building recommendation engines or product-based machine-learning models.
    • Market researchers & beauty industry analysts tracking brand/product trends, availability, and consumer preferences.
    • Students/hobby developers exploring beauty-tech projects, demo analyses, or building portfolios with real-world data.

    Why This Sephora Dataset?

    • Skip the hassle: no need for manual scraping or dealing with anti-scraping restrictions.
    • Clean, structured data - ready for immediate integration with tools or pipelines.
    • Free and accessible: great for testing, proof-of-concept or small-scale analysis.
    • Beauty industry focus: concentrated on makeup and cosmetics products - ideal for niche analyses or applications.