13 datasets found
  1. Employment Of India CLeaned and Messy Data

    • kaggle.com
    Updated Apr 7, 2025
    Cite
    SONIA SHINDE (2025). Employment Of India CLeaned and Messy Data [Dataset]. https://www.kaggle.com/datasets/soniaaaaaaaa/employment-of-india-cleaned-and-messy-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SONIA SHINDE
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    India
    Description

    This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.

    🔹 Dataset Composition:

    It includes two parallel datasets:
    1. Messy Dataset (Raw) – Represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
    2. Cleaned Dataset – This version demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.

    Each record captures multiple attributes related to individuals in the Indian job market, including: - Age Group
    - Employment Status (Employed/Unemployed)
    - Monthly Salary (INR)
    - Education Level
    - Industry Sector
    - Years of Experience
    - Location
    - Perceived AI Risk
    - Date of Data Recording

    Transformations & Cleaning Applied:

    The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form:
    - Missing Values: Identified and handled using either row elimination (where critical data was missing) or imputation techniques.
    - Duplicate Records: Identified using row comparison and removed to prevent analytical skew.
    - Inconsistent Formatting: Unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
    - Incorrect Data Types: Converted columns like salary from string/object to float for numerical analysis.
    - Outliers: Detected and handled based on domain logic and distribution analysis.
    - Categorization: Converted numeric ages into grouped age categories for comparative analysis.
    - Standardization: Uniform labels for employment status, industry names, education, and AI risk levels were applied for visualization clarity.
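    A minimal pandas sketch of the cleaning steps listed above (file and column names are illustrative assumptions, not the dataset's exact schema):

```python
import pandas as pd

# Illustrative cleaning pipeline mirroring the steps described above.
# File and column names are assumptions, not the dataset's exact schema.
df = pd.read_csv("employment_india_messy.csv")

# Duplicate Records: drop exact duplicate rows
df = df.drop_duplicates()

# Inconsistent Formatting: unify column names and string values
df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})
df["Employment Status"] = df["Employment Status"].str.strip().str.title()

# Incorrect Data Types: salary from string/object to float
df["Monthly Salary (INR)"] = pd.to_numeric(df["Monthly Salary (INR)"], errors="coerce")

# Missing Values: drop rows missing critical fields, impute the rest
df = df.dropna(subset=["Employment Status"])
df["Monthly Salary (INR)"] = df["Monthly Salary (INR)"].fillna(
    df["Monthly Salary (INR)"].median()
)

# Categorization: numeric ages into grouped age categories
df["Age Group"] = pd.cut(df["Age"], bins=[0, 25, 40, 60, 100],
                         labels=["0-25", "26-40", "41-60", "61+"])
```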

    Purpose & Utility:

    This dataset is ideal for learners and professionals who want to understand:
    - The impact of messy data on visualization and insights
    - How transformation steps can dramatically improve data interpretation
    - Practical examples of preprocessing techniques before feeding into ML models or BI tools

    It's also useful for: - Training ML models with clean inputs
    - Data storytelling with visual clarity
    - Demonstrating reproducibility in data cleaning pipelines

    By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.

  2. Data from: Integrating Data Transformation in Principal Components Analysis

    • tandf.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Mehdi Maadooliat; Jianhua Z. Huang; Jianhua Hu (2023). Integrating Data Transformation in Principal Components Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.960499.v3
    Explore at:
    pdf (available download format)
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Mehdi Maadooliat; Jianhua Z. Huang; Jianhua Hu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Principal component analysis (PCA) is a popular dimension-reduction method to reduce the complexity and obtain the informative aspects of high-dimensional datasets. When the data distribution is skewed, data transformation is commonly used prior to applying PCA. Such transformation is usually obtained from previous studies, prior knowledge, or trial-and-error. In this work, we develop a model-based method that integrates data transformation in PCA and finds an appropriate data transformation using the maximum profile likelihood. Extensions of the method to handle functional data and missing values are also developed. Several numerical algorithms are provided for efficient computation. The proposed method is illustrated using simulated and real-world data examples. Supplementary materials for this article are available online.
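    As a hedged illustration only (not the authors' profile-likelihood method), the conventional two-step practice the paper aims to improve upon can be sketched in Python: transform each skewed variable with a maximum-likelihood Box-Cox transform, then run an ordinary PCA on the standardized result.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

# Sketch of the conventional two-step alternative: per-variable Box-Cox transform
# (lambda chosen by maximum likelihood), followed by standard PCA. Synthetic data only.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 5))  # right-skewed variables

X_t = np.column_stack([stats.boxcox(X[:, j])[0] for j in range(X.shape[1])])
X_t = (X_t - X_t.mean(axis=0)) / X_t.std(axis=0)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_t)
print(pca.explained_variance_ratio_)
```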

  3. Evaluating Functional Diversity: Missing Trait Data and the Importance of...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated May 30, 2023
    Cite
    Maria Májeková; Taavi Paal; Nichola S. Plowman; Michala Bryndová; Liis Kasari; Anna Norberg; Matthias Weiss; Tom R. Bishop; Sarah H. Luke; Katerina Sam; Yoann Le Bagousse-Pinguet; Jan Lepš; Lars Götzenberger; Francesco de Bello (2023). Evaluating Functional Diversity: Missing Trait Data and the Importance of Species Abundance Structure and Data Transformation [Dataset]. http://doi.org/10.1371/journal.pone.0149270
    Explore at:
    docx (available download format)
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Maria Májeková; Taavi Paal; Nichola S. Plowman; Michala Bryndová; Liis Kasari; Anna Norberg; Matthias Weiss; Tom R. Bishop; Sarah H. Luke; Katerina Sam; Yoann Le Bagousse-Pinguet; Jan Lepš; Lars Götzenberger; Francesco de Bello
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Functional diversity (FD) is an important component of biodiversity that quantifies the difference in functional traits between organisms. However, FD studies are often limited by the availability of trait data and FD indices are sensitive to data gaps. The distribution of species abundance and trait data, and its transformation, may further affect the accuracy of indices when data is incomplete. Using an existing approach, we simulated the effects of missing trait data by gradually removing data from a plant, an ant and a bird community dataset (12, 59, and 8 plots containing 62, 297 and 238 species respectively). We ranked plots by FD values calculated from full datasets and then from our increasingly incomplete datasets and compared the ranking between the original and virtually reduced datasets to assess the accuracy of FD indices when used on datasets with increasingly missing data. Finally, we tested the accuracy of FD indices with and without data transformation, and the effect of missing trait data per plot or per the whole pool of species. FD indices became less accurate as the amount of missing data increased, with the loss of accuracy depending on the index. But, where transformation improved the normality of the trait data, FD values from incomplete datasets were more accurate than before transformation. The distribution of data and its transformation are therefore as important as data completeness and can even mitigate the effect of missing data. Since the effect of missing trait values pool-wise or plot-wise depends on the data distribution, the method should be decided case by case. Data distribution and data transformation should be given more careful consideration when designing, analysing and interpreting FD studies, especially where trait data are missing. To this end, we provide the R package “traitor” to facilitate assessments of missing trait data.
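    A hedged Python sketch of this kind of simulation (not the authors' R package "traitor"): trait values are removed at random, and the stability of plot rankings under a simple FD index, here mean pairwise trait distance, is checked with a rank correlation.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Simulate the effect of missing trait data on plot rankings. The FD index here
# is simply the mean pairwise trait distance among species present in a plot;
# all data are synthetic.
rng = np.random.default_rng(1)
n_species, n_traits, n_plots = 60, 4, 12
traits = rng.normal(size=(n_species, n_traits))
presence = rng.random((n_plots, n_species)) < 0.4   # species occurrences per plot

def fd(trait_matrix, plot_mask):
    sp = trait_matrix[plot_mask]
    sp = sp[~np.isnan(sp).any(axis=1)]              # drop species with missing traits
    return pdist(sp).mean() if len(sp) > 2 else np.nan

full = np.array([fd(traits, presence[p]) for p in range(n_plots)])
for frac in (0.1, 0.3, 0.5):
    incomplete = traits.copy()
    incomplete[rng.random(traits.shape) < frac] = np.nan   # remove trait values at random
    reduced = np.array([fd(incomplete, presence[p]) for p in range(n_plots)])
    rho, _ = spearmanr(full, reduced, nan_policy="omit")
    print(f"{int(frac * 100)}% traits missing: rank correlation with full data = {rho:.2f}")
```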

  4. Data from: Managers' and physicians’ perception of palm vein technology...

    • dataverse.harvard.edu
    Updated Nov 4, 2019
    Cite
    Cruz Cerda III (2019). Data from: Managers' and physicians’ perception of palm vein technology adoption in the healthcare industry (Preprint) and Medical Identity Theft and Palm Vein Authentication: The Healthcare Manager's Perspective (Doctoral Dissertation) [Dataset]. http://doi.org/10.7910/DVN/RSPAZQ
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 4, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Cruz Cerda III
    License

    Custom dataset license: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/RSPAZQ

    Description

    Data from: Doctoral dissertation; preprint article entitled: Managers' and physicians’ perception of palm vein technology adoption in the healthcare industry. Formats of the files associated with the dataset: CSV; SAV.

    SPSS setup files can be used to generate native SPSS file formats such as SPSS system files and SPSS portable files. SPSS setup files generally include the following SPSS sections:
    - DATA LIST: Assigns the name, type, decimal specification (if any), and specifies the beginning and ending column locations for each variable in the data file. Users must replace the "physical-filename" with host computer-specific input file specifications. For example, users on Windows platforms should replace "physical-filename" with "C:\06512-0001-Data.txt" for the data file named "06512-0001-Data.txt" located on the root directory "C:".
    - VARIABLE LABELS: Assigns descriptive labels to all variables. Variable labels and variable names may be identical for some variables.
    - VALUE LABELS: Assigns descriptive labels to codes in the data file. Not all variables necessarily have assigned value labels.
    - MISSING VALUES: Declares user-defined missing values. Not all variables in the data file necessarily have user-defined missing values. These values can be treated specially in data transformations, statistical calculations, and case selection.
    - MISSING VALUE RECODE: Sets user-defined numeric missing values to missing as interpreted by the SPSS system. Only variables with user-defined missing values are included in the statements.

    ABSTRACT: The purpose of the article is to examine the factors that influence the adoption of palm vein technology by considering healthcare managers’ and physicians’ perception, using the Unified Theory of Acceptance and Use of Technology as the theoretical foundation. A quantitative approach was used, with an exploratory research design. A cross-sectional questionnaire was distributed to respondents who were managers and physicians in the healthcare industry and who had previous experience with palm vein technology. The perceived factors tested for correlation with adoption were perceived usefulness, complexity, security, peer influence, and relative advantage. A Pearson product-moment correlation coefficient was used to test the correlation between the perceived factors and palm vein technology adoption. The results showed that perceived usefulness, security, and peer influence are important factors for adoption. Study limitations included purposive sampling from a single industry (healthcare) and the limited literature available on managers’ and physicians’ perception of palm vein technology adoption in the healthcare industry. Future studies could examine the impact of mediating variables on palm vein technology adoption. The study offers managers insight into the important factors that need to be considered in adopting palm vein technology. With biometric technology becoming pervasive, the study seeks to provide managers with insight into managing the adoption of palm vein technology.

    KEYWORDS: biometrics, human identification, image recognition, palm vein authentication, technology adoption, user acceptance, palm vein technology
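    A minimal Python sketch for reading the distributed files, assuming the hypothetical file names below; pyreadstat exposes the SPSS metadata (variable labels, value labels, user-defined missing values) described above:

```python
import pandas as pd
import pyreadstat  # reads SPSS .sav files along with their metadata

# File names are hypothetical. pyreadstat returns the data plus the SPSS
# metadata corresponding to the setup-file sections described above.
df_sav, meta = pyreadstat.read_sav("palm_vein_survey.sav", user_missing=False)
print(meta.column_names_to_labels)   # VARIABLE LABELS
print(meta.variable_value_labels)    # VALUE LABELS

# The CSV version can be read directly with pandas.
df_csv = pd.read_csv("palm_vein_survey.csv")
```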

  5. Eel data (Anguilla anguilla) and associated environment variables for eel in...

    • zenodo.org
    bin
    Updated Nov 8, 2023
    + more versions
    Cite
    Cédric Briand; Cédric Briand; María Mateo; María Mateo; Hilaire Drouineau; María Korta; Estibaliz Díaz; Laurent Beaulaton; Hilaire Drouineau; María Korta; Estibaliz Díaz; Laurent Beaulaton (2023). Eel data (Anguilla anguilla) and associated environment variables for eel in the SUDOE area (SUDOANG project) [Dataset]. http://doi.org/10.5281/zenodo.6397009
    Explore at:
    bin (available download format)
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Cédric Briand; Cédric Briand; María Mateo; María Mateo; Hilaire Drouineau; María Korta; Estibaliz Díaz; Laurent Beaulaton; Hilaire Drouineau; María Korta; Estibaliz Díaz; Laurent Beaulaton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. DESCRIPTION

    1.1. THE SUDOANG PROJECT

    The SUDOANG project aims at providing common tools to managers to support eel conservation in the SUDOE area (Spain, France and Portugal).

    Three main datasets have been used to implement EDA.

    • A database of rivers and their attributes along with tools for chaining
    • A database of dams.
    • A database of electrofishing (current dataset)

    Electrofishing data include site locations, fishing operations which can be done several times at one site, and fish data collected during each operation. The operations are attached to stretches of river or river segments whose characteristics describe the conditions for presence, density, size structure, or silvering rate of the eels.

    The electrofishing operations are classified by type:

    • com: full two-pass fishing
    • coa: full two-pass electrofishing for eel
    • iaa: eel abundance point sampling
    • ber: bank sampling
    • gm: point fishing for large streams
    • oth: other, or unspecified

    1.2. TEMPORAL SCOPE

    From 1985 to 2018; note that the dataset is incomplete after 2015 in France.

    1.3. GEOGRAPHICAL RANGE

    The SUDOE area, including the Iberian Peninsula and France.

    2. DATASETS DESCRIPTION

    2.1. DENSITIES AND PRESENCE ABSENCE

    Dataset: frsppt_12_2020.Rdata

    Most variables are built here.

    The script to create cumulated values for dams can be found here.

    For a technical description see the report.

    This file contains the following datasets:

    • ddd => dataset used to calibrate the presence-absence model, 46147 lines
    • ddg => dataset used to calibrate the gamma model (only positive values retained), 19993 lines
    • tdd => dataset corresponding to places where transport operations have been identified, presence-absence model, 6582 lines
    • tdg => dataset corresponding to places where transport operations have been identified, gamma model (only positive values), 984 lines

    And the following columns (in alphabetical order):

    • altitudem: Altitude in meters, truncated to 800 m
    • area_sudo: Area for recruitment (Drouineau et al., 2021)
    • codesea: Code of the Sea, factor A = Atlantic, M = Mediterranean
    • country: A factor (FR, SP, PT) for France, Spain and Portugal
    • country2: Country with grouping for the Iberian Peninsula (SPPT). Other level is France (FR)
    • cs_height_08_n: Cumulated height from the sea, dam height transformed with power 0.8, no prediction for missing values
    • cs_height_08_n: Same variable but truncated to 300
    • cs_height_08_p: Cumulated height from the sea, dam height transformed with power 0.8, with prediction for missing values
    • cs_height_08_p: Same variable but truncated to 300
    • cs_height_08_pp: Cumulated height from the sea, dam height transformed with power 0.8, with prediction for missing values, the height of the dam is set to zero if equipped with an efficient fishway for eel
    • cs_height_08_pps: Cumulated height from the sea, dam height transformed with power 0.8, with prediction for missing values, the height of dam is set to zero if a score of efficient passage was attributed for eel on this structure
    • cs_height_10_FR: Cumulated height from the sea, no transformation, no prediction for missing values, only the dams from France are considered when building on a transnational water course
    • cs_height_10_n: Cumulated height from the sea, no transformation, no prediction for missing values
    • cs_height_10_n: Same variable but truncated to 200
    • cs_height_10_p: Cumulated height from the sea, no transformation, missing height are extrapolated from two different models in France and the Iberian Peninsula
    • cs_height_10_p: Same variable but truncated to 200
    • cs_height_10_pass0: Cumulated height from the sea, no transformation, no prediction for missing values, only the dams without pass are used to build the cumulated value
    • cs_height_10_pass1: Cumulated height from the sea, no transformation, no prediction for missing values, only the dams with pass are used to build the cumulated value
    • cs_height_10_pp: Cumulated height from the sea, no transformation, with prediction for missing values, the height of the dam is set to zero if equipped with an efficient fishway for eel
    • cs_height_10_ppass0: Cumulated height from the sea, no transformation, including prediction for missing values, only the dams without pass are used to build the cumulated value
    • cs_height_10_ppass1: Cumulated height from the sea, no transformation, including prediction for missing values, only the dams with pass are used to build the cumulated value
    • cs_height_10_pps: Cumulated height from the sea, no transformation, with prediction for missing values, the height of dam is set to zero if a score of efficient passage was attributed for eel on this structure
    • cs_height_10_pscore0: Cumulated height from the sea, no transformation, including prediction for missing values, only the dams without score are used to build the cumulated value
    • cs_height_10_pscore1: Cumulated height from the sea, no transformation, including prediction for missing values, only the dams with a score (assessed by experts as no or small barrier for eel) are used to build the cumulated value
    • cs_height_10_PT: Cumulated height from the sea, no transformation, no prediction for missing values, only the dams from Portugal are considered when building on a transnational water course
    • cs_height_10_score0: Cumulated height from the sea, no transformation, no prediction for missing values, only the dams without score are used to build the cumulated value
    • cs_height_10_score1: Cumulated height from the sea, no transformation, no prediction for missing values, only the dams with a score (assessed by experts as no or small barrier for eel) are used to build the cumulated value
    • cs_height_10_SP: Cumulated height from the sea, no transformation, no prediction for missing values, only the dams from Spain are considered when building on a transnational water course.
    • cs_height_12_n: Cumulated height from the sea, dam height transformed with power 1.2, no prediction for missing values.
    • cs_height_12_n.: Same variable but truncated to 500
    • cs_height_12_p: Cumulated height from the sea, dam height transformed with power 1.2, with prediction for missing values.
    • cs_height_12_p.: Same variable but truncated to 500
    • cs_height_12_pp: Cumulated height from the sea, dam height transformed with power 1.2, with prediction for missing values, the height of the dam is set to zero if equipped with an efficient fishway for eel
    • cs_height_12_pps: Cumulated height from the sea, dam height transformed with power 1.2, with prediction for missing values, the height of dam is set to zero if a score of efficient passage was attributed for eel on this structure
    • cs_height_15_n: Cumulated height from the sea, dam height transformed with power 1.5, no prediction for missing values
    • cs_height_15_n.: Same variable but truncated to 800
    • cs_height_15_p: Cumulated height from the sea, dam height transformed with power 1.5, with prediction for missing values
    • cs_height_15_p.: Same variable but truncated to 800
    • cs_height_15_pp: Cumulated height from the sea, dam height transformed with power 1.5, with prediction for missing values, the height of the dam is set to zero if equipped with an efficient fishway for eel
    • cs_height_15_pps: Cumulated height from the sea, dam height transformed with power 1.5, with prediction for missing values, the height of dam is set to zero if a score of efficient passage was attributed for eel on this structure
    • cumnbdamp: Cumulated number of dam from the sea
    • cumnbdamso: duplicate of cumnbdamp
    • cumwettedsurfacebothkm2.: Surface of water downstream from the segment in the river. Corresponds to both river segments and waterbodies
    • cumwettedsurfacekm2.: Surface of water downstream from the segment in the river. Corresponds only to rivers
    • cumwettedsurfaceotherkm2.: Surface of water downstream from the segment in the river. Corresponds only to waterbodies (water surfaces associated with the segment)
    • densCS: Density from Carle and Strub, number in second pass extrapolated from efficiency if only one pass
    • dist_from_gibraltar_km: Distance to Gibraltar calculated using an envelope along the coastline (i.e., the inland ingress of estuaries is not counted in this distance)
    • distanceseakm: Distance to the sea
    • distanceseakm.: Distance to the sea, truncated at 500
    • distancesourcem: distance to the source in meters
    • downstdrainagewettedsurfaceboth.: Percentage of wetted surface
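    A minimal Python sketch for opening the .Rdata file named above and inspecting the four datasets it contains (assuming the pyreadr package; the project's own tooling is in R):

```python
import pyreadr  # reads .RData files into pandas DataFrames

# Open the densities/presence-absence file and inspect its contents.
result = pyreadr.read_r("frsppt_12_2020.Rdata")
print(list(result.keys()))        # expected objects: ddd, ddg, tdd, tdg

ddd = result["ddd"]               # presence-absence calibration dataset
print(ddd.shape)                  # reported above as 46147 lines
print([c for c in ddd.columns if c.startswith("cs_height")])
```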

  6. Housing Price Analysis and Prediction

    • kaggle.com
    Updated Feb 3, 2024
    Cite
    Ali Reda Elblgihy (2024). Housing Price Analysis and Prediction [Dataset]. https://www.kaggle.com/datasets/aliredaelblgihy/housing-price-analysis-and-prediction
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 3, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ali Reda Elblgihy
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Steps Throughout the Full Project:

    1- Initial Data Exploration: Introduction to the dataset and its variables. Identification of potential relationships between variables. Examination of data quality issues such as missing values and outliers.

    2- Correlation Analysis: Utilization of correlation matrices and heatmaps to identify relationships between variables. Focus on variables highly correlated with the target variable, 'SalePrice'.

    3- Handling Missing Data: Analysis of missing data prevalence and patterns. Deletion of variables with high percentages of missing data. Treatment of missing observations for remaining variables based on their importance.

    4- Dealing with Outliers: Identification and handling of outliers using data visualization and statistical methods. Removal of outliers that significantly deviate from the overall pattern.

    5- Testing Statistical Assumptions: Assessment of normality, homoscedasticity, linearity, and absence of correlated errors. Application of data transformations to meet statistical assumptions.

    6- Conversion of Categorical Variables: Conversion of categorical variables into dummy variables to prepare for modeling.
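    A hedged pandas sketch of steps 2-6 above; the file name and column names ('SalePrice', 'GrLivArea') follow the common Kaggle housing convention and are assumptions:

```python
import numpy as np
import pandas as pd

# Sketch of steps 2-6; 'train.csv', 'SalePrice', and 'GrLivArea' are assumptions.
df = pd.read_csv("train.csv")

# 2- Correlation with the target variable
corr = df.corr(numeric_only=True)["SalePrice"].sort_values(ascending=False)

# 3- Missing data: drop mostly-empty columns, impute the rest
missing = df.isna().mean()
df = df.drop(columns=missing[missing > 0.8].index)
df = df.fillna(df.median(numeric_only=True))

# 4- Outliers: remove extreme living-area observations
df = df[df["GrLivArea"] < df["GrLivArea"].quantile(0.995)]

# 5- Statistical assumptions: log-transform the right-skewed target
df["SalePrice"] = np.log1p(df["SalePrice"])

# 6- Categorical variables to dummy variables
df = pd.get_dummies(df, drop_first=True)
```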

    Summary: The project undertook a comprehensive analysis of housing price data, encompassing data exploration, correlation analysis, missing data handling, outlier detection, and testing of statistical assumptions. Through visualization and statistical methods, the project identified key relationships between variables and prepared the data for predictive modeling.

    Recommendations: Further exploration of advanced modeling techniques such as regularized linear regression and ensemble methods for predicting housing prices. Consideration of additional variables or feature engineering to improve model performance. Evaluation of model performance using cross-validation and other validation techniques. Documentation and communication of findings and recommendations for stakeholders or further research.

  7. Effect of sampling scenario and abundance transformation on FD index...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 1, 2023
    Cite
    Maria Májeková; Taavi Paal; Nichola S. Plowman; Michala Bryndová; Liis Kasari; Anna Norberg; Matthias Weiss; Tom R. Bishop; Sarah H. Luke; Katerina Sam; Yoann Le Bagousse-Pinguet; Jan Lepš; Lars Götzenberger; Francesco de Bello (2023). Effect of sampling scenario and abundance transformation on FD index sensitivity. [Dataset]. https://plos.figshare.com/articles/dataset/Effect_of_sampling_scenario_and_abundance_transformation_on_FD_index_sensitivity_/2296009
    Explore at:
    xls (available download format)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Maria Májeková; Taavi Paal; Nichola S. Plowman; Michala Bryndová; Liis Kasari; Anna Norberg; Matthias Weiss; Tom R. Bishop; Sarah H. Luke; Katerina Sam; Yoann Le Bagousse-Pinguet; Jan Lepš; Lars Götzenberger; Francesco de Bello
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Effect of sampling scenario and abundance transformation on FD index sensitivity.

  8. Transformed Customer Shopping Dataset with Advanced Feature Engineering and...

    • data.mendeley.com
    Updated Jul 21, 2025
    Cite
    Md Zinnahtur Rahman Zitu (2025). Transformed Customer Shopping Dataset with Advanced Feature Engineering and Anonymization [Dataset]. http://doi.org/10.17632/fnhyc6drm8.1
    Explore at:
    Dataset updated
    Jul 21, 2025
    Authors
    Md Zinnahtur Rahman Zitu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset represents a thoroughly transformed and enriched version of a publicly available customer shopping dataset. It has undergone comprehensive processing to ensure it is clean, privacy-compliant, and enriched with new features, making it highly suitable for advanced analytics, machine learning, and business research applications.

    The transformation process focused on creating a high-quality dataset that supports robust customer behavior analysis, segmentation, and anomaly detection, while maintaining strict privacy through anonymization and data validation.

    ➡ Data Cleaning and Preprocessing : Duplicates were removed. Missing numerical values (Age, Purchase Amount, Review Rating) were filled with medians; missing categorical values labeled “Unknown.” Text data were cleaned and standardized, and numeric fields were clipped to valid ranges.

    ➡ Feature Engineering : New informative variables were engineered to augment the dataset’s analytical power. These include:
    • Avg_Amount_Per_Purchase: Average purchase amount calculated by dividing total purchase value by the number of previous purchases, capturing spending behavior per transaction.
    • Age_Group: Categorical age segmentation into meaningful bins such as Teen, Young Adult, Adult, Senior, and Elder.
    • Purchase_Frequency_Score: Quantitative mapping of purchase frequency to annualized values to facilitate numerical analysis.
    • Discount_Impact: Monetary quantification of discount application effects on purchases.
    • Processing_Date: Timestamp indicating the dataset transformation date for provenance tracking.

    ➡ Data Filtering : Rows with ages outside 0–100 were removed. Only core categories (Clothing, Footwear, Outerwear, Accessories) and the top 25% of high-value customers by purchase amount were retained for focused analysis.

    ➡ Data Transformation : Key numeric features were standardized, and log transformations were applied to skewed data to improve model performance.

    ➡ Advanced Features : Created a category-wise average purchase and a loyalty score combining purchase frequency and volume.

    ➡ Segmentation & Anomaly Detection : Used KMeans to cluster customers into four groups and Isolation Forest to flag anomalies.

    ➡ Text Processing : Cleaned text fields and added a binary indicator for clothing items.

    ➡ Privacy : Hashed Customer ID and removed sensitive columns like Location to ensure privacy.

    ➡ Validation : Automated checks for data integrity, including negative values and valid ranges.
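    A hedged Python sketch of the main steps described above (feature engineering, KMeans segmentation, Isolation Forest anomaly flagging, and Customer ID hashing); the file and column names are assumptions based on the description, not the exact schema:

```python
import hashlib
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Column and file names are assumptions based on the description above.
df = pd.read_csv("customer_shopping.csv")

# Feature engineering
df["Avg_Amount_Per_Purchase"] = df["Purchase Amount"] / df["Previous Purchases"].clip(lower=1)
df["Age_Group"] = pd.cut(df["Age"], bins=[0, 19, 30, 50, 65, 120],
                         labels=["Teen", "Young Adult", "Adult", "Senior", "Elder"])

# Segmentation and anomaly detection
features = StandardScaler().fit_transform(df[["Age", "Purchase Amount", "Previous Purchases"]])
df["Cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
df["Anomaly"] = IsolationForest(random_state=0).fit_predict(features)   # -1 flags anomalies

# Privacy: hash Customer ID and drop sensitive columns
df["Customer ID"] = df["Customer ID"].astype(str).map(
    lambda s: hashlib.sha256(s.encode()).hexdigest()
)
df = df.drop(columns=["Location"], errors="ignore")
```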

    This transformed dataset supports a wide range of research and practical applications, including customer segmentation, purchase behavior modeling, marketing strategy development, fraud detection, and machine learning education. It serves as a reliable and privacy-aware resource for academics, data scientists, and business analysts.

  9. Hourly wind speed in miles per hour and associated three-digit data-source...

    • catalog.data.gov
    • search.dataone.org
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Hourly wind speed in miles per hour and associated three-digit data-source flag, January 1, 1948 - September 30, 2016 [Dataset]. https://catalog.data.gov/dataset/hourly-wind-speed-in-miles-per-hour-and-associated-three-digit-data-source-flag-january-30
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    The text file "Wind speed.txt" contains hourly data and associated data-source flag from January 1, 1948, to September 30, 2016. The primary source of the data is the Argonne National Laboratory, Illinois (ANL). The data-source flag consist of a three-digit sequence in the form "xyz" that describe the origin and transformations of the data values. They indicate if the data are original or missing, the method that was used to fill the missing periods, and any other transformations of the data. Missing and apparently erroneous data values were replaced with adjusted values from nearby stations used as “backup”. As stated in Over and others (2010), temporal variations in the statistical properties of the data resulting from changes in measurement and data storage methodologies were adjusted to match the statistical properties resulting from the data collection procedures that have been in place since January 1, 1989. The adjustments were computed based on the regressions between the primary data series from ANL and the backup series using data obtained during common periods; the statistical properties of the regressions were used to assign estimated standard errors to values that were adjusted or filled from other series. Each hourly value is assigned a corresponding data source flag that indicates the source of the value and its transformations. As described in Over and others (2010), each flag is of the form "xyz" that allows the user to determine its source and the methods used to process the data. During the period 01/09/2016 hour 21 to 01/10/2016 hour 24 both ANL and the primary backup station at St. Charles, Illinois had missing wind speed data. The O'Hare International Airport (ORD) is used as an alternate backup station and the new regression equation and the corresponding new flag for wind speed are established using daily wind data from ORD for the period 10/01/2007 through 09/30/2016 following the guideline described in Over and others (2010). Reference Cited: Over, T.M., Price, T.H., and Ishii, A.L., 2010, Development and analysis of a meteorological database, Argonne National Laboratory, Illinois: U.S. Geological Survey Open File Report 2010-1220, 67 p., http://pubs.usgs.gov/of/2010/1220/.

  10. Supplement 1. A file containing 471 data sets compiled from the literature...

    • wiley.figshare.com
    html
    Updated May 31, 2023
    Cite
    Xiao Xiao; Ethan P. White; Mevin B. Hooten; Susan L. Durham (2023). Supplement 1. A file containing 471 data sets compiled from the literature describing power-law relationships in ecology, evolution, and physiology. [Dataset]. http://doi.org/10.6084/m9.figshare.3551976.v1
    Explore at:
    html (available download format)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Wiley
    Authors
    Xiao Xiao; Ethan P. White; Mevin B. Hooten; Susan L. Durham
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    File List: Sup_1_Data.csv

    Description: The Sup_1_Data.csv file contains 471 data sets compiled from the literature describing power-law relationships between two variables in ecology, evolution, and physiology. For the full list of citations, see Appendix B of this paper.

    Column definitions:
    • Dataset ID
    • x – independent variable in the original dataset
    • y – dependent variable in the original dataset

    Checksum values are:
    • Column 1 (Dataset ID): SUM = 4575923; 0 missing values (rows with data: 24902)
    • Column 2 (independent variable x): SUM = 1413965769; 0 missing values (rows with data: 24902)
    • Column 3 (dependent variable y): SUM = 2137944719097.652; 0 missing values (rows with data: 24902)
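    A minimal Python sketch for verifying the published checksums after download (the header names are assumptions; the column order follows the definitions above):

```python
import pandas as pd

# Header names are assumptions; the column order follows the definitions above.
df = pd.read_csv("Sup_1_Data.csv")
df.columns = ["dataset_id", "x", "y"]

expected = {"dataset_id": 4575923, "x": 1413965769, "y": 2137944719097.652}
for col, target in expected.items():
    total = df[col].sum()
    print(f"{col}: sum={total:.3f}, missing={df[col].isna().sum()}, rows={df[col].notna().sum()}")
    assert abs(total - target) / abs(target) < 1e-6, f"checksum mismatch for {col}"
```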
    
  11. Superstore Sales Analysis

    • kaggle.com
    Updated Oct 21, 2023
    Cite
    Ali Reda Elblgihy (2023). Superstore Sales Analysis [Dataset]. https://www.kaggle.com/datasets/aliredaelblgihy/superstore-sales-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 21, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ali Reda Elblgihy
    Description

    Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:

    1- Data Import and Transformation:

    • Gather and import relevant sales data from various sources into Excel.
    • Utilize Power Query to clean, transform, and structure the data for analysis.
    • Merge and link different data sheets to create a cohesive dataset, ensuring that all data fields are connected logically.

    2- Data Quality Assessment:

    • Perform data quality checks to identify and address issues like missing values, duplicates, outliers, and data inconsistencies.
    • Standardize data formats and ensure that all data is in a consistent, usable state.

    3- Calculating COGS:

    • Determine the Cost of Goods Sold (COGS) for each product sold by considering factors like purchase price, shipping costs, and any additional expenses.
    • Apply appropriate formulas and calculations to determine COGS accurately.

    4- Discount Analysis:

    • Analyze the discount values offered on products to understand their impact on sales and profitability.
    • Calculate the average discount percentage, identify trends, and visualize the data using charts or graphs.

    5- Sales Metrics:

    • Calculate and analyze various sales metrics, such as total revenue, profit margins, and sales growth.
    • Utilize Excel functions to compute these metrics and create visuals for better insights.

    6- Visualization:

    • Create visualizations, such as charts, graphs, and pivot tables, to present the data in an understandable and actionable format.
    • Visual representations can help identify trends, outliers, and patterns in the data.

    7- Report Generation:

    • Compile the findings and insights into a well-structured report or dashboard, making it easy for stakeholders to understand and make informed decisions.

    Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
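    Although the project is built in Excel and Power Query, steps 3-5 (COGS, discount analysis, and sales metrics) can be sketched as a pandas analogue; the file and column names below are assumptions:

```python
import pandas as pd

# A pandas analogue of steps 3-5; the project itself uses Excel and Power Query,
# and the file and column names below are assumptions.
orders = pd.read_csv("superstore_orders.csv")
products = pd.read_csv("superstore_products.csv")
df = orders.merge(products, on="Product ID", how="left")   # link sheets into one dataset

# 3- COGS: per-unit purchase price plus shipping, times quantity
df["COGS"] = (df["Purchase Price"] + df["Shipping Cost"]) * df["Quantity"]

# 4- Discount analysis
df["Discount Value"] = df["Sales"] * df["Discount"]
avg_discount_pct = df["Discount"].mean() * 100

# 5- Sales metrics
total_revenue = df["Sales"].sum()
profit_margin = (total_revenue - df["COGS"].sum()) / total_revenue
print(f"avg discount: {avg_discount_pct:.1f}%  revenue: {total_revenue:,.0f}  margin: {profit_margin:.1%}")
```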

  12. EA-MD-QD: Large Euro Area and Euro Member Countries Datasets for...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 31, 2025
    + more versions
    Cite
    Barigozzi, Matteo (2025). EA-MD-QD: Large Euro Area and Euro Member Countries Datasets for Macroeconomic Research [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10514667
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    Barigozzi, Matteo
    Lissona, Claudio
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    EA-MD-QD is a collection of large monthly and quarterly EA and EA member countries datasets for macroeconomic analysis. The EA member countries covered are: AT, BE, DE, EL, ES, FR, IE, IT, NL, PT.

    The formal reference to this dataset is:

    Barigozzi, M. and Lissona, C. (2024) "EA-MD-QD: Large Euro Area and Euro Member Countries Datasets for Macroeconomic Research". Zenodo.

    Please refer to it when using the data.

    Each zip file contains:
    - Excel files for the EA and the countries covered, each containing an unbalanced panel of raw de-seasonalized data.
    - Matlab code that takes the raw data as input and allows the user to perform various operations, such as choosing the frequency, filling in missing values, transforming the data to stationarity, and controlling for Covid outliers.
    - A PDF file with all information about the series names, sources, and transformation codes.
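    The stationarity-transformation step can be sketched in Python; the code-to-transformation mapping below is a common convention and purely an assumption, since the dataset's actual transformation codes are documented in the accompanying PDF:

```python
import numpy as np
import pandas as pd

# Hypothetical code-to-transformation mapping; the dataset's own transformation
# codes are documented in the accompanying PDF.
def transform(series: pd.Series, code: int) -> pd.Series:
    if code == 1:                       # level
        return series
    if code == 2:                       # first difference
        return series.diff()
    if code == 4:                       # log level
        return np.log(series)
    if code == 5:                       # log first difference (growth rate)
        return np.log(series).diff()
    raise ValueError(f"unknown transformation code {code}")

raw = pd.Series([100.0, 102.0, 101.5, 103.2, 104.0], name="hypothetical_index")
print(transform(raw, 5))
```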

    This version (03.2025):

    Updated data as of 28-March-2025. We improved the Matlab code and included a ReadMe file with details on the user's parameter choices, which previously were only briefly commented in the code.

  13. Digitalisation and Green Transformation Synergies and Corporate Debt...

    • scidb.cn
    Updated Dec 26, 2024
    Cite
    cheng xiao li (2024). Digitalisation and Green Transformation Synergies and Corporate Debt Financing Costs Dataset of China's Shanghai and Shenzhen A-share Listed Companies, 2007-2022 [Dataset]. http://doi.org/10.57760/sciencedb.18904
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 26, 2024
    Dataset provided by
    Science Data Bank
    Authors
    cheng xiao li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Shenzhen, Shanghai, China
    Description

    The dataset draws on data for China's Shanghai and Shenzhen A-share listed companies from 2007-2022 taken from the Cathay Pacific database and the Dibbo database, together with the companies' annual reports collected from the Juchao information website. Python was used to perform text analysis of the annual reports and obtain word-frequency counts for digitalisation and greening-transformation terms; the data were then processed in Excel to remove missing values, and the indentation process was performed with the Stata software.
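    A hedged Python sketch of the word-frequency text analysis described above; the keyword lists and file layout are illustrative assumptions, not the study's actual dictionaries:

```python
import re
from collections import Counter
from pathlib import Path

# Keyword lists and file layout are illustrative assumptions, not the study's
# actual dictionaries of digitalisation and greening terms.
digital_terms = ["artificial intelligence", "big data", "cloud computing", "blockchain"]
green_terms = ["green", "environmental protection", "low carbon", "emission reduction"]

def keyword_counts(text: str, terms: list[str]) -> Counter:
    text = text.lower()
    return Counter({t: len(re.findall(re.escape(t), text)) for t in terms})

rows = []
for report in Path("annual_reports").glob("*.txt"):   # hypothetical directory of report texts
    text = report.read_text(encoding="utf-8", errors="ignore")
    rows.append({
        "company": report.stem,
        "digital_freq": sum(keyword_counts(text, digital_terms).values()),
        "green_freq": sum(keyword_counts(text, green_terms).values()),
    })
print(rows[:3])
```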
