MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.
It includes two parallel datasets:
1. Messy Dataset (Raw) – represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
2. Cleaned Dataset – demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.
Each record captures multiple attributes related to individuals in the Indian job market, including:
- Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording
The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form (a minimal pandas sketch of these steps follows the list):
- Missing Values: identified and handled using either row elimination (where critical data was missing) or imputation techniques.
- Duplicate Records: identified by row comparison and removed to prevent analytical skew.
- Inconsistent Formatting: unified inconsistent column naming (e.g., 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
- Incorrect Data Types: converted columns such as salary from string/object to float for numerical analysis.
- Outliers: detected and handled based on domain logic and distribution analysis.
- Categorization: converted numeric ages into grouped age categories for comparative analysis.
- Standardization: applied uniform labels for employment status, industry names, education, and AI risk levels for visualization clarity.
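The sketch below illustrates these cleaning steps in pandas. It is a minimal, hypothetical example: the file name, the age bins, and all column names other than 'monthly_salary_(inr)' are assumptions rather than values taken from the dataset itself.

```python
import pandas as pd

# Hypothetical file name; only 'monthly_salary_(inr)' is named in the dataset description.
df = pd.read_csv("messy_employment_india.csv")

# Duplicate records: drop exact row duplicates.
df = df.drop_duplicates()

# Inconsistent formatting: trim whitespace and normalize capitalization in text columns.
text_cols = df.select_dtypes(include="object").columns
df[text_cols] = df[text_cols].apply(lambda s: s.str.strip().str.title())

# Column renaming, e.g. 'monthly_salary_(inr)' -> 'Monthly Salary (INR)'.
df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})

# Incorrect data types: coerce salary to float (non-numeric entries become NaN).
df["Monthly Salary (INR)"] = pd.to_numeric(df["Monthly Salary (INR)"], errors="coerce")

# Missing values: drop rows missing a critical field, impute the rest with the median.
df = df.dropna(subset=["Employment Status"])  # assumed column name
df["Monthly Salary (INR)"] = df["Monthly Salary (INR)"].fillna(df["Monthly Salary (INR)"].median())

# Categorization: bin numeric ages into age groups (bin edges are illustrative).
df["Age Group"] = pd.cut(df["Age"], bins=[0, 25, 35, 50, 100],
                         labels=["18-25", "26-35", "36-50", "50+"])
```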
This dataset is ideal for learners and professionals who want to understand:
- The impact of messy data on visualization and insights
- How transformation steps can dramatically improve data interpretation
- Practical examples of preprocessing techniques before feeding data into ML models or BI tools
It's also useful for:
- Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines
By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Principal component analysis (PCA) is a popular dimension-reduction method to reduce the complexity and obtain the informative aspects of high-dimensional datasets. When the data distribution is skewed, data transformation is commonly used prior to applying PCA. Such transformation is usually obtained from previous studies, prior knowledge, or trial-and-error. In this work, we develop a model-based method that integrates data transformation in PCA and finds an appropriate data transformation using the maximum profile likelihood. Extensions of the method to handle functional data and missing values are also developed. Several numerical algorithms are provided for efficient computation. The proposed method is illustrated using simulated and real-world data examples. Supplementary materials for this article are available online.
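As context only, the sketch below shows the conventional workflow the paper improves on: transforming skewed variables (here with Box-Cox) before applying PCA. It is not an implementation of the proposed profile-likelihood method; the synthetic data and component count are arbitrary.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Skewed synthetic data (strictly positive, as Box-Cox requires).
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 6))

# Transform each variable separately; Box-Cox selects lambda by maximum likelihood per column.
X_t = np.column_stack([stats.boxcox(X[:, j])[0] for j in range(X.shape[1])])

# Standardize, then reduce to two principal components.
X_t = (X_t - X_t.mean(axis=0)) / X_t.std(axis=0)
pca = PCA(n_components=2).fit(X_t)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```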
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Functional diversity (FD) is an important component of biodiversity that quantifies the difference in functional traits between organisms. However, FD studies are often limited by the availability of trait data and FD indices are sensitive to data gaps. The distribution of species abundance and trait data, and its transformation, may further affect the accuracy of indices when data is incomplete. Using an existing approach, we simulated the effects of missing trait data by gradually removing data from a plant, an ant and a bird community dataset (12, 59, and 8 plots containing 62, 297 and 238 species respectively). We ranked plots by FD values calculated from full datasets and then from our increasingly incomplete datasets and compared the ranking between the original and virtually reduced datasets to assess the accuracy of FD indices when used on datasets with increasingly missing data. Finally, we tested the accuracy of FD indices with and without data transformation, and the effect of missing trait data per plot or per the whole pool of species. FD indices became less accurate as the amount of missing data increased, with the loss of accuracy depending on the index. But, where transformation improved the normality of the trait data, FD values from incomplete datasets were more accurate than before transformation. The distribution of data and its transformation are therefore as important as data completeness and can even mitigate the effect of missing data. Since the effect of missing trait values pool-wise or plot-wise depends on the data distribution, the method should be decided case by case. Data distribution and data transformation should be given more careful consideration when designing, analysing and interpreting FD studies, especially where trait data are missing. To this end, we provide the R package “traitor” to facilitate assessments of missing trait data.
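The sketch below illustrates the general protocol described above: gradually remove trait values, recompute an FD index per plot, and compare plot rankings against those from the complete data. It is not the authors' workflow or the "traitor" package; the random community data and the FD proxy used here (mean pairwise trait distance) are purely illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_plots, n_species, n_traits = 12, 60, 3

traits = rng.normal(size=(n_species, n_traits))       # species x traits
presence = rng.random((n_plots, n_species)) < 0.4     # which species occur in each plot

def fd_proxy(trait_matrix, mask):
    """Mean pairwise trait distance of the species present in a plot (species with NaN traits dropped)."""
    sub = trait_matrix[mask]
    sub = sub[~np.isnan(sub).any(axis=1)]
    return pdist(sub).mean() if len(sub) > 2 else np.nan

full = np.array([fd_proxy(traits, presence[p]) for p in range(n_plots)])

for frac in (0.1, 0.3, 0.5):
    degraded = traits.copy()
    degraded[rng.random(traits.shape) < frac] = np.nan   # remove a fraction of trait values
    reduced = np.array([fd_proxy(degraded, presence[p]) for p in range(n_plots)])
    rho, _ = spearmanr(full, reduced, nan_policy="omit")
    print(f"{int(frac * 100)}% missing traits: rank correlation with full data = {rho:.2f}")
```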
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/RSPAZQ
Data from: Doctoral dissertation; preprint article entitled "Managers' and physicians' perception of palm vein technology adoption in the healthcare industry." Formats of the files associated with the dataset: CSV; SAV.
SPSS setup files can be used to generate native SPSS file formats such as SPSS system files and SPSS portable files. SPSS setup files generally include the following sections:
- DATA LIST: Assigns the name, type, and decimal specification (if any) and specifies the beginning and ending column locations for each variable in the data file. Users must replace "physical-filename" with host computer-specific input file specifications. For example, users on Windows platforms should replace "physical-filename" with "C:\06512-0001-Data.txt" for the data file named "06512-0001-Data.txt" located on the root directory "C:".
- VARIABLE LABELS: Assigns descriptive labels to all variables. Variable labels and variable names may be identical for some variables.
- VALUE LABELS: Assigns descriptive labels to codes in the data file. Not all variables necessarily have assigned value labels.
- MISSING VALUES: Declares user-defined missing values. Not all variables in the data file necessarily have user-defined missing values. These values can be treated specially in data transformations, statistical calculations, and case selection.
- MISSING VALUE RECODE: Sets user-defined numeric missing values to missing as interpreted by the SPSS system. Only variables with user-defined missing values are included in the statements.
ABSTRACT: The purpose of the article is to examine the factors that influence the adoption of palm vein technology, considering healthcare managers' and physicians' perceptions and using the Unified Theory of Acceptance and Use of Technology as the theoretical foundation. A quantitative approach with an exploratory research design was used. A cross-sectional questionnaire was distributed to respondents who were managers and physicians in the healthcare industry and who had previous experience with palm vein technology. The perceived factors tested for correlation with adoption were perceived usefulness, complexity, security, peer influence, and relative advantage. A Pearson product-moment correlation coefficient was used to test the correlation between the perceived factors and palm vein technology adoption. The results showed that perceived usefulness, security, and peer influence are important factors for adoption. Study limitations included purposive sampling from a single industry (healthcare) and the limited literature available on managers' and physicians' perception of palm vein technology adoption in the healthcare industry. Future studies could examine the impact of mediating variables on palm vein technology adoption. The study offers managers insight into the important factors that need to be considered in adopting palm vein technology. With biometric technology becoming pervasive, the study seeks to provide managers with insight into managing the adoption of palm vein technology.
KEYWORDS: biometrics, human identification, image recognition, palm vein authentication, technology adoption, user acceptance, palm vein technology
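For readers working outside SPSS, the sketch below mirrors what the DATA LIST, VALUE LABELS, and MISSING VALUES sections accomplish, using pandas on a fixed-width file. The column positions, names, codes, and missing-value codes are hypothetical and are not taken from the actual setup files.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed-width layout: (start, end) column positions, as a DATA LIST would declare.
colspecs = [(0, 4), (4, 6), (6, 9)]
names = ["CASEID", "ROLE", "SCORE"]

df = pd.read_fwf("06512-0001-Data.txt", colspecs=colspecs, names=names)

# Rough equivalent of VALUE LABELS for a categorical code (codes are illustrative).
role_labels = {1: "Manager", 2: "Physician"}
df["ROLE_LABEL"] = df["ROLE"].map(role_labels)

# Rough equivalent of MISSING VALUES / MISSING VALUE RECODE: treat user-defined codes as missing.
user_missing = {"SCORE": [-9, -8]}
for col, codes in user_missing.items():
    df[col] = df[col].replace(codes, np.nan)
```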
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SUDOANG project aims at providing common tools to managers to support eel conservation in the SUDOE area (Spain, France and Portugal).
Three main datasets have been used to implement EDA.
Electrofishing data include site locations, fishing operations which can be done several times at one site, and fish data collected during each operation. The operations are attached to stretches of river or river segments whose characteristics describe the conditions for presence, density, size structure, or silvering rate of the eels.
The electrofishing operations are classified by type:
From 1985 to 2018; note that the dataset is incomplete after 2015 in France.
The SUDOE area, including the Iberian Peninsula and France.
Most variables are built here.
The script to create cumulated values for dams can be found here.
For a technical description see the report.
This file contains the following datasets:
And the following columns (in alphabetical order):
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Steps Throughout the Full Project:
1- Initial Data Exploration: Introduction to the dataset and its variables. Identification of potential relationships between variables. Examination of data quality issues such as missing values and outliers.
2- Correlation Analysis: Utilization of correlation matrices and heatmaps to identify relationships between variables. Focus on variables highly correlated with the target variable, 'SalePrice'.
3- Handling Missing Data: Analysis of missing data prevalence and patterns. Deletion of variables with high percentages of missing data. Treatment of missing observations for remaining variables based on their importance.
4- Dealing with Outliers: Identification and handling of outliers using data visualization and statistical methods. Removal of outliers that significantly deviate from the overall pattern.
5- Testing Statistical Assumptions: Assessment of normality, homoscedasticity, linearity, and absence of correlated errors. Application of data transformations to meet statistical assumptions.
6- Conversion of Categorical Variables: Conversion of categorical variables into dummy variables to prepare for modeling. (A minimal pandas sketch illustrating steps 2–6 follows this list.)
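The sketch below illustrates, in pandas, the kind of operations described in steps 2–6. It is not the project's actual code: the file path, the thresholds, and the column names other than 'SalePrice' (e.g., 'GrLivArea' from the common Kaggle/Ames housing data) are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical path to the housing data

# Step 2 - correlation analysis: variables most correlated with 'SalePrice'.
corr = df.corr(numeric_only=True)["SalePrice"].sort_values(ascending=False)
print(corr.head(10))

# Step 3 - missing data: drop columns with a high share of missing values, then drop remaining incomplete rows.
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.15].index)
df = df.dropna()

# Step 4 - outliers: remove points that deviate strongly from the overall pattern (threshold illustrative).
df = df[~((df["GrLivArea"] > 4000) & (df["SalePrice"] < 300000))]

# Step 5 - statistical assumptions: log-transform the skewed target to approximate normality.
df["SalePrice"] = np.log1p(df["SalePrice"])

# Step 6 - categorical variables: convert to dummy variables for modeling.
df = pd.get_dummies(df)
```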
Summary: The project undertook a comprehensive analysis of housing price data, encompassing data exploration, correlation analysis, missing data handling, outlier detection, and testing of statistical assumptions. Through visualization and statistical methods, the project identified key relationships between variables and prepared the data for predictive modeling.
Recommendations: Further exploration of advanced modeling techniques such as regularized linear regression and ensemble methods for predicting housing prices. Consideration of additional variables or feature engineering to improve model performance. Evaluation of model performance using cross-validation and other validation techniques. Documentation and communication of findings and recommendations for stakeholders or further research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Effect of sampling scenario and abundance transformation on FD index sensitivity.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset represents a thoroughly transformed and enriched version of a publicly available customer shopping dataset. It has undergone comprehensive processing to ensure it is clean, privacy-compliant, and enriched with new features, making it highly suitable for advanced analytics, machine learning, and business research applications.
The transformation process focused on creating a high-quality dataset that supports robust customer behavior analysis, segmentation, and anomaly detection, while maintaining strict privacy through anonymization and data validation.
➡ Data Cleaning and Preprocessing : Duplicates were removed. Missing numerical values (Age, Purchase Amount, Review Rating) were filled with medians; missing categorical values labeled “Unknown.” Text data were cleaned and standardized, and numeric fields were clipped to valid ranges.
➡ Feature Engineering : New informative variables were engineered to augment the dataset's analytical power (see the pandas sketch after this transformation summary). These include:
• Avg_Amount_Per_Purchase: Average purchase amount, calculated by dividing total purchase value by the number of previous purchases, capturing spending behavior per transaction.
• Age_Group: Categorical age segmentation into meaningful bins such as Teen, Young Adult, Adult, Senior, and Elder.
• Purchase_Frequency_Score: Quantitative mapping of purchase frequency to annualized values to facilitate numerical analysis.
• Discount_Impact: Monetary quantification of discount application effects on purchases.
• Processing_Date: Timestamp indicating the dataset transformation date for provenance tracking.
➡ Data Filtering : Rows with ages outside 0–100 were removed. Only core categories (Clothing, Footwear, Outerwear, Accessories) and the top 25% of high-value customers by purchase amount were retained for focused analysis.
➡ Data Transformation : Key numeric features were standardized, and log transformations were applied to skewed data to improve model performance.
➡ Advanced Features : Created a category-wise average purchase and a loyalty score combining purchase frequency and volume.
➡ Segmentation & Anomaly Detection : Used KMeans to cluster customers into four groups and Isolation Forest to flag anomalies.
➡ Text Processing : Cleaned text fields and added a binary indicator for clothing items.
➡ Privacy : Hashed Customer ID and removed sensitive columns like Location to ensure privacy.
➡ Validation : Automated checks for data integrity, including negative values and valid ranges.
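The sketch below illustrates several of the steps above (feature engineering, KMeans segmentation, Isolation Forest anomaly detection, and Customer ID hashing) with pandas and scikit-learn. It is a simplified reconstruction, not the original pipeline; the source file, the column names (e.g., 'Age', 'Purchase Amount (USD)', 'Previous Purchases', 'Frequency of Purchases', 'Customer ID'), and the frequency mapping are assumptions.

```python
import hashlib
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("shopping_behavior.csv")  # hypothetical source file

# Feature engineering (column names and the frequency mapping are assumptions).
df["Avg_Amount_Per_Purchase"] = df["Purchase Amount (USD)"] / df["Previous Purchases"].clip(lower=1)
df["Age_Group"] = pd.cut(df["Age"], bins=[0, 19, 30, 50, 65, 100],
                         labels=["Teen", "Young Adult", "Adult", "Senior", "Elder"])
freq_map = {"Weekly": 52, "Fortnightly": 26, "Monthly": 12, "Quarterly": 4, "Annually": 1}
df["Purchase_Frequency_Score"] = df["Frequency of Purchases"].map(freq_map)

# Standardize the numeric features used for segmentation.
features = df[["Purchase Amount (USD)", "Previous Purchases", "Purchase_Frequency_Score"]].fillna(0)
X = StandardScaler().fit_transform(features)

# Segmentation into four clusters and anomaly flagging.
df["Segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
df["Anomaly"] = IsolationForest(random_state=0).fit_predict(X) == -1

# Privacy: replace Customer ID with a SHA-256 hash.
df["Customer ID"] = df["Customer ID"].astype(str).map(
    lambda s: hashlib.sha256(s.encode()).hexdigest())
```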
This transformed dataset supports a wide range of research and practical applications, including customer segmentation, purchase behavior modeling, marketing strategy development, fraud detection, and machine learning education. It serves as a reliable and privacy-aware resource for academics, data scientists, and business analysts.
The text file "Wind speed.txt" contains hourly data and an associated data-source flag from January 1, 1948, to September 30, 2016. The primary source of the data is the Argonne National Laboratory, Illinois (ANL). The data-source flag is a three-digit sequence of the form "xyz" that describes the origin and transformations of the data values: it indicates whether the data are original or missing, the method that was used to fill the missing periods, and any other transformations of the data. Missing and apparently erroneous data values were replaced with adjusted values from nearby stations used as "backup". As stated in Over and others (2010), temporal variations in the statistical properties of the data resulting from changes in measurement and data-storage methodologies were adjusted to match the statistical properties resulting from the data-collection procedures that have been in place since January 1, 1989. The adjustments were computed from regressions between the primary data series from ANL and the backup series using data obtained during common periods; the statistical properties of the regressions were used to assign estimated standard errors to values that were adjusted or filled from other series. Each hourly value is assigned a corresponding data-source flag that indicates the source of the value and its transformations. As described in Over and others (2010), each flag is of the form "xyz", which allows the user to determine its source and the methods used to process the data. During the period 01/09/2016 hour 21 to 01/10/2016 hour 24, both ANL and the primary backup station at St. Charles, Illinois, had missing wind speed data. O'Hare International Airport (ORD) is used as an alternate backup station, and the new regression equation and the corresponding new flag for wind speed were established using daily wind data from ORD for the period 10/01/2007 through 09/30/2016, following the guidelines described in Over and others (2010). Reference cited: Over, T.M., Price, T.H., and Ishii, A.L., 2010, Development and analysis of a meteorological database, Argonne National Laboratory, Illinois: U.S. Geological Survey Open-File Report 2010-1220, 67 p., http://pubs.usgs.gov/of/2010/1220/.
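The sketch below illustrates the general idea of filling gaps in a primary series using a regression against a backup station fitted over a common period, as described in Over and others (2010). It is a simplified illustration on synthetic data, not the USGS processing code, and the flag value assigned here is a placeholder rather than an actual "xyz" code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic hourly wind speeds: a primary station with a gap and a correlated backup station.
idx = pd.date_range("2016-01-01", periods=500, freq="h")
backup = pd.Series(rng.gamma(shape=2.0, scale=2.5, size=len(idx)), index=idx)
primary = 0.9 * backup + rng.normal(0, 0.5, len(idx))
primary.iloc[200:230] = np.nan                     # a missing period to be filled

# Fit a regression on the common (non-missing) period.
common = primary.notna()
slope, intercept = np.polyfit(backup[common], primary[common], deg=1)

# Fill the gap with regression estimates and record a flag for filled values.
filled = primary.copy()
flags = pd.Series("original", index=idx)
gap = primary.isna()
filled[gap] = slope * backup[gap] + intercept
flags[gap] = "filled-from-backup"                  # placeholder, not an actual "xyz" flag code

print(filled.iloc[195:210])
```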
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
File List: Sup_1_Data.csv
Description: The Sup_1_Data.csv file contains 471 data sets compiled from the literature describing power-law relationships between two variables in ecology, evolution, and physiology. For the full list of citations, see Appendix B of this paper. Column definitions:
Dataset ID
x – independent variable in the original dataset
y – dependent variable in the original dataset
Checksum values are as follows (a short pandas verification sketch follows the list):
Column 1 (Dataset ID): SUM = 4575923; 0 missing values (rows with data: 24902)
Column 2 (independent variable x): SUM = 1413965769; 0 missing values (rows with data: 24902)
Column 3 (dependent variable y): SUM = 2137944719097.652; 0 missing values (rows with data: 24902)
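A quick way to reproduce these checksums, assuming the three-column layout described above (the column names pandas infers are whatever appears in the file header):

```python
import pandas as pd

df = pd.read_csv("Sup_1_Data.csv")

# Verify column sums, missing-value counts, and row counts against the published checksums.
expected_sums = [4575923, 1413965769, 2137944719097.652]
for col, expected in zip(df.columns, expected_sums):
    print(col,
          "| sum =", df[col].sum(),
          "| missing =", df[col].isna().sum(),
          "| rows with data =", df[col].notna().sum(),
          "| expected sum =", expected)
```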
Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:
1- Data Import and Transformation:
2- Data Quality Assessment:
3- Calculating COGS:
4- Discount Analysis:
5- Sales Metrics:
6- Visualization:
7- Report Generation:
Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
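Since the original workflow lives in Excel and Power Query, the sketch below only illustrates, in pandas, the sort of calculations steps 3–5 describe. The workbook name, the column names, and the COGS formula (Sales minus Profit) are assumptions about a typical Superstore-style dataset, not the project's actual logic.

```python
import pandas as pd

orders = pd.read_excel("superstore.xlsx", sheet_name="Orders")  # hypothetical workbook

# Step 3 - COGS: with only Sales and Profit available, approximate COGS as Sales - Profit.
orders["COGS"] = orders["Sales"] - orders["Profit"]

# Step 4 - discount value: Discount is assumed to be stored as a fraction of the sale amount.
orders["Discount Value"] = orders["Sales"] * orders["Discount"]

# Step 5 - a few summary sales metrics by category.
summary = orders.groupby("Category").agg(
    total_sales=("Sales", "sum"),
    total_cogs=("COGS", "sum"),
    total_discount=("Discount Value", "sum"),
)
print(summary)
```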
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
EA-MD-QD is a collection of large monthly and quarterly EA and EA member countries datasets for macroeconomic analysis. The EA member countries covered are: AT, BE, DE, EL, ES, FR, IE, IT, NL, PT.
The formal reference to this dataset is:
Barigozzi, M. and Lissona, C. (2024) "EA-MD-QD: Large Euro Area and Euro Member Countries Datasets for Macroeconomic Research". Zenodo.
Please refer to it when using the data.
Each zip file contains:
- Excel files for the EA and the countries covered, each containing an unbalanced panel of raw de-seasonalized data.
- Matlab code that takes the raw data as input and performs various operations, such as choosing the frequency, filling in missing values, transforming the data to stationarity, and controlling for covid outliers.
- A pdf file with all information about the series names, sources, and transformation codes.
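For readers not using the provided Matlab code, the sketch below shows, in pandas, the kind of operations it automates: filling in missing values and applying stationarity-inducing transformations per series. The transformation-code scheme used here (1 = level, 2 = first difference, 5 = log first difference) is a common FRED-MD-style convention chosen for illustration; the dataset's own codes are documented in its pdf file, and the series names below are hypothetical.

```python
import numpy as np
import pandas as pd

def apply_tcode(series: pd.Series, tcode: int) -> pd.Series:
    """Apply an illustrative transformation code to one series."""
    s = series.interpolate(limit_direction="both")   # simple fill-in of missing values
    if tcode == 1:                                    # level
        return s
    if tcode == 2:                                    # first difference
        return s.diff()
    if tcode == 5:                                    # log first difference (growth rate)
        return np.log(s).diff()
    raise ValueError(f"unsupported transformation code: {tcode}")

# Example: a toy monthly panel with hypothetical series names and codes.
raw = pd.DataFrame({"IP": [100.0, 101.2, np.nan, 103.0], "UNRATE": [6.5, 6.4, 6.3, 6.2]})
tcodes = {"IP": 5, "UNRATE": 2}
stationary = pd.DataFrame({name: apply_tcode(raw[name], code) for name, code in tcodes.items()})
print(stationary)
```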
This version (03.2025):
Updated data as of 28 March 2025. We improved the Matlab code and included a ReadMe file with details on the user-set parameters, which were previously only briefly commented in the code.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset covers China's Shanghai and Shenzhen A-share listed companies from 2007 to 2022, drawing on the Cathay Pacific database and the Dibbo database. Annual reports of the relevant companies were collected from the Juchao information website and analysed with Python to obtain word-frequency counts for digitalisation and greening transformation. Excel was used for data processing to remove missing values, and winsorization (tail trimming) was performed in Stata.
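A minimal illustration of the kind of keyword-frequency text analysis described above; the keyword dictionaries and the sample report text are hypothetical and do not reproduce the study's actual word lists.

```python
import re

# Hypothetical keyword dictionaries for digitalisation and green transformation.
KEYWORDS = {
    "digitalisation": ["big data", "cloud computing", "artificial intelligence", "blockchain"],
    "greening": ["green innovation", "emission reduction", "clean energy", "environmental protection"],
}

def keyword_frequencies(report_text: str) -> dict:
    """Count occurrences of each dictionary's keywords in one annual report."""
    text = report_text.lower()
    return {
        topic: sum(len(re.findall(re.escape(word), text)) for word in words)
        for topic, words in KEYWORDS.items()
    }

# Example with a toy report snippet.
sample = "The company invests in big data, cloud computing and green innovation initiatives."
print(keyword_frequencies(sample))
```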