MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.
It includes two parallel datasets:
1. Messy Dataset (Raw) – represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
2. Cleaned Dataset – demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.
Each record captures multiple attributes related to individuals in the Indian job market, including:
- Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording
The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form (a minimal pandas sketch of these steps follows the list):
- Missing Values: identified and handled using either row elimination (where critical data was missing) or imputation techniques.
- Duplicate Records: identified by row comparison and removed to prevent analytical skew.
- Inconsistent Formatting: unified inconsistent column naming (e.g., 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
- Incorrect Data Types: converted columns such as salary from string/object to float for numerical analysis.
- Outliers: detected and handled based on domain logic and distribution analysis.
- Categorization: converted numeric ages into grouped age categories for comparative analysis.
- Standardization: applied uniform labels for employment status, industry names, education, and AI risk levels for visualization clarity.
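The sketch below illustrates these cleaning steps in pandas. It is a minimal, hypothetical example: the file name, the age bins, and all column names other than 'monthly_salary_(inr)' are assumptions rather than values taken from the dataset itself.

```python
import pandas as pd

# Hypothetical file name; only 'monthly_salary_(inr)' is named in the dataset description.
df = pd.read_csv("messy_employment_india.csv")

# Duplicate records: drop exact row duplicates.
df = df.drop_duplicates()

# Inconsistent formatting: trim whitespace and normalize capitalization in text columns.
text_cols = df.select_dtypes(include="object").columns
df[text_cols] = df[text_cols].apply(lambda s: s.str.strip().str.title())

# Column renaming, e.g. 'monthly_salary_(inr)' -> 'Monthly Salary (INR)'.
df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})

# Incorrect data types: coerce salary to float (non-numeric entries become NaN).
df["Monthly Salary (INR)"] = pd.to_numeric(df["Monthly Salary (INR)"], errors="coerce")

# Missing values: drop rows missing a critical field, impute the rest with the median.
df = df.dropna(subset=["Employment Status"])  # assumed column name
df["Monthly Salary (INR)"] = df["Monthly Salary (INR)"].fillna(df["Monthly Salary (INR)"].median())

# Categorization: bin numeric ages into age groups (bin edges are illustrative).
df["Age Group"] = pd.cut(df["Age"], bins=[0, 25, 35, 50, 100],
                         labels=["18-25", "26-35", "36-50", "50+"])
```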
This dataset is ideal for learners and professionals who want to understand:
- The impact of messy data on visualization and insights
- How transformation steps can dramatically improve data interpretation
- Practical examples of preprocessing techniques before feeding data into ML models or BI tools
It's also useful for:
- Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines
By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Principal component analysis (PCA) is a popular dimension-reduction method to reduce the complexity and obtain the informative aspects of high-dimensional datasets. When the data distribution is skewed, data transformation is commonly used prior to applying PCA. Such transformation is usually obtained from previous studies, prior knowledge, or trial-and-error. In this work, we develop a model-based method that integrates data transformation in PCA and finds an appropriate data transformation using the maximum profile likelihood. Extensions of the method to handle functional data and missing values are also developed. Several numerical algorithms are provided for efficient computation. The proposed method is illustrated using simulated and real-world data examples. Supplementary materials for this article are available online.
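As context only, the sketch below shows the conventional workflow the paper improves on: transforming skewed variables (here with Box-Cox) before applying PCA. It is not an implementation of the proposed profile-likelihood method; the synthetic data and component count are arbitrary.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Skewed synthetic data (strictly positive, as Box-Cox requires).
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 6))

# Transform each variable separately; Box-Cox selects lambda by maximum likelihood per column.
X_t = np.column_stack([stats.boxcox(X[:, j])[0] for j in range(X.shape[1])])

# Standardize, then reduce to two principal components.
X_t = (X_t - X_t.mean(axis=0)) / X_t.std(axis=0)
pca = PCA(n_components=2).fit(X_t)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```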
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Functional diversity (FD) is an important component of biodiversity that quantifies the difference in functional traits between organisms. However, FD studies are often limited by the availability of trait data and FD indices are sensitive to data gaps. The distribution of species abundance and trait data, and its transformation, may further affect the accuracy of indices when data is incomplete. Using an existing approach, we simulated the effects of missing trait data by gradually removing data from a plant, an ant and a bird community dataset (12, 59, and 8 plots containing 62, 297 and 238 species respectively). We ranked plots by FD values calculated from full datasets and then from our increasingly incomplete datasets and compared the ranking between the original and virtually reduced datasets to assess the accuracy of FD indices when used on datasets with increasingly missing data. Finally, we tested the accuracy of FD indices with and without data transformation, and the effect of missing trait data per plot or per the whole pool of species. FD indices became less accurate as the amount of missing data increased, with the loss of accuracy depending on the index. But, where transformation improved the normality of the trait data, FD values from incomplete datasets were more accurate than before transformation. The distribution of data and its transformation are therefore as important as data completeness and can even mitigate the effect of missing data. Since the effect of missing trait values pool-wise or plot-wise depends on the data distribution, the method should be decided case by case. Data distribution and data transformation should be given more careful consideration when designing, analysing and interpreting FD studies, especially where trait data are missing. To this end, we provide the R package “traitor” to facilitate assessments of missing trait data.
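The sketch below illustrates the general protocol described above: gradually remove trait values, recompute an FD index per plot, and compare plot rankings against those from the complete data. It is not the authors' workflow or the "traitor" package; the random community data and the FD proxy used here (mean pairwise trait distance) are purely illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_plots, n_species, n_traits = 12, 60, 3

traits = rng.normal(size=(n_species, n_traits))       # species x traits
presence = rng.random((n_plots, n_species)) < 0.4     # which species occur in each plot

def fd_proxy(trait_matrix, mask):
    """Mean pairwise trait distance of the species present in a plot (species with NaN traits dropped)."""
    sub = trait_matrix[mask]
    sub = sub[~np.isnan(sub).any(axis=1)]
    return pdist(sub).mean() if len(sub) > 2 else np.nan

full = np.array([fd_proxy(traits, presence[p]) for p in range(n_plots)])

for frac in (0.1, 0.3, 0.5):
    degraded = traits.copy()
    degraded[rng.random(traits.shape) < frac] = np.nan   # remove a fraction of trait values
    reduced = np.array([fd_proxy(degraded, presence[p]) for p in range(n_plots)])
    rho, _ = spearmanr(full, reduced, nan_policy="omit")
    print(f"{int(frac * 100)}% missing traits: rank correlation with full data = {rho:.2f}")
```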
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/RSPAZQ
Data from: Doctoral dissertation; preprint article entitled "Managers' and physicians' perception of palm vein technology adoption in the healthcare industry." Formats of the files associated with the dataset: CSV; SAV.
SPSS setup files can be used to generate native SPSS file formats such as SPSS system files and SPSS portable files. SPSS setup files generally include the following sections:
- DATA LIST: Assigns the name, type, and decimal specification (if any) and specifies the beginning and ending column locations for each variable in the data file. Users must replace "physical-filename" with host computer-specific input file specifications. For example, users on Windows platforms should replace "physical-filename" with "C:\06512-0001-Data.txt" for the data file named "06512-0001-Data.txt" located on the root directory "C:".
- VARIABLE LABELS: Assigns descriptive labels to all variables. Variable labels and variable names may be identical for some variables.
- VALUE LABELS: Assigns descriptive labels to codes in the data file. Not all variables necessarily have assigned value labels.
- MISSING VALUES: Declares user-defined missing values. Not all variables in the data file necessarily have user-defined missing values. These values can be treated specially in data transformations, statistical calculations, and case selection.
- MISSING VALUE RECODE: Sets user-defined numeric missing values to missing as interpreted by the SPSS system. Only variables with user-defined missing values are included in the statements.
ABSTRACT: The purpose of the article is to examine the factors that influence the adoption of palm vein technology, considering healthcare managers' and physicians' perceptions and using the Unified Theory of Acceptance and Use of Technology as the theoretical foundation. A quantitative approach with an exploratory research design was used. A cross-sectional questionnaire was distributed to respondents who were managers and physicians in the healthcare industry and who had previous experience with palm vein technology. The perceived factors tested for correlation with adoption were perceived usefulness, complexity, security, peer influence, and relative advantage. A Pearson product-moment correlation coefficient was used to test the correlation between the perceived factors and palm vein technology adoption. The results showed that perceived usefulness, security, and peer influence are important factors for adoption. Study limitations included purposive sampling from a single industry (healthcare) and the limited literature available on managers' and physicians' perception of palm vein technology adoption in the healthcare industry. Future studies could examine the impact of mediating variables on palm vein technology adoption. The study offers managers insight into the important factors that need to be considered in adopting palm vein technology. With biometric technology becoming pervasive, the study seeks to provide managers with insight into managing the adoption of palm vein technology.
KEYWORDS: biometrics, human identification, image recognition, palm vein authentication, technology adoption, user acceptance, palm vein technology
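For readers working outside SPSS, the sketch below mirrors what the DATA LIST, VALUE LABELS, and MISSING VALUES sections accomplish, using pandas on a fixed-width file. The column positions, names, codes, and missing-value codes are hypothetical and are not taken from the actual setup files.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed-width layout: (start, end) column positions, as a DATA LIST would declare.
colspecs = [(0, 4), (4, 6), (6, 9)]
names = ["CASEID", "ROLE", "SCORE"]

df = pd.read_fwf("06512-0001-Data.txt", colspecs=colspecs, names=names)

# Rough equivalent of VALUE LABELS for a categorical code (codes are illustrative).
role_labels = {1: "Manager", 2: "Physician"}
df["ROLE_LABEL"] = df["ROLE"].map(role_labels)

# Rough equivalent of MISSING VALUES / MISSING VALUE RECODE: treat user-defined codes as missing.
user_missing = {"SCORE": [-9, -8]}
for col, codes in user_missing.items():
    df[col] = df[col].replace(codes, np.nan)
```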
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SUDOANG project aims at providing common tools to managers to support eel conservation in the SUDOE area (Spain, France and Portugal).
Three main datasets have been used to implement EDA.
Electrofishing data include site locations, fishing operations which can be done several times at one site, and fish data collected during each operation. The operations are attached to stretches of river or river segments whose characteristics describe the conditions for presence, density, size structure, or silvering rate of the eels.
The electrofishing operations are classified by type:
From 1985 to 2018; note that the dataset is incomplete after 2015 in France.
The SUDOE area, including the Iberian Peninsula and France.
Most variables are built here.
The script to create cumulated values for dams can be found here.
For a technical description see the report.
This file contains the following datasets:
And the following columns (in alphabetical order):
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Steps Throughout the Full Project:
1- Initial Data Exploration: Introduction to the dataset and its variables. Identification of potential relationships between variables. Examination of data quality issues such as missing values and outliers.
2- Correlation Analysis: Utilization of correlation matrices and heatmaps to identify relationships between variables. Focus on variables highly correlated with the target variable, 'SalePrice'.
3- Handling Missing Data: Analysis of missing data prevalence and patterns. Deletion of variables with high percentages of missing data. Treatment of missing observations for remaining variables based on their importance.
4- Dealing with Outliers: Identification and handling of outliers using data visualization and statistical methods. Removal of outliers that significantly deviate from the overall pattern.
5- Testing Statistical Assumptions: Assessment of normality, homoscedasticity, linearity, and absence of correlated errors. Application of data transformations to meet statistical assumptions.
6- Conversion of Categorical Variables: Conversion of categorical variables into dummy variables to prepare for modeling. (A minimal pandas sketch illustrating steps 2–6 follows this list.)
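The sketch below illustrates, in pandas, the kind of operations described in steps 2–6. It is not the project's actual code: the file path, the thresholds, and the column names other than 'SalePrice' (e.g., 'GrLivArea' from the common Kaggle/Ames housing data) are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical path to the housing data

# Step 2 - correlation analysis: variables most correlated with 'SalePrice'.
corr = df.corr(numeric_only=True)["SalePrice"].sort_values(ascending=False)
print(corr.head(10))

# Step 3 - missing data: drop columns with a high share of missing values, then drop remaining incomplete rows.
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.15].index)
df = df.dropna()

# Step 4 - outliers: remove points that deviate strongly from the overall pattern (threshold illustrative).
df = df[~((df["GrLivArea"] > 4000) & (df["SalePrice"] < 300000))]

# Step 5 - statistical assumptions: log-transform the skewed target to approximate normality.
df["SalePrice"] = np.log1p(df["SalePrice"])

# Step 6 - categorical variables: convert to dummy variables for modeling.
df = pd.get_dummies(df)
```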
Summary: The project undertook a comprehensive analysis of housing price data, encompassing data exploration, correlation analysis, missing data handling, outlier detection, and testing of statistical assumptions. Through visualization and statistical methods, the project identified key relationships between variables and prepared the data for predictive modeling.
Recommendations: Further exploration of advanced modeling techniques such as regularized linear regression and ensemble methods for predicting housing prices. Consideration of additional variables or feature engineering to improve model performance. Evaluation of model performance using cross-validation and other validation techniques. Documentation and communication of findings and recommendations for stakeholders or further research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Effect of sampling scenario and abundance transformation on FD index sensitivity.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset represents a thoroughly transformed and enriched version of a publicly available customer shopping dataset. It has undergone comprehensive processing to ensure it is clean, privacy-compliant, and enriched with new features, making it highly suitable for advanced analytics, machine learning, and business research applications.
The transformation process focused on creating a high-quality dataset that supports robust customer behavior analysis, segmentation, and anomaly detection, while maintaining strict privacy through anonymization and data validation.
➡ Data Cleaning and Preprocessing : Duplicates were removed. Missing numerical values (Age, Purchase Amount, Review Rating) were filled with medians; missing categorical values labeled “Unknown.” Text data were cleaned and standardized, and numeric fields were clipped to valid ranges.
➡ Feature Engineering : New informative variables were engineered to augment the dataset's analytical power (see the pandas sketch after this transformation summary). These include:
• Avg_Amount_Per_Purchase: Average purchase amount, calculated by dividing total purchase value by the number of previous purchases, capturing spending behavior per transaction.
• Age_Group: Categorical age segmentation into meaningful bins such as Teen, Young Adult, Adult, Senior, and Elder.
• Purchase_Frequency_Score: Quantitative mapping of purchase frequency to annualized values to facilitate numerical analysis.
• Discount_Impact: Monetary quantification of discount application effects on purchases.
• Processing_Date: Timestamp indicating the dataset transformation date for provenance tracking.
➡ Data Filtering : Rows with ages outside 0–100 were removed. Only core categories (Clothing, Footwear, Outerwear, Accessories) and the top 25% of high-value customers by purchase amount were retained for focused analysis.
➡ Data Transformation : Key numeric features were standardized, and log transformations were applied to skewed data to improve model performance.
➡ Advanced Features : Created a category-wise average purchase and a loyalty score combining purchase frequency and volume.
➡ Segmentation & Anomaly Detection : Used KMeans to cluster customers into four groups and Isolation Forest to flag anomalies.
➡ Text Processing : Cleaned text fields and added a binary indicator for clothing items.
➡ Privacy : Hashed Customer ID and removed sensitive columns like Location to ensure privacy.
➡ Validation : Automated checks for data integrity, including negative values and valid ranges.
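The sketch below illustrates several of the steps above (feature engineering, KMeans segmentation, Isolation Forest anomaly detection, and Customer ID hashing) with pandas and scikit-learn. It is a simplified reconstruction, not the original pipeline; the source file, the column names (e.g., 'Age', 'Purchase Amount (USD)', 'Previous Purchases', 'Frequency of Purchases', 'Customer ID'), and the frequency mapping are assumptions.

```python
import hashlib
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("shopping_behavior.csv")  # hypothetical source file

# Feature engineering (column names and the frequency mapping are assumptions).
df["Avg_Amount_Per_Purchase"] = df["Purchase Amount (USD)"] / df["Previous Purchases"].clip(lower=1)
df["Age_Group"] = pd.cut(df["Age"], bins=[0, 19, 30, 50, 65, 100],
                         labels=["Teen", "Young Adult", "Adult", "Senior", "Elder"])
freq_map = {"Weekly": 52, "Fortnightly": 26, "Monthly": 12, "Quarterly": 4, "Annually": 1}
df["Purchase_Frequency_Score"] = df["Frequency of Purchases"].map(freq_map)

# Standardize the numeric features used for segmentation.
features = df[["Purchase Amount (USD)", "Previous Purchases", "Purchase_Frequency_Score"]].fillna(0)
X = StandardScaler().fit_transform(features)

# Segmentation into four clusters and anomaly flagging.
df["Segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
df["Anomaly"] = IsolationForest(random_state=0).fit_predict(X) == -1

# Privacy: replace Customer ID with a SHA-256 hash.
df["Customer ID"] = df["Customer ID"].astype(str).map(
    lambda s: hashlib.sha256(s.encode()).hexdigest())
```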
This transformed dataset supports a wide range of research and practical applications, including customer segmentation, purchase behavior modeling, marketing strategy development, fraud detection, and machine learning education. It serves as a reliable and privacy-aware resource for academics, data scientists, and business analysts.
The text file "Wind speed.txt" contains hourly data and an associated data-source flag from January 1, 1948, to September 30, 2016. The primary source of the data is the Argonne National Laboratory, Illinois (ANL). The data-source flag is a three-digit sequence of the form "xyz" that describes the origin and transformations of the data values: it indicates whether the data are original or missing, the method that was used to fill the missing periods, and any other transformations of the data. Missing and apparently erroneous data values were replaced with adjusted values from nearby stations used as "backup". As stated in Over and others (2010), temporal variations in the statistical properties of the data resulting from changes in measurement and data-storage methodologies were adjusted to match the statistical properties resulting from the data-collection procedures that have been in place since January 1, 1989. The adjustments were computed from regressions between the primary data series from ANL and the backup series using data obtained during common periods; the statistical properties of the regressions were used to assign estimated standard errors to values that were adjusted or filled from other series. Each hourly value is assigned a corresponding data-source flag that indicates the source of the value and its transformations. As described in Over and others (2010), each flag is of the form "xyz", which allows the user to determine its source and the methods used to process the data. During the period 01/09/2016 hour 21 to 01/10/2016 hour 24, both ANL and the primary backup station at St. Charles, Illinois, had missing wind speed data. O'Hare International Airport (ORD) is used as an alternate backup station, and the new regression equation and the corresponding new flag for wind speed were established using daily wind data from ORD for the period 10/01/2007 through 09/30/2016, following the guidelines described in Over and others (2010). Reference cited: Over, T.M., Price, T.H., and Ishii, A.L., 2010, Development and analysis of a meteorological database, Argonne National Laboratory, Illinois: U.S. Geological Survey Open-File Report 2010-1220, 67 p., http://pubs.usgs.gov/of/2010/1220/.
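The sketch below illustrates the general idea of filling gaps in a primary series using a regression against a backup station fitted over a common period, as described in Over and others (2010). It is a simplified illustration on synthetic data, not the USGS processing code, and the flag value assigned here is a placeholder rather than an actual "xyz" code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic hourly wind speeds: a primary station with a gap and a correlated backup station.
idx = pd.date_range("2016-01-01", periods=500, freq="h")
backup = pd.Series(rng.gamma(shape=2.0, scale=2.5, size=len(idx)), index=idx)
primary = 0.9 * backup + rng.normal(0, 0.5, len(idx))
primary.iloc[200:230] = np.nan                     # a missing period to be filled

# Fit a regression on the common (non-missing) period.
common = primary.notna()
slope, intercept = np.polyfit(backup[common], primary[common], deg=1)

# Fill the gap with regression estimates and record a flag for filled values.
filled = primary.copy()
flags = pd.Series("original", index=idx)
gap = primary.isna()
filled[gap] = slope * backup[gap] + intercept
flags[gap] = "filled-from-backup"                  # placeholder, not an actual "xyz" flag code

print(filled.iloc[195:210])
```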
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
File List: Sup_1_Data.csv
Description: The Sup_1_Data.csv file contains 471 data sets compiled from the literature describing power-law relationships between two variables in ecology, evolution, and physiology. For the full list of citations, see Appendix B of this paper. Column definitions:
Dataset ID
x – independent variable in the original dataset
y – dependent variable in the original dataset
Checksum values are as follows (a short pandas verification sketch follows the list):
Column 1 (Dataset ID): SUM = 4575923; 0 missing values (rows with data: 24902)
Column 2 (independent variable x): SUM = 1413965769; 0 missing values (rows with data: 24902)
Column 3 (dependent variable y): SUM = 2137944719097.652; 0 missing values (rows with data: 24902)
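A quick way to reproduce these checksums, assuming the three-column layout described above (the column names pandas infers are whatever appears in the file header):

```python
import pandas as pd

df = pd.read_csv("Sup_1_Data.csv")

# Verify column sums, missing-value counts, and row counts against the published checksums.
expected_sums = [4575923, 1413965769, 2137944719097.652]
for col, expected in zip(df.columns, expected_sums):
    print(col,
          "| sum =", df[col].sum(),
          "| missing =", df[col].isna().sum(),
          "| rows with data =", df[col].notna().sum(),
          "| expected sum =", expected)
```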
Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:
1- Data Import and Transformation:
2- Data Quality Assessment:
3- Calculating COGS:
4- Discount Analysis:
5- Sales Metrics:
6- Visualization:
7- Report Generation:
Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
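Since the original workflow lives in Excel and Power Query, the sketch below only illustrates, in pandas, the sort of calculations steps 3–5 describe. The workbook name, the column names, and the COGS formula (Sales minus Profit) are assumptions about a typical Superstore-style dataset, not the project's actual logic.

```python
import pandas as pd

orders = pd.read_excel("superstore.xlsx", sheet_name="Orders")  # hypothetical workbook

# Step 3 - COGS: with only Sales and Profit available, approximate COGS as Sales - Profit.
orders["COGS"] = orders["Sales"] - orders["Profit"]

# Step 4 - discount value: Discount is assumed to be stored as a fraction of the sale amount.
orders["Discount Value"] = orders["Sales"] * orders["Discount"]

# Step 5 - a few summary sales metrics by category.
summary = orders.groupby("Category").agg(
    total_sales=("Sales", "sum"),
    total_cogs=("COGS", "sum"),
    total_discount=("Discount Value", "sum"),
)
print(summary)
```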
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
EA-MD-QD is a collection of large monthly and quarterly EA and EA member countries datasets for macroeconomic analysis. The EA member countries covered are: AT, BE, DE, EL, ES, FR, IE, IT, NL, PT.
The formal reference to this dataset is:
Barigozzi, M. and Lissona, C. (2024) "EA-MD-QD: Large Euro Area and Euro Member Countries Datasets for Macroeconomic Research". Zenodo.
Please refer to it when using the data.
Each zip file contains:
- Excel files for the EA and the countries covered, each containing an unbalanced panel of raw de-seasonalized data.
- Matlab code that takes the raw data as input and performs various operations, such as choosing the frequency, filling in missing values, transforming the data to stationarity, and controlling for covid outliers.
- A pdf file with all information about the series names, sources, and transformation codes.
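For readers not using the provided Matlab code, the sketch below shows, in pandas, the kind of operations it automates: filling in missing values and applying stationarity-inducing transformations per series. The transformation-code scheme used here (1 = level, 2 = first difference, 5 = log first difference) is a common FRED-MD-style convention chosen for illustration; the dataset's own codes are documented in its pdf file, and the series names below are hypothetical.

```python
import numpy as np
import pandas as pd

def apply_tcode(series: pd.Series, tcode: int) -> pd.Series:
    """Apply an illustrative transformation code to one series."""
    s = series.interpolate(limit_direction="both")   # simple fill-in of missing values
    if tcode == 1:                                    # level
        return s
    if tcode == 2:                                    # first difference
        return s.diff()
    if tcode == 5:                                    # log first difference (growth rate)
        return np.log(s).diff()
    raise ValueError(f"unsupported transformation code: {tcode}")

# Example: a toy monthly panel with hypothetical series names and codes.
raw = pd.DataFrame({"IP": [100.0, 101.2, np.nan, 103.0], "UNRATE": [6.5, 6.4, 6.3, 6.2]})
tcodes = {"IP": 5, "UNRATE": 2}
stationary = pd.DataFrame({name: apply_tcode(raw[name], code) for name, code in tcodes.items()})
print(stationary)
```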
This version (03.2025):
Updated data as of 28 March 2025. We improved the Matlab code and included a ReadMe file with details on the user-set parameters, which were previously only briefly commented in the code.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset covers China's Shanghai and Shenzhen A-share listed companies from 2007 to 2022, drawing on the Cathay Pacific database and the Dibbo database. Annual reports of the relevant companies were collected from the Juchao information website and analysed with Python to obtain word-frequency counts for digitalisation and greening transformation. Excel was used for data processing to remove missing values, and winsorization (tail trimming) was performed in Stata.
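A minimal illustration of the kind of keyword-frequency text analysis described above; the keyword dictionaries and the sample report text are hypothetical and do not reproduce the study's actual word lists.

```python
import re

# Hypothetical keyword dictionaries for digitalisation and green transformation.
KEYWORDS = {
    "digitalisation": ["big data", "cloud computing", "artificial intelligence", "blockchain"],
    "greening": ["green innovation", "emission reduction", "clean energy", "environmental protection"],
}

def keyword_frequencies(report_text: str) -> dict:
    """Count occurrences of each dictionary's keywords in one annual report."""
    text = report_text.lower()
    return {
        topic: sum(len(re.findall(re.escape(word), text)) for word in words)
        for topic, words in KEYWORDS.items()
    }

# Example with a toy report snippet.
sample = "The company invests in big data, cloud computing and green innovation initiatives."
print(keyword_frequencies(sample))
```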