This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
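As a starting point for the EDA and preprocessing use cases above, the hedged sketch below loads the data and runs a few of the listed steps. The file name synthetic_customer_transactions.csv is an assumption; the column names follow the list above.

```python
import pandas as pd

# Hypothetical file name; adjust to wherever the CSV is stored.
df = pd.read_csv("synthetic_customer_transactions.csv",
                 parse_dates=["Purchase Date"])

# Quick EDA: shape and summary statistics for the amount columns.
print(df.shape)
print(df[["Gross Amount", "Discount Amount (INR)", "Net Amount"]].describe())

# Impact of discounts on spend: average net amount by discount usage.
print(df.groupby("Discount Availed")["Net Amount"].mean())

# Simple preprocessing for ML: encode the binary target and one-hot
# encode a few categorical features.
df["discount_flag"] = (df["Discount Availed"] == "Yes").astype(int)
features = pd.get_dummies(
    df[["Gender", "Age Group", "Product Category", "Purchase Method", "Gross Amount"]],
    drop_first=True,
)
```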
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Daily Machine Learning Practice – 1 Commit per Day
Author: Astrid Villalobos
Location: Montréal, QC
LinkedIn: https://www.linkedin.com/in/astridcvr/
Objective
The goal of this project is to strengthen Machine Learning and data analysis skills through small, consistent daily contributions. Each commit focuses on a specific aspect of data processing, feature engineering, or modeling using Python, Pandas, and Scikit-learn.
Dataset
Source: Kaggle – Sample Sales Data
File: data/sales_data_sample.csv
Variables: ORDERNUMBER, QUANTITYORDERED, PRICEEACH, SALES, COUNTRY, etc.
Goal: Analyze e-commerce performance, predict sales trends, segment customers, and forecast demand.
**Project Rules**
- 🟩 1 Commit per Day: Minimum one line of code daily to ensure consistency and discipline
- 🌍 Bilingual Comments: Code and documentation in English and French
- 📈 Visible Progress: Daily green squares = daily learning

🧰 Tech Stack
- Languages: Python
- Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
- Tools: Jupyter Notebook, GitHub, Kaggle
Learning Outcomes
By the end of this challenge:
- Develop a stronger understanding of data preprocessing, modeling, and evaluation.
- Build consistent coding habits through daily practice.
- Apply ML techniques to real-world sales data scenarios.
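A representative daily exercise, sketched below, loads the sales file named above and fits a simple baseline regression. The latin-1 encoding and the choice of features are assumptions and may need adjusting to the actual file.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Path taken from the project description; encoding is an assumption.
df = pd.read_csv("data/sales_data_sample.csv", encoding="latin-1")

# Predict SALES from quantity and unit price as a first daily exercise.
X = df[["QUANTITYORDERED", "PRICEEACH"]]
y = df["SALES"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(f"R2 on held-out data: {r2_score(y_test, model.predict(X_test)):.3f}")
```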
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
This is the raw medical dataset used in my data cleaning project. It contains original, unprocessed data with missing values, inconsistent formatting, and possible duplicates. This dataset is ideal for practicing data cleaning, preprocessing, and exploratory data analysis (EDA).
Note: This dataset is anonymized and intended for educational purposes only.
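A typical first pass over such a raw file might look like the sketch below. The file name raw_medical_data.csv is hypothetical and the steps are generic, since the actual columns are not described here.

```python
import pandas as pd

# Hypothetical file name; the actual column names depend on the raw export.
raw = pd.read_csv("raw_medical_data.csv")

# Inspect missingness and duplicates before deciding on a cleaning strategy.
print(raw.isna().mean().sort_values(ascending=False))
print("duplicate rows:", raw.duplicated().sum())

# Typical first-pass cleaning: drop exact duplicates, normalise column names,
# and strip stray whitespace from text columns.
clean = raw.drop_duplicates().copy()
clean.columns = clean.columns.str.strip().str.lower().str.replace(" ", "_")
text_cols = clean.select_dtypes(include="object").columns
clean[text_cols] = clean[text_cols].apply(lambda s: s.str.strip())
```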
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This data set is uploaded as supporting information for the publication entitled: "Effect of data preprocessing and machine learning hyperparameters on mass spectrometry imaging models".
Files are as follows:
- polymer_microarray_data.mat – MATLAB workspace file containing peak-picked ToF-SIMS data (hyperspectral array) for the polymer microarray sample.
- nylon_data.mat – MATLAB workspace file containing m/z binned ToF-SIMS data (hyperspectral array) for the semi-synthetic nylon data set, generated from 7 nylon samples.
Additional details about the datasets can be found in the published article. If you use this data set in your work, please cite it as follows:
Gardner et al., J. Vac. Sci. Technol. A 41, 000000 (2023); doi: 10.1116/6.0002788
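To get started outside MATLAB, workspace files like these can usually be opened in Python with scipy.io.loadmat, as in the hedged sketch below. The variable names stored inside the files are not documented here, so the code simply lists them; note that loadmat cannot read MATLAB v7.3 (HDF5) files, in which case h5py would be needed instead.

```python
from scipy.io import loadmat

# File names come from the listing above; the workspace variable names are
# unknown, so inspect the keys before doing anything else.
polymer = loadmat("polymer_microarray_data.mat")
nylon = loadmat("nylon_data.mat")

for name, ws in [("polymer", polymer), ("nylon", nylon)]:
    keys = [k for k in ws if not k.startswith("__")]
    print(name, "variables:", keys)
    for k in keys:
        # Most entries should be NumPy arrays (hyperspectral data cubes).
        print("  ", k, getattr(ws[k], "shape", type(ws[k])))
```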
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for practicing data preprocessing and unsupervised learning in the introduction to bioinformatics course.
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).
The dataset includes:
- Category (Categorical): Product category (A, B, C, D)
- Price (Numerical): Randomized product prices
- Rating (Numerical): Ratings between 1 and 5
- Stock (Categorical): Availability status (In Stock, Out of Stock)
- Discount (Numerical): Discount percentage
This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
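As a quick starting point, the sketch below checks the stated missingness and applies simple imputation. The file name synthetic_missing_values.csv is hypothetical; the column names follow the list above.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical file name; columns follow the dataset description.
df = pd.read_csv("synthetic_missing_values.csv")

# Check that the missingness roughly matches the stated percentages.
print(df.isna().mean().round(2))

# Numerical columns: median imputation; categorical columns: most frequent value.
num_cols = ["Price", "Rating", "Discount"]
cat_cols = ["Category", "Stock"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])
```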
Terms of Service: https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Fruit Classification Dataset is a beginner-level classification dataset for predicting fruit type from fruit name, color, and weight information.
2) Data Utilization
(1) Characteristics of the Fruit Classification Dataset:
• The dataset consists of three columns: the categorical variable Color, the continuous variable Weight, and the target class Fruit, so both categorical and numerical variables must be preprocessed when training classification models.
(2) The Fruit Classification Dataset can be used for:
• Model training and evaluation: it serves as educational and research data for comparing the performance of various machine learning classification algorithms using the color and weight features.
• Data preprocessing practice: it provides hands-on material for basic preprocessing and feature engineering steps such as categorical variable encoding and continuous variable scaling (see the sketch below).
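A minimal preprocessing-and-classification sketch under these assumptions (a CSV named fruit_classification.csv with exactly the columns Color, Weight, and Fruit) could look like this:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical file name; columns follow the description: Color, Weight, Fruit.
df = pd.read_csv("fruit_classification.csv")
X, y = df[["Color", "Weight"]], df["Fruit"]

pre = ColumnTransformer([
    ("color", OneHotEncoder(handle_unknown="ignore"), ["Color"]),  # categorical encoding
    ("weight", StandardScaler(), ["Weight"]),                      # continuous scaling
])
clf = Pipeline([("pre", pre), ("model", LogisticRegression(max_iter=1000))])
print(cross_val_score(clf, X, y, cv=5).mean())
```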
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary materials for the application of DA_2DCHROM – data alignment
https://doi.org/10.5281/zenodo.7040975
Content:
data – folder for raw data
full_dataset_alignment – 100 graphical comparisons of randomly picked pairs from full dataset
graph_results – graphical representation of obtained results
metadata – results of intermediate steps in the data alignment process
data folder:
Subfolder Sample_dataset contains 20 sample chromatograms, each processed at S/N levels of 100, 300, and 500.
full_dataset_alignment:
100 graphical comparisons of randomly picked pairs from the full dataset. The pairs are the same for both algorithms.
graph_results folder:
To reduce the total size of Supplementary materials, only results for S/N level 500 are exported.
Each subfolder (folder names correspond to the names of the algorithms used throughout the study) contains a numerical (K-S test) and graphical representation of the alignment. In the case of a failed alignment (e.g., not enough anchor points for BiPACE2D), the graphs are left blank.
metadata folder:
merged_peaks folder
Folder containing formatted data with merged peaks (results of the preprocessing part of the data_alignment_chromatography_v1.4 script)
ref_data folder
Lists of manually exported reference peaks for each chromatogram. Input data for the RI algorithm.
time_correction folder
Each algorithm subfolder contains the results of the data alignment itself. For each aligned chromatogram there are 3 files: the aligned chromatogram itself (the largest .txt file), the list of detected anchor peaks (.txt with the _anchors suffix), and a simple graphical check of the alignment (.png).
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work, including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics, eXtreme Gradient Boosting was the best-performing algorithm in both classification and regression, with an R² score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.
Methods
Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.
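For readers who want to reproduce the general approach (not the paper's exact pipeline), a minimal gradient-boosted regression sketch might look like the following. The file name diamonds.csv and the assumption that its columns follow the standard Kaggle diamonds layout (carat, cut, color, clarity, depth, table, price, ...) are both hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

# Hypothetical file name and column layout; adjust to the actual data.
df = pd.read_csv("diamonds.csv")
X = pd.get_dummies(df.drop(columns=["price"]), drop_first=True)  # encode cut/color/clarity
y = df["price"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative hyperparameters, not the values used in the study.
model = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=0)
model.fit(X_tr, y_tr)
print(f"R2 on held-out data: {r2_score(y_te, model.predict(X_te)):.4f}")
```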
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Description:
The myusabank.csv dataset contains daily financial data for a fictional bank (MyUSA Bank) over a two-year period. It includes various key financial metrics such as interest income, interest expense, average earning assets, net income, total assets, shareholder equity, operating expenses, operating income, market share, and stock price. The data is structured to simulate realistic scenarios in the banking sector, including outliers, duplicates, and missing values for educational purposes.
Potential Student Tasks:
- Data Cleaning and Preprocessing
- Exploratory Data Analysis (EDA)
- Calculating Key Performance Indicators (KPIs) – see the KPI sketch at the end of this description
- Building Tableau Dashboards
- Forecasting and Predictive Modeling
- Business Insights and Reporting
Educational Goals:
The dataset aims to provide hands-on experience in data preprocessing, analysis, and visualization within the context of banking and finance. It encourages students to apply data science techniques to real-world financial data, enhancing their skills in data-driven decision-making and strategic analysis.
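As a starting point for the KPI task, the sketch below computes a few standard banking ratios from the metrics listed above. The column names (date, interest_income, and so on) are assumptions about how the CSV headers are spelled, and the ratio definitions are the usual textbook ones rather than anything specific to this dataset.

```python
import pandas as pd

# Column names are assumed; adjust to the actual headers in myusabank.csv.
df = pd.read_csv("myusabank.csv", parse_dates=["date"])

kpis = pd.DataFrame({
    # Net interest margin = (interest income - interest expense) / average earning assets
    "net_interest_margin": (df["interest_income"] - df["interest_expense"])
                           / df["average_earning_assets"],
    "return_on_assets": df["net_income"] / df["total_assets"],
    "return_on_equity": df["net_income"] / df["shareholder_equity"],
    "efficiency_ratio": df["operating_expenses"] / df["operating_income"],
})
kpis.index = pd.DatetimeIndex(df["date"])

# Monthly averages, e.g. as an export for a Tableau dashboard.
print(kpis.resample("MS").mean())
```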
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a synthetic version inspired by the original "Stroke Prediction Dataset" on Kaggle. It contains anonymized, artificially generated data intended for research and model training on healthcare-related stroke prediction. The dataset generated using GPT-4o contains 50,000 records and 12 features. The target variable is stroke, a binary classification where 1 represents stroke occurrence and 0 represents no stroke. The dataset includes both numerical and categorical features, requiring preprocessing steps before analysis. A small portion of the entries includes intentionally introduced missing values to allow users to practice various data preprocessing techniques such as imputation, missing data analysis, and cleaning. The dataset is suitable for educational and research purposes, particularly in machine learning tasks related to classification, healthcare analytics, and data cleaning. No real-world patient information was used in creating this dataset.
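A minimal preprocessing-and-classification sketch for such a file might look like the following. Only the target column name, stroke, is taken from the description; the file name synthetic_stroke_data.csv is hypothetical and the feature types are inferred from the dtypes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical file name; the target column "stroke" comes from the description.
df = pd.read_csv("synthetic_stroke_data.csv")
X, y = df.drop(columns=["stroke"]), df["stroke"]

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

# Impute the intentionally missing values, then scale/encode by column type.
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
clf = Pipeline([("pre", pre), ("model", LogisticRegression(max_iter=1000))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf.fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```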
Terms of Service: https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Harry Potter Sorting Dataset contains attributes and Hogwarts house assignments for 1,000 fictional students from the Harry Potter universe, and is designed for machine learning classification exercises such as predicting a student's house from their traits.
2) Data Utilization
(1) Characteristics of the Harry Potter Sorting Dataset:
• Each student record contains the assigned Hogwarts house (House) along with several attribute columns, including name, gender, ancestry, region of origin, personality traits, and magic-related abilities.
• House takes one of four values: Gryffindor, Slytherin, Ravenclaw, and Hufflepuff.
(2) The Harry Potter Sorting Dataset can be used for:
• Developing a house classification model: student attribute data can be used to build a Hogwarts house classification machine learning model and evaluate its prediction accuracy.
• Data science practice and training: it is suitable for exercises in feature selection, data preprocessing, and classification modeling.
License: http://dcat-ap.ch/vocabulary/licenses/terms_by
This dataset contains open vector data for railways, forests and power lines, as well as an open digital elevation model (DEM) for a small area around a sample forest range in Europe (Germany, Upper Bavaria, Kochel Forest Range, some 70 km south of München, at the edge of the Bavarian Alps). The purpose of this dataset is to provide a documented sample dataset in order to demonstrate geospatial preprocessing at FOSS4G2019 based on open data and software. This sample has been produced based on several existing open data sources (detailed below), thereby documenting the sources for obtaining some data needed for computations related to forest accessibility and wood harvesting. For example, they can be used with the open methodology and QGIS plugin Seilaplan for optimising the geometric layout of cable roads, or with additional open software for computing forest accessibility for wood harvesting. The vector data (railways, forests and power lines) was extracted from OpenStreetMap (data copyrighted OpenStreetMap contributors and available from https://www.openstreetmap.org). The railways and forests were downloaded and extracted on 18.05.2019 using the open-source QGIS (https://www.qgis.org) with the QuickOSM plugin, while the power lines were downloaded a couple of days later on 23.05.2019.
Additional notes for vector data: Please note that OpenStreetMap data extracts such as forests, roads and railways (except power lines) can also be downloaded in a GIS-friendly format (Shapefile) from http://download.geofabrik.de/ or using the QGIS built-in download function for OpenStreetMap data. The most efficient way to retrieve specific OSM tags (such as power=line) is to use the QuickOSM plugin for QGIS (using the Overpass API - https://wiki.openstreetmap.org/wiki/Overpass_API) or directly using overpass turbo (https://overpass-turbo.eu/). Finally, the digitised perimeter of the sample forest range is also made available for reproducibility purposes, although any perimeter or area can be digitised freely using the QGIS editing toolbar.
The DEM was originally adapted and modified, also with QGIS (https://www.qgis.org), based on the elevation data available from two different sources, by reprojecting and downsampling the datasets to 25 m and then selecting, for each individual raster cell, the elevation value that was closer to the average. These two different elevation sources are:
This methodology was chosen as a way of performing a basic quality check, by comparing the EU-DEM v.1.1 derived from globally available DEM data (such as SRTM) with more authoritative data for the randomly selected region, since using authoritative data is preferred (if open and available). For other sample regions, where authoritative open data is not available, such comparisons can no longer be performed.
Additional notes DEM: a very good DEM open data source for Germany is the open data set collected and resampled by Sonny (sonnyy7@gmail.com) and made available on the Austrian Open Data Portal http://data.opendataportal.at/dataset/dtm-germany. In order to simplify end-to-end reproducibility of the paper planned for FOSS4G2019, we use and distribute an adapted (reprojected and resampled to 25 meters) sample of the above mentioned dataset for the selected forest range.
This sample dataset is accompanied by software in Python, as a Jupyter Notebook that generates harmonized output rasters with the same extent from the input data. The extent is given by the polygon vector dataset (Perimeter). These output rasters, such as obstacles, aspect, slope, and forest cover, can serve as input data for later computations related to forest accessibility and wood harvesting questions. The obstacles output is obtained by transforming line vector datasets (railway lines, high-voltage power lines) to raster. Aspect and slope are both derived from the sample digital elevation model.
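As an illustration of that workflow (not the notebook shipped with the dataset), the minimal Python sketch below rasterises a line layer onto the DEM grid and derives slope and aspect from the 25 m DEM. The file names (dem_25m.tif, railways.gpkg) and the finite-difference convention used for slope and aspect are assumptions.

```python
import geopandas as gpd
import numpy as np
import rasterio
from rasterio import features

# Hypothetical file names; the actual layer and raster names may differ.
with rasterio.open("dem_25m.tif") as src:
    dem = src.read(1).astype(float)
    transform, crs, shape = src.transform, src.crs, (src.height, src.width)

# Rasterise line obstacles (e.g. railways, power lines) onto the DEM grid.
railways = gpd.read_file("railways.gpkg").to_crs(crs)
obstacles = features.rasterize(
    ((geom, 1) for geom in railways.geometry),
    out_shape=shape, transform=transform, fill=0, dtype="uint8")

# Slope and aspect from the 25 m DEM via finite differences (one of several
# common conventions; the accompanying notebook may use a different one).
dzdy, dzdx = np.gradient(dem, 25.0)
slope_deg = np.degrees(np.arctan(np.hypot(dzdx, dzdy)))
aspect_deg = np.degrees(np.arctan2(-dzdx, dzdy)) % 360.0
```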
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This repository contains a sample of the input data for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy". It allows the user to test and train the models on a reduced dataset (45GB).
This sample dataset comprises ~3 years of normalized hourly data for both low-resolution predictors and high-resolution target variables. Data has been randomly picked from the whole dataset, from 2000 to 2020, with 70% of data coming from the original training dataset, 15% from the original validation dataset, and 15% from the original test dataset. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.
This sample dataset also includes files related to metadata, static data, normalization, and plotting.
To use the data, clone the corresponding repository and unzip this zip file in the data folder.
In this study, we employed various machine learning models to predict metabolic phenotypes, focusing on thyroid function, using a dataset from the National Health and Nutrition Examination Survey (NHANES) from 2007 to 2012. Our analysis utilized laboratory parameters relevant to thyroid function or metabolic dysregulation in addition to demographic features, aiming to uncover potential associations between thyroid function and metabolic phenotypes by various machine learning methods. Multinomial Logistic Regression performed best at identifying the relationship between thyroid function and metabolic phenotypes, achieving an area under the receiver operating characteristic curve (AUROC) of 0.818, followed closely by a Neural Network (AUROC: 0.814). The performance of Random Forest, Boosted Trees, and K-Nearest Neighbors was inferior to the first two methods (AUROC 0.811, 0.811, and 0.786, respectively). In the Random Forest model, homeostatic model assessment for insulin resistance, serum uric acid, serum albumin, gamma glutamyl transferase, and the triiodothyronine/thyroxine ratio ranked among the most important variables. These results highlight the potential of machine learning in understanding complex relationships in health data. However, it is important to note that model performance may vary depending on data characteristics and specific requirements. Furthermore, we emphasize the significance of accounting for sampling weights in complex survey data analysis and the potential benefits of incorporating additional variables to enhance model accuracy and insights. Future research can explore advanced methodologies combining machine learning, sample weights, and expanded variable sets to further advance survey data analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Assessment of the sufficiency of information provided for reproducibility.
Terms of Service: https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Random Sample of NIH Chest X-ray Dataset is a sample version of a large public medical imaging dataset containing 112,120 chest X-ray images and 15 disease (or normal) labels collected from 30,805 patients.
2) Data Utilization
(1) Characteristics of the Random Sample of NIH Chest X-ray Dataset:
• Each sample comes with detailed metadata such as image file name, disease label, patient ID, age, gender, view position, and image size; the labels were extracted from the radiology reports with NLP and are reported to be more than 90% accurate.
• The sample contains 5,606 images of size 1024x1024, covering 14 diseases plus a 'No Finding' class, but by the nature of the sampling some disease classes are very scarce.
(2) The Random Sample of NIH Chest X-ray Dataset can be used to:
• Develop chest disease image reading AI: deep learning based automatic diagnosis and classification models can be trained and evaluated on X-ray images with various chest disease labels.
• Research medical image preprocessing and labelling: it can support research and algorithm development on automatic labelling of large medical image datasets, data quality evaluation, and weakly supervised learning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of original and reimplementation results of Liu paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of original and reimplementation results of Wang paper.
Privacy Policy: https://www.datainsightsmarket.com/privacy-policy
Explore the dynamic Pharmaceutical Sample Preprocessing System market with insights on growth drivers, trends, restraints, and regional analysis. Discover market size projections and key players shaping the future of drug discovery and diagnostics.