This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
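As a starting point for the EDA and preprocessing use cases above, the hedged sketch below loads the data and runs a few of the listed steps. The file name synthetic_customer_transactions.csv is an assumption; the column names follow the list above.

```python
import pandas as pd

# Hypothetical file name; adjust to wherever the CSV is stored.
df = pd.read_csv("synthetic_customer_transactions.csv",
                 parse_dates=["Purchase Date"])

# Quick EDA: shape and summary statistics for the amount columns.
print(df.shape)
print(df[["Gross Amount", "Discount Amount (INR)", "Net Amount"]].describe())

# Impact of discounts on spend: average net amount by discount usage.
print(df.groupby("Discount Availed")["Net Amount"].mean())

# Simple preprocessing for ML: encode the binary target and one-hot
# encode a few categorical features.
df["discount_flag"] = (df["Discount Availed"] == "Yes").astype(int)
features = pd.get_dummies(
    df[["Gender", "Age Group", "Product Category", "Purchase Method", "Gross Amount"]],
    drop_first=True,
)
```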
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Daily Machine Learning Practice – 1 Commit per Day
Author: Astrid Villalobos
Location: Montréal, QC
LinkedIn: https://www.linkedin.com/in/astridcvr/
Objective
The goal of this project is to strengthen Machine Learning and data analysis skills through small, consistent daily contributions. Each commit focuses on a specific aspect of data processing, feature engineering, or modeling using Python, Pandas, and Scikit-learn.
Dataset
Source: Kaggle – Sample Sales Data
File: data/sales_data_sample.csv
Variables: ORDERNUMBER, QUANTITYORDERED, PRICEEACH, SALES, COUNTRY, etc.
Goal: Analyze e-commerce performance, predict sales trends, segment customers, and forecast demand.
**Project Rules**
- 🟩 1 Commit per Day: Minimum one line of code daily to ensure consistency and discipline
- 🌍 Bilingual Comments: Code and documentation in English and French
- 📈 Visible Progress: Daily green squares = daily learning

🧰 Tech Stack
- Languages: Python
- Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
- Tools: Jupyter Notebook, GitHub, Kaggle
Learning Outcomes
By the end of this challenge:
- Develop a stronger understanding of data preprocessing, modeling, and evaluation.
- Build consistent coding habits through daily practice.
- Apply ML techniques to real-world sales data scenarios.
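A representative daily exercise, sketched below, loads the sales file named above and fits a simple baseline regression. The latin-1 encoding and the choice of features are assumptions and may need adjusting to the actual file.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Path taken from the project description; encoding is an assumption.
df = pd.read_csv("data/sales_data_sample.csv", encoding="latin-1")

# Predict SALES from quantity and unit price as a first daily exercise.
X = df[["QUANTITYORDERED", "PRICEEACH"]]
y = df["SALES"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(f"R2 on held-out data: {r2_score(y_test, model.predict(X_test)):.3f}")
```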
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
This is the raw medical dataset used in my data cleaning project. It contains original, unprocessed data with missing values, inconsistent formatting, and possible duplicates. This dataset is ideal for practicing data cleaning, preprocessing, and exploratory data analysis (EDA).
Note: This dataset is anonymized and intended for educational purposes only.
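A typical first pass over such a raw file might look like the sketch below. The file name raw_medical_data.csv is hypothetical and the steps are generic, since the actual columns are not described here.

```python
import pandas as pd

# Hypothetical file name; the actual column names depend on the raw export.
raw = pd.read_csv("raw_medical_data.csv")

# Inspect missingness and duplicates before deciding on a cleaning strategy.
print(raw.isna().mean().sort_values(ascending=False))
print("duplicate rows:", raw.duplicated().sum())

# Typical first-pass cleaning: drop exact duplicates, normalise column names,
# and strip stray whitespace from text columns.
clean = raw.drop_duplicates().copy()
clean.columns = clean.columns.str.strip().str.lower().str.replace(" ", "_")
text_cols = clean.select_dtypes(include="object").columns
clean[text_cols] = clean[text_cols].apply(lambda s: s.str.strip())
```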
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This data set is uploaded as supporting information for the publication entitled: "Effect of data preprocessing and machine learning hyperparameters on mass spectrometry imaging models".
Files are as follows:
- polymer_microarray_data.mat – MATLAB workspace file containing peak-picked ToF-SIMS data (hyperspectral array) for the polymer microarray sample.
- nylon_data.mat – MATLAB workspace file containing m/z binned ToF-SIMS data (hyperspectral array) for the semi-synthetic nylon data set, generated from 7 nylon samples.
Additional details about the datasets can be found in the published article. If you use this data set in your work, please cite it as follows:
Gardner et al., J. Vac. Sci. Technol. A 41, 000000 (2023); doi: 10.1116/6.0002788
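To get started outside MATLAB, workspace files like these can usually be opened in Python with scipy.io.loadmat, as in the hedged sketch below. The variable names stored inside the files are not documented here, so the code simply lists them; note that loadmat cannot read MATLAB v7.3 (HDF5) files, in which case h5py would be needed instead.

```python
from scipy.io import loadmat

# File names come from the listing above; the workspace variable names are
# unknown, so inspect the keys before doing anything else.
polymer = loadmat("polymer_microarray_data.mat")
nylon = loadmat("nylon_data.mat")

for name, ws in [("polymer", polymer), ("nylon", nylon)]:
    keys = [k for k in ws if not k.startswith("__")]
    print(name, "variables:", keys)
    for k in keys:
        # Most entries should be NumPy arrays (hyperspectral data cubes).
        print("  ", k, getattr(ws[k], "shape", type(ws[k])))
```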
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for practicing data preprocessing and unsupervised learning in the introduction to bioinformatics course.
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).
The dataset includes:
- Category (Categorical): Product category (A, B, C, D)
- Price (Numerical): Randomized product prices
- Rating (Numerical): Ratings between 1 and 5
- Stock (Categorical): Availability status (In Stock, Out of Stock)
- Discount (Numerical): Discount percentage
This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
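As a quick starting point, the sketch below checks the stated missingness and applies simple imputation. The file name synthetic_missing_values.csv is hypothetical; the column names follow the list above.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical file name; columns follow the dataset description.
df = pd.read_csv("synthetic_missing_values.csv")

# Check that the missingness roughly matches the stated percentages.
print(df.isna().mean().round(2))

# Numerical columns: median imputation; categorical columns: most frequent value.
num_cols = ["Price", "Rating", "Discount"]
cat_cols = ["Category", "Stock"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])
```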
Terms of Service: https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Fruit Classification Dataset is a beginner-level classification dataset for predicting fruit type from fruit name, color, and weight information.
2) Data Utilization
(1) Characteristics of the Fruit Classification Dataset:
• The dataset consists of three columns: the categorical variable Color, the continuous variable Weight, and the target class Fruit, so both categorical and numerical variables must be preprocessed when training classification models.
(2) The Fruit Classification Dataset can be used for:
• Model training and evaluation: it serves as educational and research data for comparing the performance of various machine learning classification algorithms using the color and weight features.
• Data preprocessing practice: it provides hands-on material for basic preprocessing and feature engineering steps such as categorical variable encoding and continuous variable scaling (see the sketch below).
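A minimal preprocessing-and-classification sketch under these assumptions (a CSV named fruit_classification.csv with exactly the columns Color, Weight, and Fruit) could look like this:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical file name; columns follow the description: Color, Weight, Fruit.
df = pd.read_csv("fruit_classification.csv")
X, y = df[["Color", "Weight"]], df["Fruit"]

pre = ColumnTransformer([
    ("color", OneHotEncoder(handle_unknown="ignore"), ["Color"]),  # categorical encoding
    ("weight", StandardScaler(), ["Weight"]),                      # continuous scaling
])
clf = Pipeline([("pre", pre), ("model", LogisticRegression(max_iter=1000))])
print(cross_val_score(clf, X, y, cv=5).mean())
```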
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary materials for the application of DA_2DCHROM – data alignment
https://doi.org/10.5281/zenodo.7040975
Content:
data – folder for raw data
full_dataset_alignment – 100 graphical comparisons of randomly picked pairs from full dataset
graph_results – graphical representation of obtained results
metadata – results of intermediate steps in the data alignment process
data folder:
Subfolder Sample_dataset contains 20 sample chromatograms, each processed at S/N levels of 100, 300, and 500.
full_dataset_alignment:
100 graphical comparisons of randomly picked pairs from the full dataset. The pairs are the same for both algorithms.
graph_results folder:
To reduce the total size of Supplementary materials, only results for S/N level 500 are exported.
Each subfolder (folder names correspond to the names of the algorithms used throughout the study) contains a numerical (K-S test) and graphical representation of the alignment. In the case of a failed alignment (e.g., not enough anchor points for BiPACE2D), the graphs are left blank.
metadata folder:
merged_peaks folder
Folder containing formatted data with merged peaks (results of the preprocessing part of the data_alignment_chromatography_v1.4 script)
ref_data folder
Lists of manually exported reference peaks for each chromatogram. Input data for the RI algorithm.
time_correction folder
Each algorithm subfolder contains the results of the data alignment itself. For each aligned chromatogram there are 3 files: the aligned chromatogram itself (the largest .txt file), the list of detected anchor peaks (.txt with the _anchors suffix), and a simple graphical check of the alignment (.png).
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work, including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics, eXtreme Gradient Boosting was the best-performing algorithm in both classification and regression, with an R² score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.
Methods
Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.
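For readers who want to reproduce the general approach (not the paper's exact pipeline), a minimal gradient-boosted regression sketch might look like the following. The file name diamonds.csv and the assumption that its columns follow the standard Kaggle diamonds layout (carat, cut, color, clarity, depth, table, price, ...) are both hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

# Hypothetical file name and column layout; adjust to the actual data.
df = pd.read_csv("diamonds.csv")
X = pd.get_dummies(df.drop(columns=["price"]), drop_first=True)  # encode cut/color/clarity
y = df["price"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative hyperparameters, not the values used in the study.
model = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=0)
model.fit(X_tr, y_tr)
print(f"R2 on held-out data: {r2_score(y_te, model.predict(X_te)):.4f}")
```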
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Description:
The myusabank.csv dataset contains daily financial data for a fictional bank (MyUSA Bank) over a two-year period. It includes various key financial metrics such as interest income, interest expense, average earning assets, net income, total assets, shareholder equity, operating expenses, operating income, market share, and stock price. The data is structured to simulate realistic scenarios in the banking sector, including outliers, duplicates, and missing values for educational purposes.
Potential Student Tasks:
- Data Cleaning and Preprocessing
- Exploratory Data Analysis (EDA)
- Calculating Key Performance Indicators (KPIs) – see the KPI sketch at the end of this description
- Building Tableau Dashboards
- Forecasting and Predictive Modeling
- Business Insights and Reporting
Educational Goals:
The dataset aims to provide hands-on experience in data preprocessing, analysis, and visualization within the context of banking and finance. It encourages students to apply data science techniques to real-world financial data, enhancing their skills in data-driven decision-making and strategic analysis.
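As a starting point for the KPI task, the sketch below computes a few standard banking ratios from the metrics listed above. The column names (date, interest_income, and so on) are assumptions about how the CSV headers are spelled, and the ratio definitions are the usual textbook ones rather than anything specific to this dataset.

```python
import pandas as pd

# Column names are assumed; adjust to the actual headers in myusabank.csv.
df = pd.read_csv("myusabank.csv", parse_dates=["date"])

kpis = pd.DataFrame({
    # Net interest margin = (interest income - interest expense) / average earning assets
    "net_interest_margin": (df["interest_income"] - df["interest_expense"])
                           / df["average_earning_assets"],
    "return_on_assets": df["net_income"] / df["total_assets"],
    "return_on_equity": df["net_income"] / df["shareholder_equity"],
    "efficiency_ratio": df["operating_expenses"] / df["operating_income"],
})
kpis.index = pd.DatetimeIndex(df["date"])

# Monthly averages, e.g. as an export for a Tableau dashboard.
print(kpis.resample("MS").mean())
```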
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a synthetic version inspired by the original "Stroke Prediction Dataset" on Kaggle. It contains anonymized, artificially generated data intended for research and model training on healthcare-related stroke prediction. The dataset generated using GPT-4o contains 50,000 records and 12 features. The target variable is stroke, a binary classification where 1 represents stroke occurrence and 0 represents no stroke. The dataset includes both numerical and categorical features, requiring preprocessing steps before analysis. A small portion of the entries includes intentionally introduced missing values to allow users to practice various data preprocessing techniques such as imputation, missing data analysis, and cleaning. The dataset is suitable for educational and research purposes, particularly in machine learning tasks related to classification, healthcare analytics, and data cleaning. No real-world patient information was used in creating this dataset.
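A minimal preprocessing-and-classification sketch for such a file might look like the following. Only the target column name, stroke, is taken from the description; the file name synthetic_stroke_data.csv is hypothetical and the feature types are inferred from the dtypes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical file name; the target column "stroke" comes from the description.
df = pd.read_csv("synthetic_stroke_data.csv")
X, y = df.drop(columns=["stroke"]), df["stroke"]

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

# Impute the intentionally missing values, then scale/encode by column type.
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
clf = Pipeline([("pre", pre), ("model", LogisticRegression(max_iter=1000))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf.fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```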
Terms of Service: https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Harry Potter Sorting Dataset contains attributes and Hogwarts house assignments for 1,000 fictional students from the Harry Potter universe, and is designed for machine learning classification exercises such as predicting a student's house from their traits.
2) Data Utilization
(1) Characteristics of the Harry Potter Sorting Dataset:
• Each student record contains the assigned Hogwarts house (House) along with several attribute columns, including name, gender, ancestry, region of origin, personality traits, and magic-related abilities.
• House takes one of four values: Gryffindor, Slytherin, Ravenclaw, and Hufflepuff.
(2) The Harry Potter Sorting Dataset can be used for:
• Developing a house classification model: student attribute data can be used to build a Hogwarts house classification machine learning model and evaluate its prediction accuracy.
• Data science practice and training: it is suitable for exercises in feature selection, data preprocessing, and classification modeling.
License: http://dcat-ap.ch/vocabulary/licenses/terms_by
This dataset contains open vector data for railways, forests and power lines, as well as an open digital elevation model (DEM) for a small area around a sample forest range in Europe (Germany, Upper Bavaria, Kochel Forest Range, some 70 km south of München, at the edge of the Bavarian Alps). The purpose of this dataset is to provide a documented sample dataset in order to demonstrate geospatial preprocessing at FOSS4G2019 based on open data and software. This sample has been produced based on several existing open data sources (detailed below), thereby documenting the sources for obtaining some data needed for computations related to forest accessibility and wood harvesting. For example, they can be used with the open methodology and QGIS plugin Seilaplan for optimising the geometric layout of cable roads, or with additional open software for computing forest accessibility for wood harvesting. The vector data (railways, forests and power lines) was extracted from OpenStreetMap (data copyrighted OpenStreetMap contributors and available from https://www.openstreetmap.org). The railways and forests were downloaded and extracted on 18.05.2019 using the open-source QGIS (https://www.qgis.org) with the QuickOSM plugin, while the power lines were downloaded a couple of days later on 23.05.2019.
Additional notes for vector data: Please note that OpenStreetMap data extracts such as forests, roads and railways (except power lines) can also be downloaded in a GIS-friendly format (Shapefile) from http://download.geofabrik.de/ or using the QGIS built-in download function for OpenStreetMap data. The most efficient way to retrieve specific OSM tags (such as power=line) is to use the QuickOSM plugin for QGIS (using the Overpass API - https://wiki.openstreetmap.org/wiki/Overpass_API) or directly using overpass turbo (https://overpass-turbo.eu/). Finally, the digitised perimeter of the sample forest range is also made available for reproducibility purposes, although any perimeter or area can be digitised freely using the QGIS editing toolbar.
The DEM was originally adapted and modified, also with QGIS (https://www.qgis.org), based on the elevation data available from two different sources, by reprojecting and downsampling the datasets to 25 m and then selecting, for each individual raster cell, the elevation value that was closer to the average. These two different elevation sources are:
This methodology was chosen as a way of performing a basic quality check, by comparing the EU-DEM v.1.1 derived from globally available DEM data (such as SRTM) with more authoritative data for the randomly selected region, since using authoritative data is preferred (if open and available). For other sample regions, where authoritative open data is not available, such comparisons can no longer be performed.
Additional notes DEM: a very good DEM open data source for Germany is the open data set collected and resampled by Sonny (sonnyy7@gmail.com) and made available on the Austrian Open Data Portal http://data.opendataportal.at/dataset/dtm-germany. In order to simplify end-to-end reproducibility of the paper planned for FOSS4G2019, we use and distribute an adapted (reprojected and resampled to 25 meters) sample of the above mentioned dataset for the selected forest range.
This sample dataset is accompanied by software in Python, as a Jupyter Notebook that generates harmonized output rasters with the same extent from the input data. The extent is given by the polygon vector dataset (Perimeter). These output rasters, such as obstacles, aspect, slope, and forest cover, can serve as input data for later computations related to forest accessibility and wood harvesting questions. The obstacles output is obtained by transforming line vector datasets (railway lines, high-voltage power lines) to raster. Aspect and slope are both derived from the sample digital elevation model.
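As an illustration of that workflow (not the notebook shipped with the dataset), the minimal Python sketch below rasterises a line layer onto the DEM grid and derives slope and aspect from the 25 m DEM. The file names (dem_25m.tif, railways.gpkg) and the finite-difference convention used for slope and aspect are assumptions.

```python
import geopandas as gpd
import numpy as np
import rasterio
from rasterio import features

# Hypothetical file names; the actual layer and raster names may differ.
with rasterio.open("dem_25m.tif") as src:
    dem = src.read(1).astype(float)
    transform, crs, shape = src.transform, src.crs, (src.height, src.width)

# Rasterise line obstacles (e.g. railways, power lines) onto the DEM grid.
railways = gpd.read_file("railways.gpkg").to_crs(crs)
obstacles = features.rasterize(
    ((geom, 1) for geom in railways.geometry),
    out_shape=shape, transform=transform, fill=0, dtype="uint8")

# Slope and aspect from the 25 m DEM via finite differences (one of several
# common conventions; the accompanying notebook may use a different one).
dzdy, dzdx = np.gradient(dem, 25.0)
slope_deg = np.degrees(np.arctan(np.hypot(dzdx, dzdy)))
aspect_deg = np.degrees(np.arctan2(-dzdx, dzdy)) % 360.0
```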
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This repository contains a sample of the input data for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy". It allows the user to test and train the models on a reduced dataset (45GB).
This sample dataset comprises ~3 years of normalized hourly data for both low-resolution predictors and high-resolution target variables. Data has been randomly picked from the whole dataset, from 2000 to 2020, with 70% of data coming from the original training dataset, 15% from the original validation dataset, and 15% from the original test dataset. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.
This sample dataset also includes files related to metadata, static data, normalization, and plotting.
To use the data, clone the corresponding repository and unzip this zip file in the data folder.
In this study, we employed various machine learning models to predict metabolic phenotypes, focusing on thyroid function, using a dataset from the National Health and Nutrition Examination Survey (NHANES) from 2007 to 2012. Our analysis utilized laboratory parameters relevant to thyroid function or metabolic dysregulation in addition to demographic features, aiming to uncover potential associations between thyroid function and metabolic phenotypes by various machine learning methods. Multinomial Logistic Regression performed best at identifying the relationship between thyroid function and metabolic phenotypes, achieving an area under the receiver operating characteristic curve (AUROC) of 0.818, followed closely by a Neural Network (AUROC: 0.814). The performance of Random Forest, Boosted Trees, and K-Nearest Neighbors was inferior to the first two methods (AUROC 0.811, 0.811, and 0.786, respectively). In the Random Forest model, homeostatic model assessment for insulin resistance, serum uric acid, serum albumin, gamma glutamyl transferase, and the triiodothyronine/thyroxine ratio ranked among the most important variables. These results highlight the potential of machine learning in understanding complex relationships in health data. However, it is important to note that model performance may vary depending on data characteristics and specific requirements. Furthermore, we emphasize the significance of accounting for sampling weights in complex survey data analysis and the potential benefits of incorporating additional variables to enhance model accuracy and insights. Future research can explore advanced methodologies combining machine learning, sample weights, and expanded variable sets to further advance survey data analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Assessment of the sufficiency of information provided for reproducibility.
Terms of Service: https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Random Sample of NIH Chest X-ray Dataset is a sample version of a large public medical imaging dataset containing 112,120 chest X-ray images and 15 disease (or normal) labels collected from 30,805 patients.
2) Data Utilization
(1) Characteristics of the Random Sample of NIH Chest X-ray Dataset:
• Each sample comes with detailed metadata such as image file name, disease label, patient ID, age, gender, view position, and image size; the labels were extracted from the radiology reports with NLP and are reported to be more than 90% accurate.
• The sample contains 5,606 images of size 1024x1024, covering 14 diseases plus a 'No Finding' class, but by the nature of the sampling some disease classes are very scarce.
(2) The Random Sample of NIH Chest X-ray Dataset can be used to:
• Develop chest disease image reading AI: deep learning based automatic diagnosis and classification models can be trained and evaluated on X-ray images with various chest disease labels.
• Research medical image preprocessing and labelling: it can support research and algorithm development on automatic labelling of large medical image datasets, data quality evaluation, and weakly supervised learning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of original and reimplementation results of Liu paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of original and reimplementation results of Wang paper.
Privacy Policy: https://www.datainsightsmarket.com/privacy-policy
Explore the dynamic Pharmaceutical Sample Preprocessing System market with insights on growth drivers, trends, restraints, and regional analysis. Discover market size projections and key players shaping the future of drug discovery and diagnostics.