100+ datasets found
  1. Ecommerce Dataset for Data Analysis

    • kaggle.com
    zip
    Updated Sep 19, 2024
    Cite
    Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
    Explore at:
    Available download formats: zip (2028853 bytes)
    Dataset updated
    Sep 19, 2024
    Authors
    Shrishti Manja
    Description

    This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

    About the Dataset:
    - CID (Customer ID): A unique identifier for each customer.
    - TID (Transaction ID): A unique identifier for each transaction.
    - Gender: The gender of the customer, categorized as Male or Female.
    - Age Group: Age group of the customer, divided into several ranges.
    - Purchase Date: The timestamp of when the transaction took place.
    - Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
    - Discount Availed: Indicates whether the customer availed any discount (Yes/No).
    - Discount Name: Name of the discount applied (e.g., FESTIVE50).
    - Discount Amount (INR): The amount of discount availed by the customer.
    - Gross Amount: The total amount before applying any discount.
    - Net Amount: The final amount after applying the discount.
    - Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
    - Location: The city where the purchase took place.

    Use Cases:
    1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
    2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
    3. Data Visualization: Use tools like Python's Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
    4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
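    A minimal pandas sketch of use cases 1 and 2, using column names from the field list above; the rows are synthetic stand-ins, not actual dataset entries (the real file would be loaded with pd.read_csv from the Kaggle download):

    ```python
    import pandas as pd

    # Illustrative rows mirroring the described schema (not real dataset entries).
    df = pd.DataFrame({
        "CID": [1001, 1002, 1003, 1004],
        "Gender": ["Male", "Female", "Female", "Male"],
        "Product Category": ["Electronics", "Apparel", "Electronics", "Apparel"],
        "Discount Availed": ["Yes", "No", "Yes", "No"],
        "Gross Amount": [1200.0, 450.0, 800.0, 300.0],
        "Net Amount": [1000.0, 450.0, 650.0, 300.0],
    })

    # Use case 1: summary statistic per product category.
    spend = df.groupby("Product Category")["Net Amount"].mean()

    # Use case 2: effect of discounts on the gap between gross and net amounts.
    df["Discount Gap"] = df["Gross Amount"] - df["Net Amount"]
    gap = df.groupby("Discount Availed")["Discount Gap"].mean()

    print(spend)
    print(gap)
    ```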

    This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

    This is not a real dataset; it was generated using Python's Faker library for the sole purpose of learning.

  2. Daily Machine Learning Practice

    • kaggle.com
    zip
    Updated Nov 9, 2025
    Cite
    Astrid Villalobos (2025). Daily Machine Learning Practice [Dataset]. https://www.kaggle.com/datasets/astridvillalobos/daily-machine-learning-practice
    Explore at:
    Available download formats: zip (1019861 bytes)
    Dataset updated
    Nov 9, 2025
    Authors
    Astrid Villalobos
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Daily Machine Learning Practice – 1 Commit per Day

    Author: Astrid Villalobos
    Location: Montréal, QC
    LinkedIn: https://www.linkedin.com/in/astridcvr/

    Objective: The goal of this project is to strengthen Machine Learning and data analysis skills through small, consistent daily contributions. Each commit focuses on a specific aspect of data processing, feature engineering, or modeling using Python, Pandas, and Scikit-learn.

    Dataset
    - Source: Kaggle – Sample Sales Data
    - File: data/sales_data_sample.csv
    - Variables: ORDERNUMBER, QUANTITYORDERED, PRICEEACH, SALES, COUNTRY, etc.
    - Goal: Analyze e-commerce performance, predict sales trends, segment customers, and forecast demand.

    Project Rules
    - 🟩 1 Commit per Day: minimum one line of code daily to ensure consistency and discipline
    - 🌍 Bilingual Comments: code and documentation in English and French
    - 📈 Visible Progress: daily green squares = daily learning

    🧰 Tech Stack
    - Languages: Python
    - Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
    - Tools: Jupyter Notebook, GitHub, Kaggle

    Learning Outcomes
    By the end of this challenge:
    - Develop a stronger understanding of data preprocessing, modeling, and evaluation.
    - Build consistent coding habits through daily practice.
    - Apply ML techniques to real-world sales data scenarios.
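    A commit-sized pandas step against the described columns might look like the sketch below; the rows are invented, and in the project itself the data would come from data/sales_data_sample.csv:

    ```python
    import pandas as pd

    # Toy rows shaped like the described sales file (invented, not from the CSV).
    orders = pd.DataFrame({
        "ORDERNUMBER": [10107, 10121, 10134, 10145],
        "QUANTITYORDERED": [30, 34, 41, 45],
        "PRICEEACH": [95.70, 81.35, 94.74, 83.26],
        "COUNTRY": ["USA", "France", "France", "USA"],
    })
    orders["SALES"] = orders["QUANTITYORDERED"] * orders["PRICEEACH"]

    # One small daily step: revenue per country, highest first.
    revenue = orders.groupby("COUNTRY")["SALES"].sum().sort_values(ascending=False)
    print(revenue)
    ```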

  3. Raw Medical Dataset for Cleaning Practice

    • kaggle.com
    zip
    Updated Jul 5, 2025
    Cite
    Aamir Shahzad (2025). Raw Medical Dataset for Cleaning Practice [Dataset]. https://www.kaggle.com/datasets/aamir5659/raw-medical-dataset-for-cleaning-practice/code
    Explore at:
    Available download formats: zip (1668 bytes)
    Dataset updated
    Jul 5, 2025
    Authors
    Aamir Shahzad
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the raw medical dataset used in my data cleaning project. It contains original, unprocessed data with missing values, inconsistent formatting, and possible duplicates. This dataset is ideal for practicing data cleaning, preprocessing, and exploratory data analysis (EDA).

    Note: This dataset is anonymized and intended for educational purposes only.
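    The cleaning tasks the description names (duplicates, inconsistent formatting, missing values) can be sketched in pandas; the rows and column names below are hypothetical stand-ins, not taken from the actual file:

    ```python
    import pandas as pd
    import numpy as np

    # Hypothetical raw rows exhibiting the described problems.
    raw = pd.DataFrame({
        "patient_id": [1, 2, 2, 3],                      # duplicate record
        "gender": ["M", "male", "male", None],           # inconsistent formatting
        "weight_kg": [70.0, np.nan, np.nan, 82.5],       # missing values
    })

    clean = raw.drop_duplicates(subset="patient_id").copy()
    # Harmonize categorical coding, then flag what is still missing.
    clean["gender"] = clean["gender"].replace({"male": "M", "female": "F"}).fillna("unknown")
    # Impute missing numeric values with the column median.
    clean["weight_kg"] = clean["weight_kg"].fillna(clean["weight_kg"].median())
    print(clean)
    ```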

  4. Data set for article: Effect of data preprocessing and machine learning...

    • opal.latrobe.edu.au
    • researchdata.edu.au
    hdf
    Updated Mar 7, 2024
    Cite
    Wil Gardner (2024). Data set for article: Effect of data preprocessing and machine learning hyperparameters on mass spectrometry imaging models [Dataset]. http://doi.org/10.26181/22671022.v1
    Explore at:
    Available download formats: hdf
    Dataset updated
    Mar 7, 2024
    Dataset provided by
    La Trobe
    Authors
    Wil Gardner
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This data set is uploaded as supporting information for the publication entitled: Effect of data preprocessing and machine learning hyperparameters on mass spectrometry imaging models.

    Files are as follows:
    - polymer_microarray_data.mat: MATLAB workspace file containing peak-picked ToF-SIMS data (hyperspectral array) for the polymer microarray sample.
    - nylon_data.mat: MATLAB workspace file containing m/z binned ToF-SIMS data (hyperspectral array) for the semi-synthetic nylon data set, generated from 7 nylon samples.

    Additional details about the datasets can be found in the published article. If you use this data set in your work, please cite our work as follows: Gardner et al., J. Vac. Sci. Technol. A 41, 000000 (2023); doi: 10.1116/6.0002788

  5. Dataset for practice session 1 in bioinformatics

    • figshare.com
    txt
    Updated Jul 17, 2016
    Cite
    Elena Sugis (2016). Dataset for practice session 1 in bioinformatics [Dataset]. http://doi.org/10.6084/m9.figshare.3490211.v3
    Explore at:
    Available download formats: txt
    Dataset updated
    Jul 17, 2016
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Elena Sugis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for the practice session on data preprocessing and unsupervised learning in the Introduction to Bioinformatics course.

  6. Retail Product Dataset with Missing Values

    • kaggle.com
    zip
    Updated Feb 17, 2025
    Cite
    Himel Sarder (2025). Retail Product Dataset with Missing Values [Dataset]. https://www.kaggle.com/datasets/himelsarder/retail-product-dataset-with-missing-values
    Explore at:
    Available download formats: zip (47826 bytes)
    Dataset updated
    Feb 17, 2025
    Authors
    Himel Sarder
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).

    The dataset includes:
    - Category (Categorical): Product category (A, B, C, D)
    - Price (Numerical): Randomized product prices
    - Rating (Numerical): Ratings between 1 and 5
    - Stock (Categorical): Availability status (In Stock, Out of Stock)
    - Discount (Numerical): Discount percentage

    This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
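    A minimal imputation sketch for the described schema (median for numeric columns, mode for categorical ones); the rows and the missingness pattern here are invented for illustration:

    ```python
    import pandas as pd
    import numpy as np

    # Small frame mimicking the five described columns, with injected gaps.
    df = pd.DataFrame({
        "Category": ["A", None, "B", "A", None],
        "Price":    [10.0, np.nan, 30.0, np.nan, 50.0],
        "Rating":   [4.0, 3.0, np.nan, 5.0, 2.0],
        "Stock":    ["In Stock", "Out of Stock", None, "In Stock", "In Stock"],
        "Discount": [5.0, np.nan, 15.0, 20.0, np.nan],
    })

    # Simple strategy: median for numeric columns, mode for categorical ones.
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].fillna(df[col].mode()[0])
        else:
            df[col] = df[col].fillna(df[col].median())

    print(df.isna().sum().sum())
    ```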

  7. Fruit Tabular Classification Dataset

    • cubig.ai
    zip
    Updated Jul 8, 2025
    Cite
    CUBIG (2025). Fruit Tabular Classification Dataset [Dataset]. https://cubig.ai/store/products/563/fruit-tabular-classification-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
    Description

    1) Data Introduction
    • The Fruit Classification Dataset is a beginner-level classification dataset for classifying fruit types from fruit name, color, and weight information.

    2) Data Utilization
    (1) The Fruit Classification Dataset has the following characteristics:
    • It consists of three columns in total: the categorical variable Color, the continuous variable Weight, and the target class Fruit, so learners can practice preprocessing both categorical and numerical variables when training classification models.
    (2) The Fruit Classification Dataset can be used to:
    • Model training and evaluation: as educational and research data for comparing the performance of various machine learning classification algorithms using the color and weight features.
    • Data preprocessing practice: as hands-on material for basic preprocessing and feature engineering tasks such as categorical variable encoding and continuous variable scaling.
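    The two preprocessing steps named above (encoding the categorical Color, scaling the continuous Weight) can be sketched with pandas alone; the rows below are invented for illustration, not taken from the CUBIG download:

    ```python
    import pandas as pd

    # Tiny invented frame with the three described columns.
    fruit = pd.DataFrame({
        "Color":  ["Red", "Yellow", "Green", "Red"],
        "Weight": [150.0, 120.0, 180.0, 160.0],
        "Fruit":  ["Apple", "Banana", "Apple", "Apple"],
    })

    # Categorical encoding: one-hot encode Color.
    X = pd.get_dummies(fruit[["Color", "Weight"]], columns=["Color"])

    # Continuous scaling: z-score the Weight column.
    X["Weight"] = (X["Weight"] - X["Weight"].mean()) / X["Weight"].std()

    print(X.columns.tolist())
    ```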

  8. DA_2DCHROM - Sample dataset

    • data.niaid.nih.gov
    Updated Sep 12, 2022
    Cite
    Ladislavova Nikola; Pojmanova Petra (2022). DA_2DCHROM - Sample dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7068335
    Explore at:
    Dataset updated
    Sep 12, 2022
    Dataset provided by
    UCT Prague, Department of Analytical chemistry
    Authors
    Ladislavova Nikola; Pojmanova Petra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary materials for the application of DA_2DCHROM - data alignment

    https://doi.org/10.5281/zenodo.7040975

    Content:

    data – folder for raw data

    full_dataset_alignment – 100 graphical comparisons of randomly picked pairs from full dataset

    graph_results – graphical representation of obtained results

    metadata – results of midsteps in data alignment process

    data folder:

    Subfolder Sample_dataset contains 20 sample chromatograms, each processed at S/N levels 100, 300, and 500

    full_dataset_alignment:

    100 graphical comparisons of randomly picked pairs from full dataset. The pairs are same for both algorithms.

    graph_results folder:

    To reduce the total size of Supplementary materials, only results for S/N level 500 are exported.

    Each subfolder (folder names correspond to the names of the algorithms used throughout the study) contains a numerical (K-S test) and a graphical representation of the alignment. In case of failed alignment (e.g., not enough anchor points for BiPACE2D), the graphs are left blank.

    metadata folder:

          merged_peaks folder: formatted data with merged peaks (results of the preprocessing part of the data_alignment_chromatography_v1.4 script).

          ref_data folder: lists of manually exported referential peaks for each chromatogram; input data for the RI algorithm.

          time_correction folder: each algorithm subfolder contains the result of the data alignment itself. For each aligned chromatogram there are 3 files: the aligned chromatogram itself (the .txt file with the most bytes), the list of detected anchor peaks (.txt with the _anchors extension), and a simple graphical check of the alignment (.png).

  9. Data from: Assessing predictive performance of supervised machine learning...

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated May 23, 2023
    Cite
    Evans Omondi (2023). Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model [Dataset]. http://doi.org/10.5061/dryad.wh70rxwrh
    Explore at:
    Available download formats: zip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Strathmore University
    Authors
    Evans Omondi
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work, including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metric values, eXtreme Gradient Boosting was found to be the most optimal algorithm in both classification and regression, with an R2 score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.

    Methods: Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets and, in a web-based data-science environment, study datasets and construct models.
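    The R2 score used to rank the regressors above has a direct closed form, 1 - SS_res/SS_tot; a small numpy sketch on illustrative prices (these numbers are invented, not from the study's data):

    ```python
    import numpy as np

    def r2_score(y_true, y_pred):
        """Coefficient of determination: 1 - SS_res / SS_tot."""
        y_true = np.asarray(y_true, float)
        y_pred = np.asarray(y_pred, float)
        ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
        ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
        return 1.0 - ss_res / ss_tot

    # Illustrative diamond prices (invented):
    y_true = [500.0, 1500.0, 3000.0, 5000.0]
    y_pred = [600.0, 1400.0, 2900.0, 5100.0]
    print(r2_score(y_true, y_pred))
    ```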

  10. USA Bank Financial Data

    • kaggle.com
    zip
    Updated Jun 28, 2024
    Cite
    VISHAL SINGH SANGRAL (2024). USA Bank Financial Data [Dataset]. https://www.kaggle.com/datasets/vishalsinghsangral/usa-bank-financial-data
    Explore at:
    Available download formats: zip (20684 bytes)
    Dataset updated
    Jun 28, 2024
    Authors
    VISHAL SINGH SANGRAL
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Description:

    The myusabank.csv dataset contains daily financial data for a fictional bank (MyUSA Bank) over a two-year period. It includes various key financial metrics such as interest income, interest expense, average earning assets, net income, total assets, shareholder equity, operating expenses, operating income, market share, and stock price. The data is structured to simulate realistic scenarios in the banking sector, including outliers, duplicates, and missing values for educational purposes.

    Potential Student Tasks:

    1. Data Cleaning and Preprocessing:

      • Handle missing values, duplicates, and outliers to ensure data integrity.
      • Normalize or scale data as needed for analysis.
    2. Exploratory Data Analysis (EDA):

      • Visualize trends and distributions of financial metrics over time.
      • Identify correlations between different financial indicators.
    3. Calculating Key Performance Indicators (KPIs):

      • Compute metrics such as Net Interest Margin (NIM), Return on Assets (ROA), Return on Equity (ROE), and Cost-to-Income Ratio using calculated fields.
      • Analyze the financial health and performance of MyUSA Bank based on these KPIs.
    4. Building Tableau Dashboards:

      • Design interactive dashboards to present insights and trends.
      • Include summary cards, bar charts, line charts, and pie charts to visualize financial performance metrics.
    5. Forecasting and Predictive Modeling:

      • Use historical data to forecast future financial performance.
      • Apply regression or time series analysis to predict market share or stock price movements.
    6. Business Insights and Reporting:

      • Interpret findings to derive actionable insights for bank management.
      • Prepare reports or presentations summarizing key findings and recommendations.
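    The KPIs in task 3 follow standard banking definitions (NIM = (interest income - interest expense) / average earning assets; ROA = net income / total assets; ROE = net income / shareholder equity; cost-to-income = operating expenses / operating income). A pandas sketch with assumed column names and one made-up row; the actual myusabank.csv headers may differ:

    ```python
    import pandas as pd

    # One illustrative row; column names are assumptions based on the
    # metric list in the description, not the real CSV headers.
    bank = pd.DataFrame({
        "interest_income":    [5_000_000.0],
        "interest_expense":   [2_000_000.0],
        "avg_earning_assets": [100_000_000.0],
        "net_income":         [1_500_000.0],
        "total_assets":       [120_000_000.0],
        "shareholder_equity": [10_000_000.0],
        "operating_expenses": [1_800_000.0],
        "operating_income":   [3_000_000.0],
    })

    # Standard KPI calculated fields:
    bank["NIM"] = (bank["interest_income"] - bank["interest_expense"]) / bank["avg_earning_assets"]
    bank["ROA"] = bank["net_income"] / bank["total_assets"]
    bank["ROE"] = bank["net_income"] / bank["shareholder_equity"]
    bank["cost_to_income"] = bank["operating_expenses"] / bank["operating_income"]

    print(bank[["NIM", "ROA", "ROE", "cost_to_income"]].round(4))
    ```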

    Educational Goals:

    The dataset aims to provide hands-on experience in data preprocessing, analysis, and visualization within the context of banking and finance. It encourages students to apply data science techniques to real-world financial data, enhancing their skills in data-driven decision-making and strategic analysis.

  11. Synthetic Stroke Prediction Dataset

    • data.mendeley.com
    • kaggle.com
    Updated May 2, 2025
    + more versions
    Cite
    Mohammed Borhan Uddin (2025). Synthetic Stroke Prediction Dataset [Dataset]. http://doi.org/10.17632/s2nh6fm925.1
    Explore at:
    Dataset updated
    May 2, 2025
    Authors
    Mohammed Borhan Uddin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a synthetic version inspired by the original "Stroke Prediction Dataset" on Kaggle. It contains anonymized, artificially generated data intended for research and model training on healthcare-related stroke prediction. The dataset generated using GPT-4o contains 50,000 records and 12 features. The target variable is stroke, a binary classification where 1 represents stroke occurrence and 0 represents no stroke. The dataset includes both numerical and categorical features, requiring preprocessing steps before analysis. A small portion of the entries includes intentionally introduced missing values to allow users to practice various data preprocessing techniques such as imputation, missing data analysis, and cleaning. The dataset is suitable for educational and research purposes, particularly in machine learning tasks related to classification, healthcare analytics, and data cleaning. No real-world patient information was used in creating this dataset.
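    Because the dataset mixes numeric and categorical features with intentionally introduced missing values, imputation is best folded into the model pipeline so it is fit on training data only. A hedged scikit-learn sketch on synthetic stand-in features (the real 12 features and 50,000 rows are not reproduced here):

    ```python
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in features (imagine age, glucose level, BMI) and a
    # binary target, with ~10% missing values injected afterwards.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)
    X[rng.random(X.shape) < 0.1] = np.nan

    # Imputation inside the pipeline: median fill, then classification.
    clf = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("model", LogisticRegression()),
    ])
    clf.fit(X, y)
    print(clf.score(X, y))
    ```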

  12. Harry Potter Sorting Dataset

    • cubig.ai
    zip
    Updated Jul 14, 2025
    Cite
    CUBIG (2025). Harry Potter Sorting Dataset [Dataset]. https://cubig.ai/store/products/583/harry-potter-sorting-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 14, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Synthetic data generation using AI techniques for model training, Privacy-preserving data transformation via differential privacy
    Description

    1) Data Introduction
    • The Harry Potter Sorting Dataset contains various attributes and Hogwarts house assignments for 1,000 fictional students in the Harry Potter universe, and is designed for machine learning classification exercises such as house assignment based on student traits.

    2) Data Utilization
    (1) The Harry Potter Sorting Dataset has the following characteristics:
    • Each student record contains the assigned Hogwarts house (House), along with several attribute columns, including name, gender, ancestry, region of origin, personality traits, and magic-related abilities.
    • The House is divided into four categories: Gryffindor, Slytherin, Ravenclaw, and Hufflepuff.
    (2) The Harry Potter Sorting Dataset can be used to:
    • Development of a house classification model: using the student-characteristic data, a Hogwarts house classification model can be built and its prediction accuracy evaluated.
    • Data science practice and training: it can be used for exercises such as feature selection, data preprocessing, and classification modeling.

  13. Sample Geodata and Software for Demonstrating Geospatial Preprocessing for...

    • data.europa.eu
    • envidat.ch
    • +1more
    png, tiff, unknown +1
    Updated May 22, 2019
    Cite
    EnviDat (2019). Sample Geodata and Software for Demonstrating Geospatial Preprocessing for Forest Accessibility and Wood Harvesting at FOSS4G2019 [Dataset]. https://data.europa.eu/data/datasets/d28614a0-0825-4040-bc1b-e0455b1e4df6-envidat?locale=de
    Explore at:
    Available download formats: png(391085), zip(66908), zip(50776), unknown(29318), unknown, tiff(1695063), zip(288311), zip(2083)
    Dataset updated
    May 22, 2019
    Dataset authored and provided by
    EnviDat
    License

    http://dcat-ap.ch/vocabulary/licenses/terms_by

    Description

    This dataset contains open vector data for railways, forests and power lines, as well as an open digital elevation model (DEM), for a small area around a sample forest range in Europe (Germany, Upper Bavaria, Kochel Forest Range, some 70 km south of München, at the edge of the Bavarian Alps). Its purpose is to provide a documented sample dataset for demonstrating geospatial preprocessing at FOSS4G2019 based on open data and software. The sample was produced from several existing open data sources (detailed below), thereby documenting the sources for obtaining data needed for computations related to forest accessibility and wood harvesting. For example, the data can be used with the open methodology and QGIS plugin Seilaplan for optimising the geometric layout of cable roads, or with additional open software for computing forest accessibility for wood harvesting. The vector data (railways, forests and power lines) were extracted from OpenStreetMap (data copyrighted OpenStreetMap contributors and available from https://www.openstreetmap.org). The railways and forests were downloaded and extracted on 18.05.2019 using the open-source QGIS (https://www.qgis.org) with the QuickOSM plugin, while the power lines were downloaded a few days later on 23.05.2019.

    Additional notes for vector data: OpenStreetMap data extracts such as forests, roads and railways (except power lines) can also be downloaded in a GIS-friendly format (Shapefile) from http://download.geofabrik.de/ or using the QGIS built-in download function for OpenStreetMap data. The most efficient way to retrieve specific OSM tags (such as power=line) is to use the QuickOSM plugin for QGIS (via the Overpass API - https://wiki.openstreetmap.org/wiki/Overpass_API) or overpass turbo directly (https://overpass-turbo.eu/). Finally, the digitised perimeter of the sample forest range is also made available for reproducibility purposes, although any perimeter or area can be digitised freely using the QGIS editing toolbar.

    The DEM was originally adapted and modified also with QGIS (https://www.qgis.org) based on the elevation data available from two different sources, by reprojecting and downsampling datasets to 25m then selecting, for each individual raster cell, the elevation value that was closer to the average. These two different elevation sources are:

    This methodology was chosen as a way of performing a basic quality check, by comparing the EU-DEM v.1.1 derived from globally available DEM data (such as SRTM) with more authoritative data for the randomly selected region, since using authoritative data is preferred (if open and available). For other sample regions, where authoritative open data is not available, such comparisons cannot longer be performed.

    Additional notes DEM: a very good DEM open data source for Germany is the open data set collected and resampled by Sonny (sonnyy7@gmail.com) and made available on the Austrian Open Data Portal http://data.opendataportal.at/dataset/dtm-germany. In order to simplify end-to-end reproducibility of the paper planned for FOSS4G2019, we use and distribute an adapted (reprojected and resampled to 25 meters) sample of the above mentioned dataset for the selected forest range.

    This sample dataset is accompanied by software in Python, as a Jupyter Notebook that generates harmonized output rasters with the same extent from the input data. The extent is given by the polygon vector dataset (Perimeter). These output rasters, such as obstacles, aspect, slope and forest cover, can serve as input data for later computations related to forest accessibility and wood harvesting. The obstacles output is obtained by rasterizing the line vector datasets (railway lines, high-voltage power lines). Aspect and slope are both derived from the sample digital elevation model.
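    The slope derivation from the DEM can be illustrated with plain numpy finite differences; a real workflow would use QGIS/GDAL on the 25 m GeoTIFF, and this toy grid is invented:

    ```python
    import numpy as np

    # Tiny synthetic DEM (metres) on a 25 m grid; elevation rises
    # uniformly to the east, so the true slope is 45 degrees everywhere.
    cell = 25.0
    dem = np.array([[100.0, 125.0, 150.0],
                    [100.0, 125.0, 150.0],
                    [100.0, 125.0, 150.0]])

    # Finite-difference gradients (metres of rise per metre of distance);
    # axis 0 is the row (y) direction, axis 1 the column (x) direction.
    dz_dy, dz_dx = np.gradient(dem, cell)

    # Slope in degrees from the gradient magnitude.
    slope_deg = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
    print(slope_deg[1, 1])
    ```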

  14. Sample dataset for the models trained and tested in the paper 'Can AI be...

    • zenodo.org
    zip
    Updated Aug 1, 2024
    + more versions
    Cite
    Elena Tomasi; Gabriele Franch; Marco Cristoforetti (2024). Sample dataset for the models trained and tested in the paper 'Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy' [Dataset]. http://doi.org/10.5281/zenodo.12934521
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Elena Tomasi; Gabriele Franch; Marco Cristoforetti
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This repository contains a sample of the input data for the models of the preprint "Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over Italy". It allows the user to test and train the models on a reduced dataset (45GB).

    This sample dataset comprises ~3 years of normalized hourly data for both low-resolution predictors and high-resolution target variables. Data has been randomly picked from the whole dataset, from 2000 to 2020, with 70% of data coming from the original training dataset, 15% from the original validation dataset, and 15% from the original test dataset. Low-resolution data are preprocessed ERA5 data while high-resolution data are preprocessed VHR-REA CMCC data. Details on the performed preprocessing are available in the paper.

    This sample dataset also includes files related to metadata, static data, normalization, and plotting.

    To use the data, clone the corresponding repository and unzip this zip file in the data folder.

  15. Preprocessing steps.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 28, 2024
    + more versions
    Cite
    Kim, Min-Hee; Ahn, Hyeong Jun; Ishikawa, Kyle (2024). Preprocessing steps. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001483628
    Explore at:
    Dataset updated
    Jun 28, 2024
    Authors
    Kim, Min-Hee; Ahn, Hyeong Jun; Ishikawa, Kyle
    Description

    In this study, we employed various machine learning models to predict metabolic phenotypes, focusing on thyroid function, using a dataset from the National Health and Nutrition Examination Survey (NHANES) from 2007 to 2012. Our analysis utilized laboratory parameters relevant to thyroid function or metabolic dysregulation in addition to demographic features, aiming to uncover potential associations between thyroid function and metabolic phenotypes by various machine learning methods. Multinomial Logistic Regression performed best to identify the relationship between thyroid function and metabolic phenotypes, achieving an area under receiver operating characteristic curve (AUROC) of 0.818, followed closely by Neural Network (AUROC: 0.814). Following the above, the performance of Random Forest, Boosted Trees, and K Nearest Neighbors was inferior to the first two methods (AUROC 0.811, 0.811, and 0.786, respectively). In Random Forest, homeostatic model assessment for insulin resistance, serum uric acid, serum albumin, gamma glutamyl transferase, and triiodothyronine/thyroxine ratio were positioned in the upper ranks of variable importance. These results highlight the potential of machine learning in understanding complex relationships in health data. However, it’s important to note that model performance may vary depending on data characteristics and specific requirements. Furthermore, we emphasize the significance of accounting for sampling weights in complex survey data analysis and the potential benefits of incorporating additional variables to enhance model accuracy and insights. Future research can explore advanced methodologies combining machine learning, sample weights, and expanded variable sets to further advance survey data analysis.
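    The AUROC values reported above can be computed directly from classifier scores via the Mann-Whitney identity (the probability that a randomly chosen positive is scored higher than a randomly chosen negative, with ties split); a numpy sketch on made-up scores, not the study's data:

    ```python
    import numpy as np

    def auroc(y_true, scores):
        """AUROC via pairwise comparisons: P(score_pos > score_neg), ties count half."""
        y_true = np.asarray(y_true, bool)
        scores = np.asarray(scores, float)
        pos, neg = scores[y_true], scores[~y_true]
        greater = (pos[:, None] > neg[None, :]).sum()
        ties = (pos[:, None] == neg[None, :]).sum()
        return (greater + 0.5 * ties) / (len(pos) * len(neg))

    # Made-up labels and predicted scores:
    y = [1, 1, 0, 0, 1, 0]
    s = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
    print(auroc(y, s))
    ```

    For large samples this O(n^2) sketch would be replaced by a rank-based computation, but the value is identical.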

  16. Assessment of the sufficiency of information provided for reproducibility.

    • figshare.com
    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Christina Fell; Mahnaz Mohammadi; David Morrison; Ognjen Arandjelovic; Peter Caie; David Harris-Birtill (2023). Assessment of the sufficiency of information provided for reproducibility. [Dataset]. http://doi.org/10.1371/journal.pdig.0000145.t001
    Explore at:
    xls
    Available download formats
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS Digital Health
    Authors
    Christina Fell; Mahnaz Mohammadi; David Morrison; Ognjen Arandjelovic; Peter Caie; David Harris-Birtill
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Assessment of the sufficiency of information provided for reproducibility.

  17. Random Sample of NIH Chest X ray Dataset

    • cubig.ai
    zip
    Updated May 28, 2025
    CUBIG (2025). Random Sample of NIH Chest X ray Dataset [Dataset]. https://cubig.ai/store/products/354/random-sample-of-nih-chest-x-ray-dataset
    Explore at:
    zip
    Available download formats
    Dataset updated
    May 28, 2025
    Dataset authored and provided by
    CUBIG
    License

    https://cubig.ai/store/terms-of-service

    Measurement technique
    Synthetic data generation using AI techniques for model training, Privacy-preserving data transformation via differential privacy
    Description

    1) Data Introduction • The Random Sample of NIH Chest X-ray Dataset is a sample version of a large public medical imaging dataset containing 112,120 chest X-ray images and 15 disease (or normal) labels collected from 30,805 patients.

    2) Data Utilization
    (1) The Random Sample of NIH Chest X-ray Dataset has the following characteristics:
    • Each sample comes with detailed metadata, including image file name, disease label, patient ID, age, gender, view position, and image size; the labels were extracted from the radiology reports using NLP, with a reported accuracy above 90%.
    • It contains 5,606 images of size 1024x1024, covering 14 diseases plus a 'No Finding' class, but because it is a sample, some disease classes are very scarce.
    (2) The Random Sample of NIH Chest X-ray Dataset can be used for:
    • Developing chest-disease image-reading AI: deep learning based automatic diagnosis and classification models can be trained and evaluated on X-ray images carrying the various chest-disease labels.
    • Medical image preprocessing and labeling research: automatic labeling of large medical image datasets, data quality evaluation, and weakly supervised learning.
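To make the class imbalance noted above visible, a small sketch like the one below can tally per-disease image counts from the sample's metadata. The "Finding Labels" column name and its pipe-separated multi-label format follow the NIH release's convention, but the exact file layout of this particular download is an assumption, so an inline stand-in DataFrame is used instead of a real CSV read.

```python
# Hedged sketch: counting images per disease label to expose scarce
# classes. The column names mirror the NIH metadata convention; the
# DataFrame below is a stand-in for pd.read_csv("sample_labels.csv").
from collections import Counter
import pandas as pd

df = pd.DataFrame({
    "Image Index": ["a.png", "b.png", "c.png", "d.png"],
    "Finding Labels": ["Cardiomegaly|Effusion", "No Finding",
                       "Effusion", "No Finding"],
})

# Split the pipe-separated multi-labels and count each disease once
# per image; rare diseases end up at the tail of the ranking.
counts = Counter(label
                 for labels in df["Finding Labels"]
                 for label in labels.split("|"))
print(counts.most_common())
```

On the actual 5,606-image sample, the same tally would show 'No Finding' dominating while several of the 14 disease classes have only a handful of examples, which is worth checking before training a classifier.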

  18. Comparison of original and reimplementation results of Liu paper.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Christina Fell; Mahnaz Mohammadi; David Morrison; Ognjen Arandjelovic; Peter Caie; David Harris-Birtill (2023). Comparison of original and reimplementation results of Liu paper. [Dataset]. http://doi.org/10.1371/journal.pdig.0000145.t004
    Explore at:
    xls
    Available download formats
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS Digital Health
    Authors
    Christina Fell; Mahnaz Mohammadi; David Morrison; Ognjen Arandjelovic; Peter Caie; David Harris-Birtill
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of original and reimplementation results of Liu paper.

  19. Comparison of original and reimplementation results of Wang paper.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Christina Fell; Mahnaz Mohammadi; David Morrison; Ognjen Arandjelovic; Peter Caie; David Harris-Birtill (2023). Comparison of original and reimplementation results of Wang paper. [Dataset]. http://doi.org/10.1371/journal.pdig.0000145.t002
    Explore at:
    xls
    Available download formats
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS Digital Health
    Authors
    Christina Fell; Mahnaz Mohammadi; David Morrison; Ognjen Arandjelovic; Peter Caie; David Harris-Birtill
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of original and reimplementation results of Wang paper.

  20. Pharmaceutical Sample Preprocessing System Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Nov 7, 2025
    Data Insights Market (2025). Pharmaceutical Sample Preprocessing System Report [Dataset]. https://www.datainsightsmarket.com/reports/pharmaceutical-sample-preprocessing-system-201252
    Explore at:
    pdf, ppt, doc
    Available download formats
    Dataset updated
    Nov 7, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Explore the dynamic Pharmaceutical Sample Preprocessing System market with insights on growth drivers, trends, restraints, and regional analysis. Discover market size projections and key players shaping the future of drug discovery and diagnostics.
