15 datasets found
  1. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    Available download formats: csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used for training, validating, and testing the classification models.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need a Python environment such as VS Code or Jupyter Notebook, together with tools such as the following (a short loading sketch is given after the list):

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
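
      A minimal sketch, assuming the file names listed above and a pandas-based workflow; the label column name is a placeholder, not taken from the dataset documentation:

      import pandas as pd

      # Load the three splits described above.
      train_df = pd.read_csv("train_data.csv")
      val_df = pd.read_csv("validation_data.csv")
      test_df = pd.read_csv("test_data.csv")

      # Separate features from the target; "label" is a hypothetical column name.
      X_train, y_train = train_df.drop(columns=["label"]), train_df["label"]
      X_val, y_val = val_df.drop(columns=["label"]), val_df["label"]
      X_test, y_test = test_df.drop(columns=["label"]), test_df["label"]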

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  2. Tour Recommendation Model

    • test.researchdata.tuwien.at
    bin, png +1
    Updated May 14, 2025
    Cite
    Muhammad Mobeel Akbar (2025). Tour Recommendation Model [Dataset]. http://doi.org/10.70124/akpf6-8p175
    Explore at:
    Available download formats: text/markdown, png, bin
    Dataset updated
    May 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Muhammad Mobeel Akbar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Dataset Description for Tour Recommendation Model

    Context and Methodology:

    • Research Domain/Project:
      This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.

    • Purpose:
      The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.

    • Creation Methodology:
      The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.

    Technical Details:

    • Structure of the Dataset:
      The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:

      • place_or_event_id: Unique identifier for each tourist place or event.

      • rating: Rating given by the user, ranging from 1 to 5.

      The data is split into three subsets:

      • Training Set: 80% of the dataset used to train the model.

      • Validation Set: A small portion used for hyperparameter tuning.

      • Test Set: 20% used to evaluate model performance.

    • Folder and File Naming Conventions:
      The dataset files are stored in the following structure:

      • user_ratings_dataset.csv: The original dataset file containing user ratings.

      • tour_recommendation_model.pkl: The saved model after training.

      • actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.

    • Software Requirements:
      To open and work with this dataset, the following software and libraries are required:

      • Python 3.x

      • Pandas for data manipulation

      • Scikit-learn for training and evaluating machine learning models

      • Matplotlib for chart generation

      • Joblib for saving and loading the trained model

      The dataset can be opened and processed using any Python environment that supports these libraries.
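
      A minimal sketch of the pipeline described above (a Decision Tree Regressor trained on the ratings file); the feature handling is deliberately simplified and not taken from the project code:

      import pandas as pd
      import joblib
      from sklearn.model_selection import train_test_split
      from sklearn.tree import DecisionTreeRegressor
      from sklearn.metrics import mean_absolute_error

      # Load the ratings file described above.
      df = pd.read_csv("user_ratings_dataset.csv")

      # One-hot encode everything except the target so non-numeric identifiers can serve as features.
      X = pd.get_dummies(df.drop(columns=["rating"]), drop_first=True)
      y = df["rating"]

      # 80/20 split, matching the training/test proportions stated above (validation tuning omitted).
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

      model = DecisionTreeRegressor(random_state=0)
      model.fit(X_train, y_train)
      print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))

      # Persist the trained model, mirroring tour_recommendation_model.pkl.
      joblib.dump(model, "tour_recommendation_model.pkl")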

    • Additional Resources:

      • The model training code, README file, and performance chart are available in the project repository.

      • For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).

    Further Details:

    • Dataset Reusability:
      The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:

      • Train other types of models (e.g., regression, classification).

      • Experiment with different features or add more metadata to enrich the dataset.

    • Data Integrity:
      The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.

    • Licensing:
      The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.

  3. Privacy-Sensitive Conversations between Care Workers and Care Home Residents...

    • researchdata.tuwien.ac.at
    • test.researchdata.tuwien.ac.at
    • +1more
    bin, text/markdown
    Updated Feb 25, 2025
    Cite
    Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns (2025). Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home [Dataset]. http://doi.org/10.48436/q1kt0-edc53
    Explore at:
    Available download formats: bin, text/markdown
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    TU Wien
    Authors
    Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2024 - Aug 2024
    Description

    Dataset Card for "privacy-care-interactions"


    Dataset Description

    Purpose and Features

    🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home 🔒

    The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.

    Dataset Overview

    Language Distribution 🌍

    • English (en): 95

    Locale Distribution 🌎

    • United States (US) 🇺🇸: 95

    Key Facts 🔑

    • This is synthetic data! Generated using proprietary algorithms - no privacy violations!
    • Conversations are classified following the taxonomy for privacy-sensitive robotics by Rueben et al. (2017).
    • The data was manually labeled by an expert.

    Dataset Structure

    Data Instances

    The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.

    { "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }

    Data Fields

    The data fields are:

    • text: a string feature. The abbreviations of the speakers refer to the care worker (CW) and the care recipient (CR).
    • taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.
    • category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.
    • affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.
    • language: a string feature. Language code as defined by ISO 639.
    • locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.
    • data_type: a classification label, with possible values including real (0), synthetic (1).
    • uid: an int64 feature. A unique identifier within the dataset.
    • split: a string feature. Either train, validation or test.

    Dataset Splits

    The dataset has 2 subsets:

    • split: with a total of 95 examples split into train, validation and test (70%-15%-15%)
    • unsplit: with a total of 95 examples in a single train split

    | name | train | validation | test |
    | --- | --- | --- | --- |
    | split | 66 | 14 | 15 |
    | unsplit | 95 | n/a | n/a |

    The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:

    • split-train-en.jsonl
    • split-validation-en.jsonl
    • split-test-en.jsonl
    • unsplit-train-en.jsonl

    Dataset Creation

    Curation Rationale

    Recording audio of care workers and residents during care interactions, which include partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, a dataset was created that includes privacy-sensitive parts of conversations, synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, in order to then mask them and protect privacy.

    Source Data

    Initial Data Collection

    The initial data was collected in the project Caring Robots of TU Wien in cooperation with Caritas Wien. One project track aims to use Large Language Models (LLMs) to support the documentation work of care workers, with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.

    Data Processing

    The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked using qualitative data analysis software by two experts. Subsequently, the sections were translated from German to U.S. English using the locally executed LLM icky/translate. In the next step, a llama3.1:70b model was used locally to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from scikit-learn (https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html).

  4. machine learning models on the WDBC dataset

    • scidb.cn
    Updated Apr 15, 2025
    Cite
    Mahdi Aghaziarati (2025). machine learning models on the WDBC dataset [Dataset]. http://doi.org/10.57760/sciencedb.23537
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Mahdi Aghaziarati
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors: mean, standard error (SE), and worst (the mean of the three largest values), for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension.

    All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling.

    The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers (Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)) were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability.

    All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically-relevant dataset.
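
    A minimal sketch of the workflow described above (z-score standardization, stratified 80/20 split, and the four lightweight classifiers); the CSV file and column names are assumptions based on the usual Kaggle export of WDBC, not taken from this record:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import Perceptron
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score, f1_score

    # Load the data, drop the identifier column, and encode the diagnosis label (M = 1, B = 0).
    df = pd.read_csv("wdbc.csv").drop(columns=["id"])
    y = df["diagnosis"].map({"M": 1, "B": 0})
    X = StandardScaler().fit_transform(df.drop(columns=["diagnosis"]))  # z-score standardization

    # Stratified 80/20 split preserving class balance.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

    models = {
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Naive Bayes": GaussianNB(),
        "Perceptron": Perceptron(),
        "KNN": KNeighborsClassifier(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        print(name, accuracy_score(y_test, pred), f1_score(y_test, pred))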

  5. Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...

    • narcis.nl
    • data.mendeley.com
    Updated Jan 11, 2021
    + more versions
    Cite
    Yoo, T (via Mendeley Data) (2021). Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics" [Dataset]. http://doi.org/10.17632/ffn745r57z.2
    Explore at:
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Yoo, T (via Mendeley Data)
    Description

    Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD, Ik Hee Ryu, MD, MS, Tae Keun Yoo, MD, Jung Sub Kim MD, In Sik Lee, MD, PhD, Jin Kook Kim MD, Wakako Ando CO, Nobuyuki Shoji, MD, PhD, Tomofusa Yamauchi, MD, PhD, Hitoshi Tabuchi, MD, PhD.

    We hypothesize that machine learning of preoperative biometric data obtained by the As-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built the machine learning model using Random Forest to predict ICL vault after surgery.

    This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).

    This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.

    Python version:

    # Imports (RandomForestClassifier is imported in the original notebook, but only the regressor is used below).
    from sklearn.model_selection import train_test_split
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import RandomForestRegressor

    # Connect the data in your Google Drive (the notebook was run in Google Colab).
    from google.colab import auth
    auth.authenticate_user()
    from google.colab import drive
    drive.mount('/content/gdrive')

    # Change the path for the custom data.
    # In this case, we used ICL vault prediction from preoperative measurements.
    dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
    dataset.head()

    # Optimal features (sorted by importance):
    # 1. ICL size  2. ICL power  3. LV  4. CLR  5. ACD  6. ATA
    # 7. MSE  8. Age  9. Pupil size  10. WTW  11. CCT  12. ACW
    y = dataset['Vault_1M']
    X = dataset.drop(['Vault_1M'], axis=1)

    # Split the dataset into train and test data, if necessary.
    # For example, we can split the data 8:2 as a simple validation test.
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

    # In our study, we had already defined the training (B&VIIT Eye Center, n=1455) and
    # test (Kitasato University, n=290) datasets, so this split was not necessary for our analysis.

    # Optimal parameter search could be performed in this section
    # ('mae' is the criterion name used by older scikit-learn versions).
    parameters = {'bootstrap': True, 'min_samples_leaf': 3, 'n_estimators': 500,
                  'criterion': 'mae', 'min_samples_split': 10, 'max_features': 'sqrt',
                  'max_depth': 6, 'max_leaf_nodes': None}

    RF_model = RandomForestRegressor(**parameters)
    RF_model.fit(train_X, train_y)
    RF_predictions = RF_model.predict(test_X)
    importance = RF_model.feature_importances_

  6. Embrapa ADD 256 Dataset

    • paperswithcode.com
    Updated Oct 23, 2021
    Cite
    (2021). Embrapa ADD 256 Dataset [Dataset]. https://paperswithcode.com/dataset/embrapa-add-256
    Explore at:
    Dataset updated
    Oct 23, 2021
    Description

    This is a detailed description of the dataset, a data sheet for the dataset as proposed by Gebru et al.

    Motivation for Dataset Creation Why was the dataset created? Embrapa ADD 256 (Apples by Drones Detection Dataset — 256 × 256) was created to provide images and annotation for research on apple detection in orchards for UAV-based monitoring in apple production.

    What (other) tasks could the dataset be used for? Apple detection in low-resolution scenarios, similar to the aerial images employed here.

    Who funded the creation of the dataset? The building of the ADD256 dataset was supported by the Embrapa SEG Project 01.14.09.001.05.04, Image-based metrology for Precision Agriculture and Phenotyping, and FAPESP under grant (2017/19282-7).

    Dataset Composition What are the instances? Each instance consists of an RGB image and an annotation describing apple locations as circular markers (i.e., giving the center and radius).

    How many instances of each type are there? The dataset consists of 1,139 images containing 2,471 apples.

    What data does each instance consist of? Each instance contains an 8-bit RGB image. Its corresponding annotation is found in the JSON files: each apple marker is composed of its center (cx, cy) and its radius (in pixels), as seen below:

    "gebler-003-06.jpg": [ { "cx": 116, "cy": 117, "r": 10 }, { "cx": 134, "cy": 113, "r": 10 }, { "cx": 221, "cy": 95, "r": 11 }, { "cx": 206, "cy": 61, "r": 11 }, { "cx": 92, "cy": 1, "r": 10 } ],

    Dataset.ipynb is a Jupyter Notebook presenting a code example for reading the data as a PyTorch Dataset (it should be straightforward to adapt the code for other frameworks such as Keras/TensorFlow, fastai/PyTorch, Scikit-learn, etc.)
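
    The notebook above is the authoritative example; as a rough illustration only, a PyTorch Dataset wrapping the image patches and the circular-marker JSON annotations might look like this (file and directory names are assumptions):

    import json
    from pathlib import Path
    from PIL import Image
    import torch
    from torch.utils.data import Dataset

    class Add256Dataset(Dataset):
        """Pairs each RGB patch with its apple markers (cx, cy, r); transforms are omitted."""

        def __init__(self, image_dir, annotation_file):
            self.image_dir = Path(image_dir)
            with open(annotation_file) as f:
                self.annotations = json.load(f)  # {"image.jpg": [{"cx": ..., "cy": ..., "r": ...}, ...]}
            self.image_names = sorted(self.annotations)

        def __len__(self):
            return len(self.image_names)

        def __getitem__(self, idx):
            name = self.image_names[idx]
            image = Image.open(self.image_dir / name).convert("RGB")
            markers = torch.tensor(
                [[m["cx"], m["cy"], m["r"]] for m in self.annotations[name]],
                dtype=torch.float32)
            return image, markers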

    Is everything included or does the data rely on external resources? Everything is included in the dataset.

    Are there recommended data splits or evaluation measures? The dataset comes with specified train/test splits. The splits are found in lists stored as JSON files.

    | | Number of images | Number of annotated apples |
    | --- | --- | --- |
    | Training | 1,025 | 2,204 |
    | Test | 114 | 267 |
    | Total | 1,139 | 2,471 |

    Dataset recommended split.

    Standard measures from the information retrieval and computer vision literature should be employed: precision and recall, F1-score and average precision as seen in COCO and Pascal VOC.

    What experiments were initially run on this dataset? The first experiments run on this dataset are described in A methodology for detection and location of fruits in apples orchards from aerial images by Santos & Gebler (2021).

    Data Collection Process How was the data collected? The data employed in the development of the methodology came from two plots located at Embrapa’s Temperate Climate Fruit Growing Experimental Station at Vacaria-RS (28°30’58.2”S, 50°52’52.2”W). Plants of the varieties Fuji and Gala are present in the dataset, in equal proportions. The images were taken on December 13, 2018, by a UAV (DJI Phantom 4 Pro) that flew over the rows of the field at a height of 12 m. The images mix nadir and non-nadir views, allowing a more extensive view of the canopies. A subset of the images was randomly selected and 256 × 256 pixel patches were extracted.

    Who was involved in the data collection process? T. T. Santos and L. Gebler captured the images in field. T. T. Santos performed the annotation.

    How was the data associated with each instance acquired? The circular markers were annotated using the VGG Image Annotator (VIA).

    WARNING: Finding non-ripe apples in low-resolution images of orchards is a challenging task even for humans. ADD256 was annotated by a single annotator, so users of this dataset should consider it a noisy dataset.

    Data Preprocessing What preprocessing/cleaning was done? No preprocessing was applied.

    Dataset Distribution How is the dataset distributed? The dataset is available at GitHub.

    When will the dataset be released/first distributed? The dataset was released in October 2021.

    What license (if any) is it distributed under? The data is released under Creative Commons BY-NC 4.0 (Attribution-NonCommercial 4.0 International license). There is a request to cite the corresponding paper if the dataset is used. For commercial use, contact Embrapa Agricultural Informatics business office.

    Are there any fees or access/export restrictions? There are no fees or restrictions. For commercial use, contact Embrapa Agricultural Informatics business office.

    Dataset Maintenance Who is supporting/hosting/maintaining the dataset? The dataset is hosted at Embrapa Agricultural Informatics and all comments or requests can be sent to Thiago T. Santos (maintainer).

    Will the dataset be updated? There are no scheduled updates.

    If others want to extend/augment/build on this dataset, is there a mechanism for them to do so? Contributors should contact the maintainer by e-mail.

    No warranty The maintainers and their institutions are exempt from any liability, judicial or extrajudicial, for any losses or damages arising from the use of the data contained in the image database.

  7. Titanic Dataset Competition

    • kaggle.com
    Updated Dec 19, 2022
    Cite
    Cynthia Barasa (2022). Titanic Dataset Competition [Dataset]. https://www.kaggle.com/datasets/cynthycynthy/titanicdataset/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 19, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Cynthia Barasa
    Description

    The Titanic dataset is a well-known dataset that provides information on the passengers who were onboard the fateful voyage of the RMS Titanic. The data includes details such as the passenger's name, age, gender, ticket class, fare paid, and information on their family members. The dataset also includes a column called "Survived" which indicates whether a passenger survived the disaster or not.

    There are a total of 891 rows in the dataset, with 12 columns. Some of the key columns in the dataset include:

    • PassengerId: a unique identifier for each passenger
    • Survived: a binary variable that indicates whether the passenger survived (1) or did not survive (0) the disaster
    • Pclass: the ticket class of the passenger (1 = first class, 2 = second class, 3 = third class)
    • Name: the name of the passenger
    • Sex: the gender of the passenger (male or female)
    • Age: the age of the passenger (some values are missing)
    • SibSp: the number of siblings or spouses the passenger had on board
    • Parch: the number of parents or children the passenger had on board
    • Ticket: the ticket number of the passenger
    • Fare: the fare paid by the passenger
    • Cabin: the cabin number of the passenger (some values are missing)
    • Embarked: the port at which the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

    Overall, the key challenges I encountered when working on the Titanic dataset were: how to handle missing values and imbalanced classes, encode categorical variables, reduce the dimensionality of the dataset, and identify and handle noise in the data.

    Here are a few tips and resources that I found helpful when getting started in the Titanic dataset competition (a minimal end-to-end sketch follows this list):

    1. Get familiar with the dataset
    2. Pre-process the data
    3. Split the data into training and test sets
    4. Try out a few different algorithms
    5. Tune the hyperparameters
    6. Evaluate the model
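
    As an illustration of these steps (not code from the competition itself), a minimal sketch using pandas and scikit-learn; the preprocessing is deliberately simplified:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    # Steps 1-3: load, pre-process, and split the data.
    df = pd.read_csv("train.csv")
    df["Age"] = df["Age"].fillna(df["Age"].median())      # handle missing values
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})   # encode a categorical variable
    features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["Survived"], test_size=0.2, random_state=0)

    # Steps 4-6: fit one candidate algorithm and evaluate it (hyperparameter tuning omitted).
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))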

    Here are a few resources that I found helpful as I started working on the competition:

    • Kaggle's Titanic tutorial
    • scikit-learn documentation
    • Pandas documentation

  8. Aluminum alloy industrial materials defect

    • figshare.com
    zip
    Updated Dec 3, 2024
    Cite
    Ying Han; Yugang Wang (2024). Aluminum alloy industrial materials defect [Dataset]. http://doi.org/10.6084/m9.figshare.27922929.v3
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    figshare
    Authors
    Ying Han; Yugang Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this study comes from the preliminary competition dataset of the 2018 Guangdong Industrial Intelligent Manufacturing Big Data Intelligent Algorithm Competition organized by Tianchi Feiyue Cloud (https://tianchi.aliyun.com/competition/entrance/231682/introduction). We selected from this dataset, removing images that do not meet the requirements of our experiment. All datasets have been classified for training and testing. The image pixels are all 2560×1960. Before training, all defects need to be labeled using labelimg and saved as json files. Then, all json files are converted to txt files. Finally, the organized defect dataset is detected and classified.

    Description of the data and file structure

    This is a project based on a YOLOv8-enhanced algorithm for aluminum defect classification and detection tasks. All code has been tested on Windows computers with Anaconda and CUDA-enabled GPUs. The following instructions allow users to run the code in this repository on a Windows + CUDA GPU system.

    Files and variables

    File: defeat_dataset.zip

    Setup

    Please follow the steps below to set up the project.

    Download Project Repository

    1. Download the project repository defeat_dataset.zip from the following location.
    2. Unzip and navigate to the project folder; it should contain a subfolder: quexian_dataset.

    Download data

    1. Download the data: defeat_dataset.zip.
    2. Unzip the downloaded data and move the 'defeat_dataset' folder into the project's main folder.
    3. Make sure that your defeat_dataset folder now contains a subfolder: quexian_dataset.
    4. Within the folder you should find various subfolders such as addquexian-13, quexian_dataset, new_dataset-13, etc.

    Software

    Set up the Python environment:

    1. Download and install Anaconda.
    2. Once Anaconda is installed, open the Anaconda Prompt. On Windows, click Start, search for Anaconda Prompt, and open it.
    3. Create a new conda environment with Python 3.8. You can name it whatever you like, for example yolov8. Enter the following command: conda create -n yolov8 python=3.8
    4. Activate the created environment. If the name is yolov8, enter: conda activate yolov8
    5. Download and install Visual Studio Code.
    6. Install PyTorch based on your system. For Windows/Linux users with a CUDA GPU: conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
    7. Install the remaining necessary libraries:
       • Install scikit-learn with the command: conda install -c anaconda scikit-learn=0.24.1
       • Install astropy with: conda install astropy=4.2.1
       • Install pandas using: conda install -c anaconda pandas=1.2.4
       • Install Matplotlib with: conda install -c conda-forge matplotlib=3.5.3
       • Install scipy by entering: conda install scipy=1.10.1

    Repeatability

    For PyTorch, it is a well-known fact that there is no guarantee of fully reproducible results between PyTorch versions, individual commits, or different platforms. In addition, results may not be reproducible between CPU and GPU executions, even if the same seed is used. All results in the Analysis Notebook that involve only model evaluation are fully reproducible. However, when it comes to training the model on a GPU, the results obtained on different machines vary.

    Access information

    Other publicly accessible locations of the data: https://tianchi.aliyun.com/dataset/public/
    Data was derived from the following source: https://tianchi.aliyun.com/dataset/140666

    Data availability statement

    The ten defect types used in this study come from the Guangdong Industrial Wisdom Big Data Innovation Competition - Intelligent Algorithm Competition Rematch; the dataset download link is https://tianchi.aliyun.com/competition/entrance/231682/information?lang=en-us. The official website provides 4,356 images, including single-defect images, multiple-defect images and defect-free images. We selected only the single-defect and multiple-defect images, 3,233 images in total. The ten defects are non-conductive, effacement, miss bottom corner, orange peel, varicolored, jet, lacquer bubble, jump into a pit, divulge the bottom and blotch. Each image contains one or more defects, and the resolution of the defect images is 2560×1920.

    By investigating the literature, we found that most experiments were done with these 10 defect types, so we chose three additional defect types that differ from the original ten and have sufficient image counts, making them suitable for the experiments. The three newly added defect types come from the preliminary dataset of the Guangdong Industrial Wisdom Big Data Intelligent Algorithm Competition, which can be downloaded from https://tianchi.aliyun.com/dataset/140666. It contains 3,000 images in total, among which 109, 73 and 43 images show the defects bruise, camouflage and coating cracking, respectively. Finally, the 10 defect types from the rematch and the 3 defect types selected from the preliminary round are fused into a new dataset, which is examined in this study.

    In processing the dataset, we tried different division ratios, such as 8:2, 7:3 and 7:2:1. After testing, we found that the experimental results did not differ much between division ratios. Therefore, we divide the dataset according to the ratio 7:2:1: the training set accounts for 70%, the validation set for 20%, and the test set for 10%. The random number seed is set to 0 to ensure that the results are consistent every time the model is trained.

    Finally, the mean Average Precision (mAP) metric was measured on the dataset a total of three times. The results differed very little each time; for the accuracy of the experimental results, we took the average of the highest and lowest results. The highest was 71.5% and the lowest was 71.1%, resulting in an average detection accuracy of 71.3% for the final experiment.

    All data and images utilized in this research are from publicly available sources, and the original creators have given their consent for these materials to be published in open-access formats.

    The settings for the other parameters are as follows: epochs: 200, patience: 50, batch: 16, imgsz: 640, pretrained: true, optimizer: SGD, close_mosaic: 10, iou: 0.7, momentum: 0.937, weight_decay: 0.0005, box: 7.5, cls: 0.5, dfl: 1.5, pose: 12.0, kobj: 1.0, save_dir: runs/train.

    The defeat_dataset.zip is mentioned in the Supporting information section of our manuscript. The underlying data are held at Figshare, DOI: 10.6084/m9.figshare.27922929. The results_images.zip in the system contains the experimental results graphs. The images_1.zip and images_2.zip in the system contain all the images needed to generate the manuscript.tex manuscript.
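
    The parameter settings listed above correspond to YOLOv8 training arguments. A minimal sketch, assuming the ultralytics Python package and a hypothetical defect_data.yaml dataset configuration (neither is part of this record):

    from ultralytics import YOLO

    # Train a YOLOv8 model on the defect dataset with the parameter values listed above.
    # "yolov8n.pt" and "defect_data.yaml" are placeholders for the actual weights and dataset config.
    model = YOLO("yolov8n.pt")
    model.train(
        data="defect_data.yaml",
        epochs=200, patience=50, batch=16, imgsz=640,
        pretrained=True, optimizer="SGD", close_mosaic=10, iou=0.7,
        momentum=0.937, weight_decay=0.0005,
        box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0,
        project="runs", name="train",
    )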

  9. Credit Card Fraud Detection

    • test.researchdata.tuwien.ac.at
    • zenodo.org
    • +1more
    csv, json, pdf +2
    Updated Apr 28, 2025
    Cite
    Ajdina Grizhja (2025). Credit Card Fraud Detection [Dataset]. http://doi.org/10.82556/yvxj-9t22
    Explore at:
    Available download formats: text/markdown, csv, pdf, txt, json
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Ajdina Grizhja
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Below is a DMP-style description of the credit-card fraud detection experiment:

    1. Dataset Description

    Research Domain
    This work resides in the domain of financial fraud detection and applied machine learning. We focus on detecting anomalous credit‐card transactions in real time to reduce financial losses and improve trust in digital payment systems.

    Purpose
    The goal is to train and evaluate a binary classification model that flags potentially fraudulent transactions. By publishing both the code and data splits via FAIR repositories, we enable reproducible benchmarking of fraud‐detection algorithms and support future research on anomaly detection in transaction data.

    Data Sources
    We used the publicly available credit-card transaction dataset from Kaggle (original source: https://www.kaggle.com/mlg-ulb/creditcardfraud), which contains anonymized transactions made by European cardholders over two days in September 2013. The dataset includes 284,807 transactions, of which 492 are fraudulent.

    Method of Dataset Preparation

    1. Schema validation: Renamed columns to snake_case (e.g. transaction_amount, is_declined) so they conform to DBRepo’s requirements.

    2. Data import: Uploaded the full CSV into DBRepo, assigned persistent identifiers (PIDs).

    3. Splitting: Programmatically derived three subsets—training (70%), validation (15%), test (15%)—using range‐based filters on the primary key actionnr. Each subset was materialized in DBRepo and assigned its own PID for precise citation.

    4. Cleaning: Converted the categorical flags (is_declined, isforeigntransaction, ishighriskcountry, isfradulent) from “Y”/“N” to 1/0 and dropped non‐feature identifiers (actionnr, merchant_id).

    5. Modeling: Trained a RandomForest classifier on the training split, tuned it on the validation split, and evaluated it on the held-out test set (a minimal sketch follows this list).
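
    A minimal sketch of steps 4 and 5, assuming a local CSV copy of the raw table (in the project the subsets are materialized in DBRepo and retrieved via the API; the file name below is a placeholder):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report

    # Load a local copy of the raw data; the path is a placeholder.
    df = pd.read_csv("data/creditcard_raw.csv")

    # Step 4: convert the "Y"/"N" flags to 1/0 and drop non-feature identifiers.
    flag_cols = ["is_declined", "isforeigntransaction", "ishighriskcountry", "isfradulent"]
    df[flag_cols] = df[flag_cols].replace({"Y": 1, "N": 0})
    features = df.drop(columns=["actionnr", "merchant_id"])

    # Step 3 used range-based filters on actionnr in DBRepo; here we simply take the first
    # 70% of rows for training and the last 15% for testing (validation split omitted).
    n = len(features)
    train = features.iloc[: int(0.70 * n)]
    test = features.iloc[int(0.85 * n):]

    X_train, y_train = train.drop(columns=["isfradulent"]), train["isfradulent"]
    X_test, y_test = test.drop(columns=["isfradulent"]), test["isfradulent"]

    # Step 5: RandomForest classifier (validation-based tuning omitted).
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))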

    2. Technical Details

    Dataset Structure

    • The raw data is a single CSV with columns:

      • actionnr (integer transaction ID)

      • merchant_id (string)

      • average_amount_transaction_day (float)

      • transaction_amount (float)

      • is_declined, isforeigntransaction, ishighriskcountry, isfradulent (binary flags)

      • total_number_of_declines_day, daily_chargeback_avg_amt, sixmonth_avg_chbk_amt, sixmonth_chbk_freq (numeric features)

    Naming Conventions

    • All columns use lowercase snake_case.

    • Subsets are named creditcard_training, creditcard_validation, creditcard_test in DBRepo.

    • Files in the code repo follow a clear structure:

      ├── data/         # local copies only; raw data lives in DBRepo 
      ├── notebooks/Task.ipynb 
      ├── models/rf_model_v1.joblib 
      ├── outputs/        # confusion_matrix.png, roc_curve.png, predictions.csv 
      ├── README.md 
      ├── requirements.txt 
      └── codemeta.json 
      

    Required Software

    • Python 3.9+

    • pandas, numpy (data handling)

    • scikit-learn (modeling, metrics)

    • matplotlib (visualizations)

    • dbrepo‐client.py (DBRepo API)

    • requests (TU WRD API)

    Additional Resources

    3. Further Details

    Data Limitations

    • Highly imbalanced: only ~0.17% of transactions are fraudulent.

    • Anonymized PCA features (V1-V28) are hidden; we extended the data with domain features but cannot reverse-engineer the raw variables.

    • Time‐bounded: only covers two days of transactions, may not capture seasonal patterns.

    Licensing and Attribution

    • Raw data: CC-0 (per Kaggle terms)

    • Code & notebooks: MIT License

    • Model artifacts & outputs: CC-BY 4.0

    • TU WRD records include ORCID identifiers for the author.

    Recommended Uses

    • Benchmarking new fraud‐detection algorithms on a standard imbalanced dataset.

    • Educational purposes: demonstrating model‐training pipelines, FAIR data practices.

    • Extension: adding time‐series or deep‐learning models.

    Known Issues

    • Possible temporal leakage if date/time features not handled correctly.

    • Model performance may degrade on live data due to concept drift.

    • Binary flags may oversimplify nuanced transaction outcomes.

  10. Vocalizations in the plains zebra (Equus quagga)

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Jun 21, 2024
    Cite
    Bing Xie; Virgile Daunay; Troels Petersen; Elodie Briefer (2024). Vocalizations in the plains zebra (Equus quagga) [Dataset]. http://doi.org/10.5061/dryad.v9s4mw73w
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 21, 2024
    Dataset provided by
    Université Lumière Lyon 2
    University of Copenhagen
    Authors
    Bing Xie; Virgile Daunay; Troels Petersen; Elodie Briefer
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Acoustic signals are vital in animal communication, and quantifying these signals is fundamental for understanding animal behaviour and ecology. Vocalisations can be classified into acoustically and functionally or contextually distinct categories, but establishing these categories can be challenging. Newly developed methods, such as machine learning, can provide solutions for classification tasks. The plains zebra is known for its loud and specific vocalisations, yet limited knowledge exists on the structure and information content of these vocalisations. In this study, we employed both feature-based and spectrogram-based algorithms, incorporating supervised and unsupervised machine learning methods, to enhance robustness in categorising zebra vocalisation types. Additionally, we implemented a permuted discriminant function analysis (pDFA) to examine the individual identity information contained in the identified vocalisation types. The findings revealed at least four distinct vocalisation types, the 'snort', the 'soft snort', the 'squeal', and the 'quagga quagga', with individual differences observed mostly in snorts, and to a lesser extent in squeals. Analyses based on acoustic features outperformed those based on spectrograms, but each excelled in characterising different vocalisation types. We thus recommend the combined use of these two approaches. This study offers valuable insights into plains zebra vocalisation, with implications for future comprehensive explorations in animal communication.

    Methods

    Data collection and sampling

    We collected data in three locations, in Denmark and South Africa: 1) 10 months between December 2020 and July 2021 and between September and December 2021, at Pilanesberg National Park (hereafter “PNP”), South Africa, covering both dry season (i.e. from May to September) and wet season (i.e. from October to April) (1); 2) 16 days between May and June 2019, and 33 days between February and May 2022, at Knuthenborg Safari Park (hereafter “KSP”), Denmark, covering both periods before the park’s opening for tourists (i.e. from November to March) and after (i.e. from April to October); 3) 4 days in August 2019 at Givskud Zoo (hereafter “GKZ”), Denmark. For all places and periods, three types of data were collected as follows: 1) Pictures were taken of each individual from both sides using a camera (Nikon COOLPIX P950); 2) Contexts of vocal production were recorded either through notes (in the first period of KSP and in GKZ) or videos (in the second period of KSP and in PNP) filmed by a video camera recorder (Sony HDRPJ410 HD); 3) Audio recordings were collected using a directional microphone (Sennheiser MKH-70 P48, with a frequency response of 50 - 20000 Hz (+/- 2.5 dB)) linked to an audio recorder (Marantz PMD661 MKIII). Six zebras housed in GKZ were recorded while being separated from one another into three enclosures (the stable, the small enclosure and the savannah) manually by the zookeeper for management purposes, which triggered vocalisations. These vocalisations, along with other types of data, were recorded at distances of 5 - 30 m. In KSP, 15 - 18 zebras (the population changed due to newborns, deaths, or removal of adult males) were living with other herbivores in a 0.14 km2 savannah. There, we approached the zebras by driving down the road until approximately 7 - 40 m, at which point spontaneous vocalisations and other information were collected.
    This distance allowed us to collect good quality recordings without eliciting any obvious reactions from the zebras to our presence. Finally, PNP is a 580 km2 national park, with approximately 800 - 2000 zebras (2). In this park, we drove on the road and parked at distances of 10 - 80 m when encountering zebras, where all data, including spontaneous vocalisations, were recorded.

    Data processing

    Individual zebras were manually identified based on the pictures collected from KSP and GKZ (15-18 and 6 zebras, respectively). In PNP, the animals present in the pictures were individually identified using WildMe (https://zebra.wildme.org/), a web-based machine learning platform facilitating individual recognition. All zebra pictures were uploaded to the platform for a full comparison through the algorithm. The resulting matching candidates were then determined by manually reviewing the output. Audio files (sampling rate: 44100 Hz) were saved at 16-bit amplitude resolution in WAV format. We annotated zebra vocalisations, along with the context and the individuals emitting the vocalisations, using Audacity software (version 3.3.3) (3). Vocalisations were first subjectively labelled as five vocalisation types based on both audio and spectrogram examination (i.e. visual inspection) (Table 1 and Figure 1). Among these types, the “squeal-snort” was excluded from further analysis, as the focus of this study was on individual vocalisation types instead of combinations.

    Acoustic analysis

    We extracted vocalisations of good quality, defined as vocalisations with clear spectrograms, low background noise, and no overlap with other sounds, and saved them as distinct audio files. For the individual distinctiveness analysis, we excluded individuals with fewer than 5 vocalisations of each type, to avoid strong imbalance, resulting in 359 snorts from 28 individuals and 138 squeals from 14 individuals (Tables S3 and S4) (4, 5). The individuality content of quagga quagga and soft snorts could not be explored, due to insufficient individual data. For the vocal repertoire analysis, we excluded vocalisations longer than 1.25 s to improve the spectrogram-based analysis, following Thomas et al. (6). In total, we gathered 678 vocalisations for the spectrogram-based vocal repertoire analysis, including 117 quagga quagga, 204 snorts, 161 squeals and 196 soft snorts (Table S2). Among these vocalisations, six squeals were excluded from the acoustic feature-based vocal repertoire analysis, due to missing data for one of the features (amplitude modulation extent). All calls were first high-pass filtered above 30 Hz for snorts and soft snorts, above 500 Hz for squeals and above 600 Hz for quagga quagga (i.e. above the average minimum fundamental frequency of these vocalisations; Table S1). We then extracted 12 acoustic features from the vocalisations for the individual distinctiveness analysis (Table 2), using a custom script (7-10) in Praat software (11). Eight of these features were also extracted for the vocal repertoire analysis (i.e. all features except those related to the fundamental frequency, which were not available for soft snorts, which are not tonal). Additionally, to explore the vocal repertoire, mel-spectrograms were generated from the audio files using STFT, following Thomas et al. (6). Spectrograms were padded with zeros according to the length of the longest audio file to ensure uniform length for all audio files, and time-shift adjustments were implemented to align the starting points of vocalisations (6).

    Statistical analyses
    a. Vocal repertoire

    We applied both supervised and unsupervised machine learning to both acoustic features and spectrograms, using Python (version 3.9.7) (12).

    Supervised method. To define the vocal repertoire via an acoustic feature-based approach, we deployed feature importance analysis by SHapley Additive exPlanation (SHAP) (13), using the shap library (version 0.40.0) (14). Six features with SHAP value > 1 were selected (Figure S1). We split the selected features with vocalisation type labels into a training dataset (70%) and a testing dataset (30%) using the Scikit-learn library (function: train_test_split, version 0.24.2) (15). Subsequently, we employed a supervised approach, the eXtreme Gradient Boosting (XGBoost) classifier in the xgboost library (version 1.6.0) (16), to train the model. Three hyperparameters were tuned on the training dataset to reach maximum accuracy using the optuna library (direction = minimize, n_trials = 200, version 2.10.0) (17), incorporating cross-validation (five folds), which resulted in the best model (Table S5). To define the vocal repertoire via a spectrogram-based approach, we split the dataset into a training set (49%), a validation set (21%), and a test set (30%), using the Scikit-learn library (function: train_test_split, version 0.24.2) (15). We implemented a Convolutional Neural Network (CNN) architecture using the tensorflow library (version 2.8.0) (18). The architecture was constructed (Table S6) and seven hyperparameters were tuned to reach maximum accuracy on the training and validation datasets using the optuna library (direction = minimize, n_trials = 50, version 2.10.0) (17), which resulted in the best model (Table S6). We evaluated model performance for both the feature-based and spectrogram-based classification models through predictions on each test dataset, including the test accuracy across all call types (number of correct predictions / total number of predictions), and three metrics for each call type: precision (true positives / (true positives + false positives)), recall (true positives / (true positives + false negatives)) and the f1-score, the harmonic mean of precision and recall (2 × (precision × recall) / (precision + recall)) (19). We also plotted the confusion matrix between true classes and predicted classes.

    Unsupervised method. For both the acoustic feature-based and spectrogram-based analyses, we applied Uniform Manifold Approximation and Projection (UMAP) in the umap library (function: umap.UMAP, n_neighbors=200 and local_connectivity= 150 for the acoustic feature-based analysis, and metric = calc_timeshift_pad and min_dist = 0 for the spectrogram-based analysis, version 0.1.1) (20), to reduce the variables into a 2-dimensional latent space. We also implemented the k-means clustering algorithm for both analyses from the Scikit-learn library (function: kmeans.fit, version 0.24.2) (15), to identify distinct clusters using the elbow method (21). The

  11. Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and...

    • test.researchdata.tuwien.ac.at
    bin, csv, json +1
    Updated Apr 28, 2025
    Cite
    Dilara Çakmak (2025). Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis [Dataset]. http://doi.org/10.70124/f5t2d-xt904
    Explore at:
    Available download formats: csv, text/markdown, json, bin
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Dilara Çakmak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2025
    Description

    Context and Methodology

    Research Domain:
    The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

    Purpose:
    The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

    How the Dataset Was Created:
    The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data comprises multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.

    Technical Details

    Dataset Structure:

    The dataset consists of three main files, each with its specific role:

    1. Train:
      This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).

      https://handle.test.datacite.org/10.82556/yb6j-jw41
      PID: b1c59499-9c6e-42c2-af8f-840181e809db
    2. Test2:
      The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.

      https://handle.test.datacite.org/10.82556/jerg-4b84
      PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
    3. Store:
      This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.

      https://handle.test.datacite.org/10.82556/nqeg-gy34
      PID: 9627ec46-4ee6-4969-b14a-bda555fe34db

    Data Fields Description:

    • Id: A unique identifier for each (Store, Date) combination within the test set.

    • Store: A unique identifier for each store.

    • Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).

    • Customers: The number of customers visiting the store on a given day.

    • Open: An indicator of whether the store was open (1 = open, 0 = closed).

    • StateHoliday: Indicates if the day is a state holiday, with values like:

      • 'a' = public holiday,

      • 'b' = Easter holiday,

      • 'c' = Christmas,

      • '0' = no holiday.

    • SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

    • StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

    • Assortment: Describes the level of product assortment in the store:

      • 'a' = basic,

      • 'b' = extra,

      • 'c' = extended.

    • CompetitionDistance: Distance (in meters) to the nearest competitor store.

    • CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

    • Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

    • Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

    • Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

    • PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.
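
    As a quick, illustrative sketch (not part of the dataset documentation), the fields above can be turned into simple model inputs with pandas. The bracketed names are expanded to CompetitionOpenSinceMonth/Year as in the field list, and the DataFrame df is assumed to already hold the train table joined with the store table:

      import pandas as pd

      def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
          # Work on a copy so the raw table stays untouched.
          df = df.copy()
          df["Date"] = pd.to_datetime(df["Date"])
          df["Month"] = df["Date"].dt.month

          # StateHoliday mixes '0' with the letter codes a/b/c; reduce it to a flag here.
          df["IsStateHoliday"] = (df["StateHoliday"].astype(str) != "0").astype(int)

          # Months since the nearest competitor opened (NaN where the opening date is unknown).
          df["CompetitionOpenMonths"] = (
              12 * (df["Date"].dt.year - df["CompetitionOpenSinceYear"])
              + (df["Date"].dt.month - df["CompetitionOpenSinceMonth"])
          )

          # One-hot encode the categorical store descriptors.
          return pd.get_dummies(df, columns=["StoreType", "Assortment"])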

    Software Requirements

    To work with this dataset, you will need the following software and access:

    • DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

    • Python Libraries: Key libraries for working with the dataset include:

      • pandas for data manipulation,

      • numpy for numerical operations,

      • matplotlib and seaborn for data visualization,

      • scikit-learn for machine learning algorithms.
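
    As a minimal sketch of how these libraries might fit together, assuming the train and store tables have been exported from DBRepo to local CSV files (the filenames below are assumptions, not prescribed by the dataset):

      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.model_selection import train_test_split

      # Assumed local filenames for tables exported from DBRepo.
      train = pd.read_csv("train.csv", low_memory=False)
      store = pd.read_csv("store.csv")

      # Join the daily sales records with the per-store metadata.
      data = train.merge(store, on="Store", how="left")
      data = data[data["Open"] == 1]  # keep open-store days only for this sketch

      # A deliberately small feature set, just to show the mechanics.
      features = ["Store", "Promo", "SchoolHoliday", "CompetitionDistance"]
      X = data[features].fillna(0)
      y = data["Sales"]

      X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
      model = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=42)
      model.fit(X_train, y_train)
      print("Validation R^2:", model.score(X_val, y_val))

    A serious model would use the full feature set, proper handling of missing values, and a time-based validation split rather than a random one; the sketch above only shows loading, joining, and fitting.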

    Additional Resources

    Several additional resources are available for working with the dataset:

    1. Presentation:
      A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

    2. Jupyter Notebook:
      A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

    3. Model Evaluation Results:
      The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

    4. Trained Models (.pkl files):
      The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining from scratch; a brief loading sketch follows these resources.

    5. sample_submission.csv:
      This sample submission file demonstrates the format of predictions expected from the trained model. It contains predictions made on the test dataset with the trained Random Forest model and shows how the output should be structured for submission.

    These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.
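
    As a further illustrative sketch, the saved models and the submission format could be combined as below; the .pkl filename and the feature columns are assumptions that must match whatever was used when the model was trained, and the Id/Sales layout is inferred from the field descriptions above:

      import pandas as pd
      import joblib  # the .pkl files could equally be opened with the standard pickle module

      model = joblib.load("random_forest_model.pkl")  # assumed filename

      test = pd.read_csv("test.csv", low_memory=False)
      store = pd.read_csv("store.csv")
      test = test.merge(store, on="Store", how="left")

      # These columns must match the features the model was trained on.
      features = ["Store", "Promo", "SchoolHoliday", "CompetitionDistance"]
      predictions = model.predict(test[features].fillna(0))

      # Submission layout assumed from the Id and Sales field descriptions.
      submission = pd.DataFrame({"Id": test["Id"], "Sales": predictions})
      submission.to_csv("my_submission.csv", index=False)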

  12. Data from: ZooScanNet: plankton images captured with the ZooScan

    • pigma.org
    rel-canonical +2
    Updated May 6, 2025
    Cite
    Sorbonne Université, CNRS, Laboratoire d’Océanographie de Villefranche, LOV, 06230 Villefranche-sur-mer, France (2025). Sorbonne Université, CNRS, Laboratoire d’Océanographie de Villefranche, LOV, 06230 Villefranche-sur-mer, France [Dataset]. https://www.pigma.org/geonetwork/srv/api/records/seanoe:55741
    Explore at:
    www:link-1.0-http--metadata-url, www:download-1.0-link--download, rel-canonicalAvailable download formats
    Dataset updated
    May 6, 2025
    Dataset provided by
    Sorbonne Université, CNRS, Laboratoire d’Océanographie de Villefranche, LOV, 06230 Villefranche-sur-mer, France
    Time period covered
    Jan 4, 1995 - Oct 23, 2019
    Area covered
    Description

    Plankton was sampled with various nets, from bottom or 500 m depth to the surface, in many oceans of the world. Samples were imaged with a ZooScan. The full images were processed with ZooProcess, which generated regions of interest (ROIs) around each individual object and a set of associated features measured on the object (see Gorsky et al. 2010 for more information). The same objects were re-processed to compute features with the scikit-image toolbox http://scikit-image.org. The 1,451,745 resulting objects were sorted by a limited number of operators, following a common taxonomic guide, into 98 taxa, using the web application EcoTaxa http://ecotaxa.obs-vlfr.fr. For the purpose of training machine learning classifiers, the images in each class were split into training, validation, and test sets, with proportions 70%, 15% and 15%.

    The folder ZooScanNet_data.tar contains:

    taxa.csv.gz: Table of the classification of each object in the dataset, with columns:
    - objid: unique object identifier in EcoTaxa (integer number)
    - taxon_level1: taxonomic name corresponding to the level 1 classification
    - lineage_level1: taxonomic lineage corresponding to the level 1 classification
    - taxon_level2: name of the taxon corresponding to the level 2 classification
    - plankton: whether the object is plankton or not (boolean)
    - set: subset the image belongs to (train: training, val: validation, or test)
    - img_path: local path of the image corresponding to the taxon (of level 1), named according to the object id

    features_native.csv.gz: Table of metadata of each object, including the different features computed by ZooProcess. All features are computed on the object only, not the background. All area/length measures are in pixels. All grey levels are encoded in 8 bits (0 = black, 255 = white). With columns:
    - objid: unique object identifier in EcoTaxa (integer number)
    - 48 features: area, mean, stddev, mode, min/max, perim., width, height, major, minor, circ., feret, intden, median, skew, kurt, %area, area_exc, fractal, skelarea, slope, histcum1,2,3, nb1,2,3, symetrieh, symetriev, symetriehc, symetrievc, convperim, convarea, fcons, thickr, esd, elongation, range, centroids, sr, perimareaexc, feretareaexc, perimferet/perimmajor, circex, cdexc
    See the "ZooScan" sheet (OBJECT metadata, annotation and measurements) at https://doi.org/10.5281/zenodo.14704250 for definitions.

    features_skimage.csv.gz: Table of morphological features recomputed with skimage.measure.regionprops on the ROIs produced by ZooProcess. See http://scikit-image.org/docs/dev/api/skimage.measure.html#skimage.measure.regionprops for documentation.

    inventory.tsv: Tree view of the taxonomy and number of images in each taxon, displayed as text. With columns:
    - lineage_level1: taxonomic lineage corresponding to the level 1 classification
    - taxon_level1: name of the taxon corresponding to the level 1 classification
    - n: number of objects in each taxon class

     2. The second folder ZooScanNet_imgs.tar contains:

    imgs: Directory containing images of each object, named according to the object id objid and sorted in subdirectories according to their taxon.

     3. And:

    map.png: Map of the sampling locations, to give an idea of the diversity sampled in this dataset.
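
    A minimal usage sketch (an assumption about typical use, not part of the original record): after extracting ZooScanNet_data.tar, the taxa table can be read with pandas and the set column used to recover the predefined 70/15/15 split:

      import pandas as pd

      # taxa.csv.gz comes from the extracted ZooScanNet_data.tar archive.
      taxa = pd.read_csv("taxa.csv.gz")  # gzip compression is inferred from the extension

      # The `set` column encodes the predefined train/validation/test split.
      train_df = taxa[taxa["set"] == "train"]
      val_df = taxa[taxa["set"] == "val"]
      test_df = taxa[taxa["set"] == "test"]

      print(len(train_df), len(val_df), len(test_df))
      print(taxa["taxon_level1"].value_counts().head())  # most frequent level-1 taxa

      # img_path points at the corresponding image extracted from ZooScanNet_imgs.tar.
      example_paths = train_df["img_path"].head().tolist()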

  13. ZooScanNet: plankton images captured with the ZooScan

    • seanoe.org
    • pigma.org
    • +2more
    image/*
    Updated Sep 2024
    Cite
    Amanda Elineau; Corinne Desnos; Laetitia Jalabert; Marion Olivier; Jean-Baptiste Romagnan; Manoela Costa Brandao; Fabien Lombard; Natalia Llopis; Justine Courboulès; Louis Caray-Counil; Bruno Serranito; Jean-Olivier Irisson; Marc Picheral; Gaby Gorsky; Lars Stemmann (2024). ZooScanNet: plankton images captured with the ZooScan [Dataset]. http://doi.org/10.17882/55741
    Explore at:
    image/*Available download formats
    Dataset updated
    Sep 2024
    Dataset provided by
    SEANOE
    Authors
    Amanda Elineau; Corinne Desnos; Laetitia Jalabert; Marion Olivier; Jean-Baptiste Romagnan; Manoela Costa Brandao; Fabien Lombard; Natalia Llopis; Justine Courboulès; Louis Caray-Counil; Bruno Serranito; Jean-Olivier Irisson; Marc Picheral; Gaby Gorsky; Lars Stemmann
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Jan 3, 1995 - Oct 22, 2019
    Area covered
    Description

    Plankton was sampled with various nets, from bottom or 500 m depth to the surface, in many oceans of the world. Samples were imaged with a ZooScan. The full images were processed with ZooProcess, which generated regions of interest (ROIs) around each individual object and a set of associated features measured on the object (see Gorsky et al. 2010 for more information). The same objects were re-processed to compute features with the scikit-image toolbox http://scikit-image.org. The 1,451,745 resulting objects were sorted by a limited number of operators, following a common taxonomic guide, into 98 taxa, using the web application EcoTaxa http://ecotaxa.obs-vlfr.fr. For the purpose of training machine learning classifiers, the images in each class were split into training, validation, and test sets, with proportions 70%, 15% and 15%.

    The folder ZooScanNet_data.tar contains:

    taxa.csv.gz: Table of the classification of each object in the dataset, with columns objid (unique object identifier in EcoTaxa, integer number), taxon_level1 (taxonomic name corresponding to the level 1 classification), lineage_level1 (taxonomic lineage corresponding to the level 1 classification), taxon_level2 (name of the taxon corresponding to the level 2 classification), plankton (whether the object is plankton or not, boolean), set (train, val, or test), and img_path (local path of the image, named according to the object id).

    features_native.csv.gz: Table of metadata of each object, including the different features computed by ZooProcess. All features are computed on the object only, not the background. All area/length measures are in pixels. All grey levels are encoded in 8 bits (0 = black, 255 = white). Columns: objid plus 48 features (area, mean, stddev, mode, min/max, perim., width, height, major, minor, circ., feret, intden, median, skew, kurt, %area, area_exc, fractal, skelarea, slope, histcum1,2,3, nb1,2,3, symetrieh, symetriev, symetriehc, symetrievc, convperim, convarea, fcons, thickr, esd, elongation, range, centroids, sr, perimareaexc, feretareaexc, perimferet/perimmajor, circex, cdexc). See the "ZooScan" sheet (OBJECT metadata, annotation and measurements) at https://doi.org/10.5281/zenodo.14704250 for definitions.

    features_skimage.csv.gz: Table of morphological features recomputed with skimage.measure.regionprops on the ROIs produced by ZooProcess. See http://scikit-image.org/docs/dev/api/skimage.measure.html#skimage.measure.regionprops for documentation.

    inventory.tsv: Tree view of the taxonomy and number of images in each taxon, displayed as text, with columns lineage_level1, taxon_level1, and n (number of objects in each taxon class).

     2. The second folder ZooScanNet_imgs.tar contains imgs, a directory with images of each object, named according to the object id objid and sorted in subdirectories according to their taxon.

     3. And: map.png, a map of the sampling locations, to give an idea of the diversity sampled in this dataset.

  14. ZooCAMNet : plankton images captured with the ZooCAM - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Oct 24, 2024
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    Dataset updated
    Oct 24, 2024
    Description

    Plankton was sampled with a Continuous Underway Fish Egg Sampler (CUFES, 315 µm mesh size) at 4 m below the surface, and a WP2 net (200 µm mesh size) from 100 m to the surface, or 5 m above the sea floor to the surface when the depth was < 100 m, in the Bay of Biscay. The full images were processed with the ZooCAM software and the embedded Matrox Imaging Library (Colas et al., 2018), which generated regions of interest (ROIs) around each individual object and a set of features measured on the object. The same objects were re-processed to compute features with the scikit-image library http://scikit-image.org. The 1,286,590 resulting objects were sorted by a limited number of operators, following a common taxonomic guide, into 93 taxa, using the web application EcoTaxa http://ecotaxa.obs-vlfr.fr. For the purpose of training machine learning classifiers, the images in each class were split into training, validation, and test sets, with proportions 70%, 15% and 15%.

    The archive contains:

    taxa.csv.gz: Table of the classification of each object in the dataset, with columns:
    - objid: unique object identifier in EcoTaxa (integer number)
    - taxon_level1: taxonomic name corresponding to the level 1 classification
    - lineage_level1: taxonomic lineage corresponding to the level 1 classification
    - taxon_level2: name of the taxon corresponding to the level 2 classification
    - plankton: whether the object is plankton or not (boolean)
    - set: subset the image belongs to (train: training, val: validation, or test)
    - img_path: local path of the image corresponding to the taxon (of level 1), named according to the object id

    features_native.csv.gz: Table of morphological features computed by ZooCAM. All features are computed on the object only, not the background. All area/length measures are in pixels. All grey levels are encoded in 8 bits (0 = black, 255 = white). With columns:
    - area: object's surface
    - area_exc: object surface excluding white pixels
    - area_based_diameter: object's area-based diameter: 2 * (object_area/pi)^(1/2)
    - meangreyobjet: mean image grey level
    - modegreyobjet: modal object grey level
    - sigmagrey: object grey level standard deviation
    - mingrey: minimum object grey level
    - maxgrey: maximum object grey level
    - sumgrey: object grey level integrated density: object_mean * object_area
    - breadth: breadth of the object along the best fitting ellipsoid minor axis
    - length: breadth of the object along the best fitting ellipsoid major axis
    - elongation: elongation index: object_length/object_breadth
    - perim: object's perimeter
    - minferetdiam: minimum object's Feret diameter
    - maxferetdiam: maximum object's Feret diameter
    - meanferetdiam: average object's Feret diameter
    - feretelongation: elongation index: object_maxferetdiam/object_minferetdiam
    - compactness: isoperimetric quotient: the ratio of the object's area to the area of a circle having the same perimeter
    - intercept0, intercept45, intercept90, intercept135: the number of times that a transition from background to foreground occurs at the angles 0°, 45°, 90° and 135° for the entire object
    - convexhullarea: area of the convex hull of the object
    - convexhullfillratio: ratio object_area/convexhullarea
    - convexperimeter: perimeter of the convex hull of the object
    - n_number_of_runs: number of horizontal strings of consecutive foreground pixels in the object
    - n_chained_pixels: number of chained pixels in the object
    - n_convex_hull_points: number of summits of the object's convex hull polygon
    - n_number_of_holes: number of holes (as closed white pixel areas) in the object
    - roughness: measure of small-scale variations of amplitude in the object's grey levels
    - rectangularity: ratio of the object's area over its best bounding rectangle's area
    - skewness: skewness of the object's grey level distribution
    - kurtosis: kurtosis of the object's grey level distribution
    - fractal_box: fractal dimension of the object's perimeter
    - hist25, hist50, hist75: grey level value at quantiles 0.25, 0.5 and 0.75 of the object's normalized cumulative grey level histogram
    - valhist25, valhist50, valhist75: sum of grey levels at quantiles 0.25, 0.5 and 0.75 of the object's normalized cumulative grey level histogram
    - nobj25, nobj50, nobj75: number of objects after thresholding at the object_valhist25, object_valhist50 and object_valhist75 grey levels
    - symetrieh: index of horizontal symmetry
    - symetriev: index of vertical symmetry
    - skelarea: area of the object skeleton
    - thick_r: maximum object's thickness / mean object's thickness
    - cdist: distance between the mass and the grey level object's centroids

    features_skimage.csv.gz: Table of morphological features recomputed with skimage.measure.regionprops on the ROIs produced by ZooCAM. See http://scikit-image.org/docs/dev/api/skimage.measure.html#skimage.measure.regionprops for documentation.

    inventory.tsv: Tree view of the taxonomy and number of images in each taxon, displayed as text. With columns:
    - lineage_level1: taxonomic lineage corresponding to the level 1 classification
    - taxon_level1: name of the taxon corresponding to the level 1 classification
    - n: number of objects in each taxon group

    map.png: Map of the sampling locations, to give an idea of the diversity sampled in this dataset.

    imgs: Directory containing images of each object, named according to the object id objid and sorted in subdirectories according to their taxon.

  15. UVP6Net : plankton images captured with the UVP6

    • seanoe.org
    bin, image/*
    Updated Sep 2024
    Cite
    Marc Picheral; Laetitia Jalabert; Solène Motreuil; Lucas Courchet; Louis Carray-Counil; Florian Ricour; Thelma Panaiotis; Flavien Petit; Amanda Elineau (2024). UVP6Net : plankton images captured with the UVP6 [Dataset]. http://doi.org/10.17882/101948
    Explore at:
    image/*, binAvailable download formats
    Dataset updated
    Sep 2024
    Dataset provided by
    SEANOE
    Authors
    Marc Picheral; Laetitia Jalabert; Solène Motreuil; Lucas Courchet; Louis Carray-Counil; Florian Ricour; Thelma Panaiotis; Flavien Petit; Amanda Elineau
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Oct 11, 2019 - Jan 19, 2022
    Area covered
    Description

    Plankton was imaged with the UVP6 in contrasted oceanic regions. The full images were processed by the UVP6 firmware and the regions of interest (ROIs) around each individual object were recorded. A set of associated features was measured on the objects (see Picheral et al. 2021, doi:10.1002/lom3.10475, for more information). All objects were classified by a limited number of operators into 110 different classes using the web application EcoTaxa (http://ecotaxa.obs-vlfr.fr). The following dataset corresponds to the 634,459 objects that have an area greater than 73 pixels (equivalent spherical diameter of 9.8 pixels, corresponding to the default size limit of 620 µm in the UVP6 configuration). The different files provide information about the features of the objects and their taxonomic identification, as well as the raw images. For the purpose of training machine learning classifiers, the images in each class were split into training, validation, and test sets, with proportions 70%, 15% and 15%.

    An additional folder is provided, which includes the subset of images used to train the unique embedded classification model of the UVP6 actually deployed on the NKE CTS5 floats (10.5281/zenodo.10694203). These images correspond to UVP6Net objects filtered to retain only those with a size of 79 pixels, to fit with the 645 µm class from EcoPart, resulting in a total of 595,595 objects. The taxonomic identification was also made coarser (from 110 classes to 20) to ensure adequate performance of the classification model on power-constrained hardware. Images in this subset display objects as shades of grey/white on a black background.

    The folder UVP6Net_data.tar contains:

    taxa.csv.gz: Table of the classification of each object in the dataset, with columns:
    - objid: unique object identifier in EcoTaxa (integer number)
    - taxon_level1: taxonomic name corresponding to the level 1 classification
    - lineage_level1: taxonomic lineage corresponding to the level 1 classification
    - taxon_level2: name of the taxon corresponding to the level 2 classification
    - plankton: whether the object is plankton or not (boolean)
    - set: subset the image belongs to (train: training, val: validation, or test)
    - img_path: local path of the image corresponding to the taxon (of level 1), named according to the object id

    features_native.csv.gz: Table of metadata of each object, including the different features computed by the UVPapp application. All features are computed on the object only, excluding the background. All area/length measures are in pixels. All grey levels are encoded in 8 bits (0 = black, 255 = white). With columns objid (unique object identifier in EcoTaxa, integer number) and 62 features: area, mean, stddev, mode, min, max, perim, width, height, major, minor, angle, circ, feret, intden, median, skew, kurt, %area, area_exc, fractal, skelarea, slope, histcum1,2,3, nb1, nb2, nb3, symetrieh, symetriev, symetriehc, symetrievc, convperim, convarea, fcons, thickr, elongation, range, meanpos, cv, sr, perimareaexc, feretareaexc, perimferet, perimmajor, circex, kurt_mean, skew_mean, convperim_perim, convarea_area, symetrieh_area, symetriev_area, nb1, nb2, nb3_area, nb1, nb2, nb3_range, median_mean/median_mean_range, skeleton_area. See the object measurements at https://doi.org/10.5281/zenodo.14704250 for definitions.

    features_skimage.csv.gz: Table of morphological features recomputed with skimage.measure.regionprops on the ROIs produced by the UVP6 firmware. See http://scikit-image.org/docs/dev/api/skimage.measure.html#skimage.measure.regionprops for documentation.

    inventory.tsv: Tree view of the taxonomy and number of images in each taxon, displayed as text. With columns:
    - lineage_level1: taxonomic lineage corresponding to the level 1 classification
    - taxon_level1: name of the taxon corresponding to the level 1 classification
    - n: number of objects in each class

     2. The second folder UVP6Net_imgs.tar contains imgs: images of each object, named according to the object id objid and sorted in subdirectories according to their taxon.

     3. The last folder uvpec_imgs.tar contains imgs: images of each object on a black background, stored in the format required to train an embedded classifier with the uvpec package (https://github.com/ecotaxa/uvpec); i.e. each image is stored as "objid.jpg" in folders corresponding to their taxon (20 different classes), named "taxon_name_taxon_id".

     4. And: map.png, a map of the sampling locations, to give an idea of the diversity sampled in this dataset.
