25 datasets found
  1. h

    cdg-AICourse-Level2-Sklearn

    • huggingface.co
    Updated Apr 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Cahlen Humphreys (2025). cdg-AICourse-Level2-Sklearn [Dataset]. https://huggingface.co/datasets/cahlen/cdg-AICourse-Level2-Sklearn
    Explore at:
    Dataset updated
    Apr 16, 2025
    Authors
    Cahlen Humphreys
    License

    https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/

    Description

    Professor Enfuse & Learner: Comparing supervised and unsupervised learning strategies for real-world applications - Generated by Conversation Dataset Generator

    This dataset was generated using the Conversation Dataset Generator script available at https://cahlen.github.io/conversation-dataset-generator/.

      Generation Parameters
    

    Number of Conversations Requested: 500 Number of Conversations Successfully Generated: 500 Total Turns: 6659 Model ID:… See the full description on the dataset page: https://huggingface.co/datasets/cahlen/cdg-AICourse-Level2-Sklearn.

  2. f

    CK4Gen, High Utility Synthetic Survival Datasets

    • figshare.com
    zip
    Updated Nov 5, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nicholas Kuo (2024). CK4Gen, High Utility Synthetic Survival Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.27611388.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 5, 2024
    Dataset provided by
    figshare
    Authors
    Nicholas Kuo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ===###Overview:This repository provides high-utility synthetic survival datasets generated using the CK4Gen framework, optimised to retain critical clinical characteristics for use in research and educational settings. Each dataset is based on a carefully curated ground truth dataset, processed with standardised variable definitions and analytical approaches, ensuring a consistent baseline for survival analysis.###===###Description:The repository includes synthetic versions of four widely utilised and publicly accessible survival analysis datasets, each anchored in foundational studies and aligned with established ground truth variations to support robust clinical research and training.#---GBSG2: Based on Schumacher et al. [1]. The study evaluated the effects of hormonal treatment and chemotherapy duration in node-positive breast cancer patients, tracking recurrence-free and overall survival among 686 women over a median of 5 years. Our synthetic version is derived from a variation of the GBSG2 dataset available in the lifelines package [2], formatted to match the descriptions in Sauerbrei et al. [3], which we treat as the ground truth.ACTG320: Based on Hammer et al. [4]. The study investigates the impact of adding the protease inhibitor indinavir to a standard two-drug regimen for HIV-1 treatment. The original clinical trial involved 1,151 patients with prior zidovudine exposure and low CD4 cell counts, tracking outcomes over a median follow-up of 38 weeks. Our synthetic dataset is derived from a variation of the ACTG320 dataset available in the sksurv package [5], which we treat as the ground truth dataset.WHAS500: Based on Goldberg et al. [6]. The study follows 500 patients to investigate survival rates following acute myocardial infarction (MI), capturing a range of factors influencing MI incidence and outcomes. Our synthetic data replicates a ground truth variation from the sksurv package, which we treat as the ground truth dataset.FLChain: Based on Dispenzieri et al. [7]. The study assesses the prognostic relevance of serum immunoglobulin free light chains (FLCs) for overall survival in a large cohort of 15,859 participants. Our synthetic version is based on a variation available in the sksurv package, which we treat as the ground truth dataset.###===###Notes:Please find an in-depth discussion on these datasets, as well as their generation process, in the link below, to our paper:https://arxiv.org/abs/2410.16872Kuo, et al. "CK4Gen: A Knowledge Distillation Framework for Generating High-Utility Synthetic Survival Datasets in Healthcare." arXiv preprint arXiv:2410.16872 (2024).###===###References:[1]: Schumacher, et al. “Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German breast cancer study group.”, Journal of Clinical Oncology, 1994.[2]: Davidson-Pilon “lifelines: Survival Analysis in Python”, Journal of Open Source Software, 2019.[3]: Sauerbrei, et al. “Modelling the effects of standard prognostic factors in node-positive breast cancer”, British Journal of Cancer, 1999.[4]: Hammer, et al. “A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and cd4 cell counts of 200 per cubic millimeter or less”, New England Journal of Medicine, 1997.[5]: Pölsterl “scikit-survival: A library for time-to-event analysis built on top of scikit-learn”, Journal of Machine Learning Research, 2020.[6]: Goldberg, et al. “Incidence and case fatality rates of acute myocardial infarction (1975–1984): the Worcester heart attack study”, American Heart Journal, 1988.[7]: Dispenzieri, et al. “Use of nonclonal serum immunoglobulin free light chains to predict overall survival in the general population”, in Mayo Clinic Proceedings, 2012.

  3. h

    iris

    • huggingface.co
    Updated Sep 23, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    scikit-learn (2022). iris [Dataset]. https://huggingface.co/datasets/scikit-learn/iris
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 23, 2022
    Dataset authored and provided by
    scikit-learn
    License

    https://choosealicense.com/licenses/cc0-1.0/https://choosealicense.com/licenses/cc0-1.0/

    Description

    Iris Species Dataset

    The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository. It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other. The dataset is taken from UCI Machine Learning Repository's… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/iris.

  4. Additional file 2 of seqQscorer: automated quality control of...

    • springernature.figshare.com
    xlsx
    Updated Jun 11, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Steffen Albrecht; Maximilian Sprang; Miguel A. Andrade-Navarro; Jean-Fred Fontaine (2023). Additional file 2 of seqQscorer: automated quality control of next-generation sequencing data using machine learning [Dataset]. http://doi.org/10.6084/m9.figshare.14173803.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Steffen Albrecht; Maximilian Sprang; Miguel A. Andrade-Navarro; Jean-Fred Fontaine
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 2. Table of grid search parameters. Parameters are relevant to a scikit-learn implementation.

  5. XIMAGENET-12: An Explainable AI Benchmark CVPR2024

    • kaggle.com
    Updated Sep 13, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anomly (2023). XIMAGENET-12: An Explainable AI Benchmark CVPR2024 [Dataset]. http://doi.org/10.34740/kaggle/ds/3123294
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 13, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Anomly
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Introduction:

    XimageNet-12https://qiangli.de/imgs/flowchart2%20(1).png">

    🌟 XimageNet-12 🌟

    An Explainable Visual Benchmark Dataset for Robustness Evaluation. A Dataset for Image Background Exploration!

    Blur Background, Segmented Background, AI-generated Background, Bias of Tools During Annotation, Color in Background, Random Background with Real Environment

    +⭐ Follow Authors for project updates.

    Website: XimageNet-12

    Here, we trying to understand how image background effect the Computer Vision ML model, on topics such as Detection and Classification, based on baseline Li et.al work on ICLR 2022: Explainable AI: Object Recognition With Help From Background, we are now trying to enlarge the dataset, and analysis the following topics: Blur Background / Segmented Background / AI generated Background/ Bias of tools during annotation/ Color in Background / Dependent Factor in Background/ LatenSpace Distance of Foreground/ Random Background with Real Environment! Ultimately, we also define the math equation of Robustness Scores! So if you feel interested How would we make it or join this research project? please feel free to collaborate with us!

    In this paper, we propose an explainable visual dataset, XIMAGENET-12, to evaluate the robustness of visual models. XIMAGENET-12 consists of over 200K images with 15,410 manual semantic annotations. Specifically, we deliberately selected 12 categories from ImageNet, representing objects commonly encountered in practical life. To simulate real-world situations, we incorporated six diverse scenarios, such as overexposure, blurring, and color changes, etc. We further develop a quantitative criterion for robustness assessment, allowing for a nuanced understanding of how visual models perform under varying conditions, notably in relation to the background.

    Progress:

    • Blur Background-> Done! You can find the image Generated in the corresponding folder!
    • Segmented Background -> Done! you can download the image and its corresponding transparent mask image!
    • Color in Background->Done!~~ you can now download the image with different background color modified, and play with different color-ed images!
    • Random Background with Real Environment -> Done! you can also find we generated the image with the photographer's real image as a background and removed the original background of the target object, but similar to the style!
    • Bias of tools during annotation->Done! for this one, you won't get a new image, because this is about math and statistics data analysis when different tools and annotators are applied!
    • AI generated Background-> current on progress ( 12 /12) Done!, So basically you can find one sample folder image we uploaded, please take a look at how real it is, and guess what LLM model we are using to generate the high-resolution background to make it so real :)

    What tool we used to generate those images?

    We employed a combination of tools and methodologies to generate the images in this dataset, ensuring both efficiency and quality in the annotation and synthesis processes.

    • IoG Net: Initially, we utilized the IoG Net, which played a foundational role in our image generation pipeline.
    • Polygon Faster Labeling Tool: To facilitate the annotation process, we developed a custom Polygon Faster Labeling Tool, streamlining the labeling of objects within the images.AnyLabeling Open-source Project: We also experimented with the AnyLabeling open-source project, exploring its potential for our annotation needs.
    • V7 Lab Tool: Eventually, we found that the V7 Lab Tool provided the most efficient labeling speed and delivered high-quality annotations. As a result, we standardized the annotation process using this tool.
    • Data Augmentation: For the synthesis of synthetic images, we relied on a combination of deep learning frameworks, including scikit-learn and OpenCV. These tools allowed us to augment and manipulate images effectively to create a diverse range of backgrounds and variations.
    • GenAI: Our dataset includes images generated using the Stable Diffusion XL model, along with versions 1.5 and 2.0 of the Stable Diffusion model. These generative models played a pivotal role in crafting realistic and varied backgrounds.

    For a detailed breakdown of our prompt engineering and hyperparameters, we invite you to consult our upcoming paper. This publication will provide comprehensive insights into our methodologies, enabling a deeper understanding of the image generation process.

    How to use our dataset?

    this dataset has been/could be downloaded via Kaggl...

  6. Additional file 3 of seqQscorer: automated quality control of...

    • springernature.figshare.com
    xlsx
    Updated Jun 12, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Steffen Albrecht; Maximilian Sprang; Miguel A. Andrade-Navarro; Jean-Fred Fontaine (2023). Additional file 3 of seqQscorer: automated quality control of next-generation sequencing data using machine learning [Dataset]. http://doi.org/10.6084/m9.figshare.14173806.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 12, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Steffen Albrecht; Maximilian Sprang; Miguel A. Andrade-Navarro; Jean-Fred Fontaine
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 3. Table of tuned models used by seqQscorer. Dataset: species-subset-layout, generic: all data. Feature sets: RAW (raw data), MAP (genome mapping), LOC (genomic localization), TSS (transcription start sites profile). Feature Selection: method-percentage (percentage of retained features), chi-square (chi2), recursive feature elimination (RFE). Algorithm Parameters: relevant to a scikit-learn implementation.

  7. u

    CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost...

    • observatorio-cientifico.ua.es
    • scidb.cn
    Updated 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Marquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Pina-Navarro, Monica; Gomez-Donoso, Francisco; Cazorla, Miguel; Marquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Pina-Navarro, Monica; Gomez-Donoso, Francisco; Cazorla, Miguel (2025). CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors [Dataset]. https://observatorio-cientifico.ua.es/documentos/668fc49bb9e7c03b01be251c
    Explore at:
    Dataset updated
    2025
    Authors
    Marquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Pina-Navarro, Monica; Gomez-Donoso, Francisco; Cazorla, Miguel; Marquez-Carpintero, Luis; Suescun-Ferrandiz, Sergio; Pina-Navarro, Monica; Gomez-Donoso, Francisco; Cazorla, Miguel
    Description

    Data DescriptionThe CADDI dataset is designed to support research in in-class activity recognition using IMU data from low-cost sensors. It provides multimodal data capturing 19 different activities performed by 12 participants in a classroom environment, utilizing both IMU sensors from a Samsung Galaxy Watch 5 and synchronized stereo camera images. This dataset enables the development and validation of activity recognition models using sensor fusion techniques.Data Generation ProceduresThe data collection process involved recording both continuous and instantaneous activities that typically occur in a classroom setting. The activities were captured using a custom setup, which included:A Samsung Galaxy Watch 5 to collect accelerometer, gyroscope, and rotation vector data at 100Hz.A ZED stereo camera capturing 1080p images at 25-30 fps.A synchronized computer acting as a data hub, receiving IMU data and storing images in real-time.A D-Link DSR-1000AC router for wireless communication between the smartwatch and the computer.Participants were instructed to arrange their workspace as they would in a real classroom, including a laptop, notebook, pens, and a backpack. Data collection was performed under realistic conditions, ensuring that activities were captured naturally.Temporal and Spatial ScopeThe dataset contains a total of 472.03 minutes of recorded data.The IMU sensors operate at 100Hz, while the stereo camera captures images at 25-30Hz.Data was collected from 12 participants, each performing all 19 activities multiple times.The geographical scope of data collection was Alicante, Spain, under controlled indoor conditions.Dataset ComponentsThe dataset is organized into JSON and PNG files, structured hierarchically:IMU Data: Stored in JSON files, containing:Samsung Linear Acceleration Sensor (X, Y, Z values, 100Hz)LSM6DSO Gyroscope (X, Y, Z values, 100Hz)Samsung Rotation Vector (X, Y, Z, W quaternion values, 100Hz)Samsung HR Sensor (heart rate, 1Hz)OPT3007 Light Sensor (ambient light levels, 5Hz)Stereo Camera Images: High-resolution 1920×1080 PNG files from left and right cameras.Synchronization: Each IMU data record and image is timestamped for precise alignment.Data StructureThe dataset is divided into continuous and instantaneous activities:Continuous Activities (e.g., typing, writing, drawing) were recorded for 210 seconds, with the central 200 seconds retained.Instantaneous Activities (e.g., raising a hand, drinking) were repeated 20 times per participant, with data captured only during execution.The dataset is structured as:/continuous/subject_id/activity_name/ /camera_a/ → Left camera images /camera_b/ → Right camera images /sensors/ → JSON files with IMU data

    /instantaneous/subject_id/activity_name/repetition_id/ /camera_a/ /camera_b/ /sensors/ Data Quality & Missing DataThe smartwatch buffers 100 readings per second before sending them, ensuring minimal data loss.Synchronization latency between the smartwatch and the computer is negligible.Not all IMU samples have corresponding images due to different recording rates.Outliers and anomalies were handled by discarding incomplete sequences at the start and end of continuous activities.Error Ranges & LimitationsSensor data may contain noise due to minor hand movements.The heart rate sensor operates at 1Hz, limiting its temporal resolution.Camera exposure settings were automatically adjusted, which may introduce slight variations in lighting.File Formats & Software CompatibilityIMU data is stored in JSON format, readable with Python’s json library.Images are in PNG format, compatible with all standard image processing tools.Recommended libraries for data analysis:Python: numpy, pandas, scikit-learn, tensorflow, pytorchVisualization: matplotlib, seabornDeep Learning: Keras, PyTorchPotential ApplicationsDevelopment of activity recognition models in educational settings.Study of student engagement based on movement patterns.Investigation of sensor fusion techniques combining visual and IMU data.This dataset represents a unique contribution to activity recognition research, providing rich multimodal data for developing robust models in real-world educational environments.CitationIf you find this project helpful for your research, please cite our work using the following bibtex entry:@misc{marquezcarpintero2025caddiinclassactivitydetection, title={CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors}, author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Monica Pina-Navarro and Miguel Cazorla and Francisco Gomez-Donoso}, year={2025}, eprint={2503.02853}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2503.02853}, }

  8. t

    FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    csv, json, bin, pngAvailable download formats
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need VS Code or Jupyter, which could include tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  9. P

    Embrapa ADD 256 Dataset

    • paperswithcode.com
    Updated Oct 15, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2021). Embrapa ADD 256 Dataset [Dataset]. https://paperswithcode.com/dataset/embrapa-add-256
    Explore at:
    Dataset updated
    Oct 15, 2021
    Description

    This is a detailed description of the dataset, a data sheet for the dataset as proposed by Gebru et al.

    Motivation for Dataset Creation Why was the dataset created? Embrapa ADD 256 (Apples by Drones Detection Dataset — 256 × 256) was created to provide images and annotation for research on *apple detection in orchards for UAV-based monitoring in apple production.

    What (other) tasks could the dataset be used for? Apple detection in low-resolution scenarios, similar to the aerial images employed here.

    Who funded the creation of the dataset? The building of the ADD256 dataset was supported by the Embrapa SEG Project 01.14.09.001.05.04, Image-based metrology for Precision Agriculture and Phenotyping, and FAPESP under grant (2017/19282-7).

    Dataset Composition What are the instances? Each instance consists of an RGB image and an annotation describing apples locations as circular markers (i.e., presenting center and radius).

    How many instances of each type are there? The dataset consists of 1,139 images containing 2,471 apples.

    What data does each instance consist of? Each instance contains an 8-bits RGB image. Its corresponding annotation is found in the JSON files: each apple marker is composed by its center (cx, cy) and its radius (in pixels), as seen below:

    "gebler-003-06.jpg": [ { "cx": 116, "cy": 117, "r": 10 }, { "cx": 134, "cy": 113, "r": 10 }, { "cx": 221, "cy": 95, "r": 11 }, { "cx": 206, "cy": 61, "r": 11 }, { "cx": 92, "cy": 1, "r": 10 } ],

    Dataset.ipynb is a Jupyter Notebook presenting a code example for reading the data as a PyTorch's Dataset (it should be straightforward to adapt the code for other frameworks as Keras/TensorFlow, fastai/PyTorch, Scikit-learn, etc.)

    Is everything included or does the data rely on external resources? Everything is included in the dataset.

    Are there recommended data splits or evaluation measures? The dataset comes with specified train/test splits. The splits are found in lists stored as JSON files.

    | | Number of images | Number of annotated apples | | --- | --- | --- | |Training | 1,025 | 2,204 | |Test | 114 | 267 | |Total | 1,139 | 2,471 |

    Dataset recommended split.

    Standard measures from the information retrieval and computer vision literature should be employed: precision and recall, F1-score and average precision as seen in COCO and Pascal VOC.

    What experiments were initially run on this dataset? The first experiments run on this dataset are described in A methodology for detection and location of fruits in apples orchards from aerial images by Santos & Gebler (2021).

    Data Collection Process How was the data collected? The data employed in the development of the methodology came from two plots located at the Embrapa’s Temperate Climate Fruit Growing Experimental Station at Vacaria-RS (28°30’58.2”S, 50°52’52.2”W). Plants of the varieties Fuji and Gala are present in the dataset, in equal proportions. The images were taken during December 13, 2018, by an UAV (DJI Phantom 4 Pro) that flew over the rows of the field at a height of 12 m. The images mix nadir and non-nadir views, allowing a more extensive view of the canopies. A subset from the images was random selected and 256 × 256 pixels patches were extracted.

    Who was involved in the data collection process? T. T. Santos and L. Gebler captured the images in field. T. T. Santos performed the annotation.

    How was the data associated with each instance acquired? The circular markers were annotated using the VGG Image Annotator (VIA).

    WARNING: Find non-ripe apples in low-resolution images of orchards is a challenging task even for humans. ADD256 was annotated by a single annotator. So, users of this dataset should consider it a noisy dataset.

    Data Preprocessing What preprocessing/cleaning was done? No preprocessing was applied.

    Dataset Distribution How is the dataset distributed? The dataset is available at GitHub.

    When will the dataset be released/first distributed? The dataset was released in October 2021.

    What license (if any) is it distributed under? The data is released under Creative Commons BY-NC 4.0 (Attribution-NonCommercial 4.0 International license). There is a request to cite the corresponding paper if the dataset is used. For commercial use, contact Embrapa Agricultural Informatics business office.

    Are there any fees or access/export restrictions? There are no fees or restrictions. For commercial use, contact Embrapa Agricultural Informatics business office.

    Dataset Maintenance Who is supporting/hosting/maintaining the dataset? The dataset is hosted at Embrapa Agricultural Informatics and all comments or requests can be sent to Thiago T. Santos (maintainer).

    Will the dataset be updated? There is no scheduled updates.

    If others want to extend/augment/build on this dataset, is there a mechanism for them to do so? Contributors should contact the maintainer by e-mail.

    No warranty The maintainers and their institutions are exempt from any liability, judicial or extrajudicial, for any losses or damages arising from the use of the data contained in the image database.

  10. t

    Tour Recommendation Model

    • test.researchdata.tuwien.ac.at
    bin, png +1
    Updated May 14, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar (2025). Tour Recommendation Model [Dataset]. http://doi.org/10.70124/akpf6-8p175
    Explore at:
    bin, png, text/markdownAvailable download formats
    Dataset updated
    May 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar; Muhammad Mobeel Akbar
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Dataset Description for Tour Recommendation Model

    Context and Methodology:

    • Research Domain/Project:
      This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.

    • Purpose:
      The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.

    • Creation Methodology:
      The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.

    Technical Details:

    • Structure of the Dataset:
      The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:

      • place_or_event_id: Unique identifier for each tourist place or event.

      • rating: Rating given by the user, ranging from 1 to 5.

      The data is split into three subsets:

      • Training Set: 80% of the dataset used to train the model.

      • Validation Set: A small portion used for hyperparameter tuning.

      • Test Set: 20% used to evaluate model performance.

    • Folder and File Naming Conventions:
      The dataset files are stored in the following structure:

      • user_ratings_dataset.csv: The original dataset file containing user ratings.

      • tour_recommendation_model.pkl: The saved model after training.

      • actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.

    • Software Requirements:
      To open and work with this dataset, the following software and libraries are required:

      • Python 3.x

      • Pandas for data manipulation

      • Scikit-learn for training and evaluating machine learning models

      • Matplotlib for chart generation

      • Joblib for saving and loading the trained model

      The dataset can be opened and processed using any Python environment that supports these libraries.

    • Additional Resources:

      • The model training code, README file, and performance chart are available in the project repository.

      • For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).

    Further Details:

    • Dataset Reusability:
      The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:

      • Train other types of models (e.g., regression, classification).

      • Experiment with different features or add more metadata to enrich the dataset.

    • Data Integrity:
      The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.

    • Licensing:
      The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.

  11. Additional file 5 of seqQscorer: automated quality control of...

    • springernature.figshare.com
    xlsx
    Updated Jun 12, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Steffen Albrecht; Maximilian Sprang; Miguel A. Andrade-Navarro; Jean-Fred Fontaine (2023). Additional file 5 of seqQscorer: automated quality control of next-generation sequencing data using machine learning [Dataset]. http://doi.org/10.6084/m9.figshare.14173812.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 12, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Steffen Albrecht; Maximilian Sprang; Miguel A. Andrade-Navarro; Jean-Fred Fontaine
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 5. Table of models tuned using the grid search. Dataset: species-subset-layout, generic: all data. Feature sets: RAW (raw data), MAP (genome mapping), LOC (genomic localization), TSS (transcription start sites profile). Feature Selection: method-percentage (percentage of retained features), chi-square (chi2), recursive feature elimination (RFE). Algorithm Parameters: relevant to a scikit-learn implementation.

  12. Data Science And Ml Platforms Market Report | Global Forecast From 2025 To...

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dataintelo (2025). Data Science And Ml Platforms Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/data-science-and-ml-platforms-market
    Explore at:
    csv, pptx, pdfAvailable download formats
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Science And ML Platforms Market Outlook



    The global market size for Data Science and ML Platforms was estimated to be approximately USD 78.9 billion in 2023, and it is projected to reach around USD 307.6 billion by 2032, growing at a Compound Annual Growth Rate (CAGR) of 16.4% during the forecast period. This remarkable growth can be largely attributed to the increasing adoption of artificial intelligence (AI) and machine learning (ML) across various industries to enhance operational efficiency, predictive analytics, and decision-making processes.



    The surge in big data and the necessity to make sense of unstructured data is a substantial growth driver for the Data Science and ML Platforms market. Organizations are increasingly leveraging data science and machine learning to gain insights that can help them stay competitive. This is especially true in sectors like retail and e-commerce where customer behavior analytics can lead to more targeted marketing strategies, personalized shopping experiences, and improved customer retention rates. Additionally, the proliferation of IoT devices is generating massive amounts of data, which further fuels the need for advanced data analytics platforms.



    Another significant growth factor is the increasing adoption of cloud-based solutions. Cloud platforms offer scalable resources, flexibility, and substantial cost savings, making them attractive for enterprises of all sizes. Cloud-based data science and machine learning platforms also facilitate collaboration among distributed teams, enabling more efficient workflows and faster time-to-market for new products and services. Furthermore, advancements in cloud technologies, such as serverless computing and containerization, are making it easier for organizations to deploy and manage their data science models.



    Investment in AI and ML by key industry players also plays a crucial role in market growth. Tech giants like Google, Amazon, Microsoft, and IBM are making substantial investments in developing advanced AI and ML tools and platforms. These investments are not only driving innovation but also making these technologies more accessible to smaller enterprises. Additionally, mergers and acquisitions in this space are leading to more integrated and comprehensive solutions, which are further accelerating market growth.



    Machine Learning Tools are at the heart of this technological evolution, providing the necessary frameworks and libraries that empower developers and data scientists to create sophisticated models and algorithms. These tools, such as TensorFlow, PyTorch, and Scikit-learn, offer a range of functionalities from data preprocessing to model deployment, catering to both beginners and experts. The accessibility and versatility of these tools have democratized machine learning, enabling a wider audience to harness the power of AI. As organizations continue to embrace digital transformation, the demand for robust machine learning tools is expected to grow, driving further innovation and development in this space.



    From a regional perspective, North America is expected to hold the largest market share due to the early adoption of advanced technologies and the presence of major market players. However, the Asia Pacific region is anticipated to exhibit the highest growth rate during the forecast period. This is driven by increasing investments in AI and ML, a burgeoning start-up ecosystem, and supportive government policies aimed at digital transformation. Countries like China, India, and Japan are at the forefront of this growth, making significant strides in AI research and application.



    Component Analysis



    When analyzing the Data Science and ML Platforms market by component, it's essential to differentiate between software and services. The software segment includes platforms and tools designed for data ingestion, processing, visualization, and model building. These software solutions are crucial for organizations looking to harness the power of big data and machine learning. They provide the necessary infrastructure for data scientists to develop, test, and deploy ML models. The software segment is expected to grow significantly due to ongoing advancements in AI algorithms and the increasing need for more sophisticated data analysis tools.



    The services segment in the Data Science and ML Platforms market encompasses consulting, system integration, and support services. Consulting services help organizatio

  13. Dataset and software for processing of hyperspectral images of different CDW...

    • zenodo.org
    zip
    Updated Dec 8, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Václav Nežerka; Václav Nežerka (2024). Dataset and software for processing of hyperspectral images of different CDW materials [Dataset]. http://doi.org/10.5281/zenodo.14294262
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 8, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Václav Nežerka; Václav Nežerka
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Sep 25, 2024
    Description

    Overview

    The provided scripts are designed to process hyperspectral images of construction and demolition waste (CDW) materials, extract relevant features, and train a machine-learning model for material classification. The scripts perform the following tasks:

    1. Feature Extraction: Extract spectral features from hyperspectral data.
    2. Background Removal and Subset Extraction: Remove backgrounds from images and extract subsets for analysis.
    3. Data Visualization: Generate plots to visualize the extracted features and reflectance curves.
    4. Machine Learning Model Training: Using the extracted features, train and evaluate a multilayer perceptron (MLP) classifier.

    Prerequisites

    Before running the scripts, ensure that you have the following:

    • Python 3.x installed on your system.
    • Required Python packages:
      • numpy
      • matplotlib
      • scipy
      • pandas
      • scikit-learn
      • seaborn
      • rembg (for background removal)
      • Pillow (PIL)
    • Hyperspectral data files in .mat format containing calibrated hyperspectral cubes and wavelength information.
    • A directory structure to organize input and output files as described in each script.

    Scripts Description

    1. hyperspectral_features_v2.py

    Purpose

    This script processes individual hyperspectral image files to extract spectral features from a central subset of the image. It generates RGB images from the hyperspectral data, plots the mean reflectance spectra, and outputs a LaTeX-formatted table containing the extracted features.

    Functionality

    • Loading Data: Reads .mat files containing hyperspectral data from a specified input directory.
    • Feature Calculation:
      • Calculates mean reflectance within a central window of the image.
      • Extracts spectral features such as peak wavelength and area under the reflectance curve.
      • Records reflectance values at selected wavelengths, including standard RGB channels and additional wavelengths.
    • RGB Image Generation: Creates RGB images using specific wavelengths corresponding to the red, green, and blue channels.
    • Spectra Plotting: Plots the mean reflectance spectra for each sample.
    • LaTeX Table Generation: Produces a LaTeX-formatted table of the extracted features for inclusion in a report or paper.

    Usage Instructions

    1. Prepare Input Data:

      • Place your .mat files containing the hyperspectral data in the appropriate input directory (e.g., input/mortar).
    2. Run the Script:

      • Modify the materials list at the end of the script to include the materials you want to process (e.g., materials = ['mortar']).
      • Execute the script:
        bash
  14. S

    Data from: Prediction of magnetic Janus materials based on machine learning...

    • scidb.cn
    Updated Nov 4, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zhang Qiao; Tan Wei; Ning Yongqi; Nie Guozheng; Cai Mengqiu; Wang Junnian; Zhu Huiping; Zhao Yuqing (2024). Prediction of magnetic Janus materials based on machine learning and first-principles calculations [Dataset]. http://doi.org/10.57760/sciencedb.j00213.00072
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Zhang Qiao; Tan Wei; Ning Yongqi; Nie Guozheng; Cai Mengqiu; Wang Junnian; Zhu Huiping; Zhao Yuqing
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The dataset in this study was processed and visualized using the pandas, matplotlib, and matminer libraries within the Scikit-learn framework. This dataset primarily includes data from the processes of machine learning model training, high-throughput material generation and screening, and first-principles calculation validation. The dataset consists of two main files: data_set and data of supplementary materials. The data_set file contains the original data (janus.csv), feature-engineered original data (janus_featured.csv), element-substituted data (prd_janus.csv), feature-engineered element-substituted data (prd_janus_featured.csv), and the model screening results (stable_high_magnetization.csv). The data of supplementary materials file includes Figure S1 and Table S1. Figure S1 presents an analysis of feature importance for lattice constants a = b, lattice constant c, formation energy, and magnetic moment categories during model training. Table S1 provides the formation energy and magnetic moment obtained from direct static self-consistent calculations for 13 unoptimized Janus structures.(1) janus.csv is the original dataset obtained from the Materials Project database, containing information on the chemical composition (elements and stoichiometry), crystal space group, lattice constants, formation energy, and total magnetic moment of 1,179 two-dimensional hexagonal ABC-type Janus materials.(2) janus_featured.csv is the dataset obtained by applying feature engineering based on elemental composition information to the original dataset.(3) prd_janus.csv is a dataset of 82,018 ABC-type two-dimensional Janus materials, not yet experimentally synthesized, generated by random substitution of elements A, B, and C from the periodic table based on the two-dimensional hexagonal ABC-type Janus structures in the original dataset.(4) prd_janus_featured.csv is the feature-engineered dataset of the element-substituted materials.(5) stable_high_magnetization.csv is the dataset obtained by applying a trained machine learning model to the feature-engineered element-substituted data, containing 4,204 Janus structures with lattice information, thermal stability, and high magnetic moment.

  15. P

    EDGE-IIOTSET Dataset

    • paperswithcode.com
    Updated Oct 16, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2023). EDGE-IIOTSET Dataset [Dataset]. https://paperswithcode.com/dataset/edge-iiotset
    Explore at:
    Dataset updated
    Oct 16, 2023
    Description

    ABSTRACT In this project, we propose a new comprehensive realistic cyber security dataset of IoT and IIoT applications, called Edge-IIoTset, which can be used by machine learning-based intrusion detection systems in two different modes, namely, centralized and federated learning. Specifically, the proposed testbed is organized into seven layers, including, Cloud Computing Layer, Network Functions Virtualization Layer, Blockchain Network Layer, Fog Computing Layer, Software-Defined Networking Layer, Edge Computing Layer, and IoT and IIoT Perception Layer. In each layer, we propose new emerging technologies that satisfy the key requirements of IoT and IIoT applications, such as, ThingsBoard IoT platform, OPNFV platform, Hyperledger Sawtooth, Digital twin, ONOS SDN controller, Mosquitto MQTT brokers, Modbus TCP/IP, ...etc. The IoT data are generated from various IoT devices (more than 10 types) such as Low-cost digital sensors for sensing temperature and humidity, Ultrasonic sensor, Water level detection sensor, pH Sensor Meter, Soil Moisture sensor, Heart Rate Sensor, Flame Sensor, ...etc.). However, we identify and analyze fourteen attacks related to IoT and IIoT connectivity protocols, which are categorized into five threats, including, DoS/DDoS attacks, Information gathering, Man in the middle attacks, Injection attacks, and Malware attacks. In addition, we extract features obtained from different sources, including alerts, system resources, logs, network traffic, and propose new 61 features with high correlations from 1176 found features. After processing and analyzing the proposed realistic cyber security dataset, we provide a primary exploratory data analysis and evaluate the performance of machine learning approaches (i.e., traditional machine learning as well as deep learning) in both centralized and federated learning modes.

    Instructions:

    Great news! The Edge-IIoT dataset has been featured as a "Document in the top 1% of Web of Science." This indicates that it is ranked within the top 1% of all publications indexed by the Web of Science (WoS) in terms of citations and impact.

    Please kindly visit kaggle link for the updates: https://www.kaggle.com/datasets/mohamedamineferrag/edgeiiotset-cyber-sec...

    Free use of the Edge-IIoTset dataset for academic research purposes is hereby granted in perpetuity. Use for commercial purposes is allowable after asking the leader author, Dr Mohamed Amine Ferrag, who has asserted his right under the Copyright.

    The details of the Edge-IIoT dataset were published in following the paper. For the academic/public use of these datasets, the authors have to cities the following paper:

    Mohamed Amine Ferrag, Othmane Friha, Djallel Hamouda, Leandros Maglaras, Helge Janicke, "Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT and IIoT Applications for Centralized and Federated Learning", IEEE Access, April 2022 (IF: 3.37), DOI: 10.1109/ACCESS.2022.3165809

    Link to paper : https://ieeexplore.ieee.org/document/9751703

    The directories of the Edge-IIoTset dataset include the following:

    •File 1 (Normal traffic)

    -File 1.1 (Distance): This file includes two documents, namely, Distance.csv and Distance.pcap. The IoT sensor (Ultrasonic sensor) is used to capture the IoT data.

    -File 1.2 (Flame_Sensor): This file includes two documents, namely, Flame_Sensor.csv and Flame_Sensor.pcap. The IoT sensor (Flame Sensor) is used to capture the IoT data.

    -File 1.3 (Heart_Rate): This file includes two documents, namely, Flame_Sensor.csv and Flame_Sensor.pcap. The IoT sensor (Flame Sensor) is used to capture the IoT data.

    -File 1.4 (IR_Receiver): This file includes two documents, namely, IR_Receiver.csv and IR_Receiver.pcap. The IoT sensor (IR (Infrared) Receiver Sensor) is used to capture the IoT data.

    -File 1.5 (Modbus): This file includes two documents, namely, Modbus.csv and Modbus.pcap. The IoT sensor (Modbus Sensor) is used to capture the IoT data.

    -File 1.6 (phValue): This file includes two documents, namely, phValue.csv and phValue.pcap. The IoT sensor (pH-sensor PH-4502C) is used to capture the IoT data.

    -File 1.7 (Soil_Moisture): This file includes two documents, namely, Soil_Moisture.csv and Soil_Moisture.pcap. The IoT sensor (Soil Moisture Sensor v1.2) is used to capture the IoT data.

    -File 1.8 (Sound_Sensor): This file includes two documents, namely, Sound_Sensor.csv and Sound_Sensor.pcap. The IoT sensor (LM393 Sound Detection Sensor) is used to capture the IoT data.

    -File 1.9 (Temperature_and_Humidity): This file includes two documents, namely, Temperature_and_Humidity.csv and Temperature_and_Humidity.pcap. The IoT sensor (DHT11 Sensor) is used to capture the IoT data.

    -File 1.10 (Water_Level): This file includes two documents, namely, Water_Level.csv and Water_Level.pcap. The IoT sensor (Water sensor) is used to capture the IoT data.

    •File 2 (Attack traffic):

    -File 2.1 (Attack traffic (CSV files)): This file includes 13 documents, namely, Backdoor_attack.csv, DDoS_HTTP_Flood_attack.csv, DDoS_ICMP_Flood_attack.csv, DDoS_TCP_SYN_Flood_attack.csv, DDoS_UDP_Flood_attack.csv, MITM_attack.csv, OS_Fingerprinting_attack.csv, Password_attack.csv, Port_Scanning_attack.csv, Ransomware_attack.csv, SQL_injection_attack.csv, Uploading_attack.csv, Vulnerability_scanner_attack.csv, XSS_attack.csv. Each document is specific for each attack.

    -File 2.2 (Attack traffic (PCAP files)): This file includes 13 documents, namely, Backdoor_attack.pcap, DDoS_HTTP_Flood_attack.pcap, DDoS_ICMP_Flood_attack.pcap, DDoS_TCP_SYN_Flood_attack.pcap, DDoS_UDP_Flood_attack.pcap, MITM_attack.pcap, OS_Fingerprinting_attack.pcap, Password_attack.pcap, Port_Scanning_attack.pcap, Ransomware_attack.pcap, SQL_injection_attack.pcap, Uploading_attack.pcap, Vulnerability_scanner_attack.pcap, XSS_attack.pcap. Each document is specific for each attack.

    •File 3 (Selected dataset for ML and DL):

    -File 3.1 (DNN-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating deep learning-based intrusion detection systems.

    -File 3.2 (ML-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating traditional machine learning-based intrusion detection systems.

    Step 1: Downloading The Edge-IIoTset dataset From the Kaggle platform from google.colab import files

    !pip install -q kaggle

    files.upload()

    !mkdir ~/.kaggle

    !cp kaggle.json ~/.kaggle/

    !chmod 600 ~/.kaggle/kaggle.json

    !kaggle datasets download -d mohamedamineferrag/edgeiiotset-cyber-security-dataset-of-iot-iiot -f "Edge-IIoTset dataset/Selected dataset for ML and DL/DNN-EdgeIIoT-dataset.csv"

    !unzip DNN-EdgeIIoT-dataset.csv.zip

    !rm DNN-EdgeIIoT-dataset.csv.zip

    Step 2: Reading the Datasets' CSV file to a Pandas DataFrame: import pandas as pd

    import numpy as np

    df = pd.read_csv('DNN-EdgeIIoT-dataset.csv', low_memory=False)

    Step 3 : Exploring some of the DataFrame's contents: df.head(5)

    print(df['Attack_type'].value_counts())

    Step 4: Dropping data (Columns, duplicated rows, NAN, Null..): from sklearn.utils import shuffle

    drop_columns = ["frame.time", "ip.src_host", "ip.dst_host", "arp.src.proto_ipv4","arp.dst.proto_ipv4",

     "http.file_data","http.request.full_uri","icmp.transmit_timestamp",
    
     "http.request.uri.query", "tcp.options","tcp.payload","tcp.srcport",
    
     "tcp.dstport", "udp.port", "mqtt.msg"]
    

    df.drop(drop_columns, axis=1, inplace=True)

    df.dropna(axis=0, how='any', inplace=True)

    df.drop_duplicates(subset=None, keep="first", inplace=True)

    df = shuffle(df)

    df.isna().sum()

    print(df['Attack_type'].value_counts())

    Step 5: Categorical data encoding (Dummy Encoding): import numpy as np

    from sklearn.model_selection import train_test_split

    from sklearn.preprocessing import StandardScaler

    from sklearn import preprocessing

    def encode_text_dummy(df, name):

    dummies = pd.get_dummies(df[name])

    for x in dummies.columns:

    dummy_name = f"{name}-{x}"
    
    df[dummy_name] = dummies[x]
    

    df.drop(name, axis=1, inplace=True)

    encode_text_dummy(df,'http.request.method')

    encode_text_dummy(df,'http.referer')

    encode_text_dummy(df,"http.request.version")

    encode_text_dummy(df,"dns.qry.name.len")

    encode_text_dummy(df,"mqtt.conack.flags")

    encode_text_dummy(df,"mqtt.protoname")

    encode_text_dummy(df,"mqtt.topic")

    Step 6: Creation of the preprocessed dataset df.to_csv('preprocessed_DNN.csv', encoding='utf-8')

    For more information about the dataset, please contact the lead author of this project, Dr Mohamed Amine Ferrag, on his email: mohamed.amine.ferrag@gmail.com

    More information about Dr. Mohamed Amine Ferrag is available at:

    https://www.linkedin.com/in/Mohamed-Amine-Ferrag

    https://dblp.uni-trier.de/pid/142/9937.html

    https://www.researchgate.net/profile/Mohamed_Amine_Ferrag

    https://scholar.google.fr/citations?user=IkPeqxMAAAAJ&hl=fr&oi=ao

    https://www.scopus.com/authid/detail.uri?authorId=56115001200

    https://publons.com/researcher/1322865/mohamed-amine-ferrag/

    https://orcid.org/0000-0002-0632-3172

    Last Updated: 27 Mar. 2023

  16. t

    Privacy-Sensitive Conversations between Care Workers and Care Home Residents...

    • researchdata.tuwien.ac.at
    • test.researchdata.tuwien.ac.at
    bin, text/markdown
    Updated Feb 25, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Reinhard Grabler; Reinhard Grabler; Michael Starzinger; Michael Starzinger; Matthias Hirschmanner; Matthias Hirschmanner; Helena Anna Frijns; Helena Anna Frijns (2025). Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home [Dataset]. http://doi.org/10.48436/q1kt0-edc53
    Explore at:
    bin, text/markdownAvailable download formats
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    TU Wien
    Authors
    Reinhard Grabler; Reinhard Grabler; Michael Starzinger; Michael Starzinger; Matthias Hirschmanner; Matthias Hirschmanner; Helena Anna Frijns; Helena Anna Frijns
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2024 - Aug 2024
    Description

    Dataset Card for "privacy-care-interactions"

    Table of Contents

    Dataset Description

    Purpose and Features

    🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in an Residential Care Home 🔒

    The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.

    Dataset Overview

    Language Distribution 🌍

    • English (en): 95

    Locale Distribution 🌎

    • United States (US) 🇺🇸: 95

    Key Facts 🔑

    • This is synthetic data! Generated using proprietary algorithms - no privacy violations!
    • Conversations are classified following the taxonomy for privacy-sensitive robotics by Rueben et al. (2017).
    • The data was manually labeled by an expert.

    Dataset Structure

    Data Instances

    The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.

    { "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }

    Data Fields

    The data fields are:

    • text: a string feature. The abbreviaton of the speakers refer to the care worker (CW) and the care recipient (CR).
    • taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.
    • category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.
    • affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.
    • language: a string feature. Language code as defined by ISO 639.
    • locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.
    • data_type: a string a classification label, with possible values including real (0), synthetic (1).
    • uid: a int64 feature. A unique identifier within the dataset.
    • split: a string feature. Either train, validation or test.

    Dataset Splits

    The dataset has 2 subsets:

    • split: with a total of 95 examples split into train, validation and test (70%-15%-15%)
    • unsplit: with a total of 95 examples in a single train split
    nametrainvalidationtest
    split661415
    unsplit95n/an/a

    The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:

    • split-train-en.jsonl
    • split-validation-en.jsonl
    • split-test-en.jsonl
    • unsplit-train-en.jsonl

    Dataset Creation

    Curation Rationale

    Recording audio of care workers and residents during care interactions, which includes partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, a dataset is created, which includes privacy-sensitive parts of conversations, synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, to further mask them to protect privacy.

    Source Data

    Initial Data Collection

    The intial data was collected in the project Caring Robots of TU Wien in cooperation with Caritas Wien. One project track aims to facilitate Large Languge Models (LLM) to support documentation of care workers, with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.

    Data Processing

    The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked using qualitative data analysis software by two experts. Subsequently, the sections were translated from German to U.S. English using the locally executed LLM icky/translate. In the next step, another llama3.1:70b was used locally to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from the <a href="https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html" target="_blank" rel="noopener

  17. E-commerce Sales Prediction Dataset

    • kaggle.com
    Updated Dec 14, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nevil Dhinoja (2024). E-commerce Sales Prediction Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/10197264
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 14, 2024
    Dataset provided by
    Kaggle
    Authors
    Nevil Dhinoja
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    E-commerce Sales Prediction Dataset

    This repository contains a comprehensive and clean dataset for predicting e-commerce sales, tailored for data scientists, machine learning enthusiasts, and researchers. The dataset is crafted to analyze sales trends, optimize pricing strategies, and develop predictive models for sales forecasting.

    📂 Dataset Overview

    The dataset includes 1,000 records across the following features:

    Column NameDescription
    DateThe date of the sale (01-01-2023 onward).
    Product_CategoryCategory of the product (e.g., Electronics, Sports, Other).
    PricePrice of the product (numerical).
    DiscountDiscount applied to the product (numerical).
    Customer_SegmentBuyer segment (e.g., Regular, Occasional, Other).
    Marketing_SpendMarketing budget allocated for sales (numerical).
    Units_SoldNumber of units sold per transaction (numerical).

    📊 Data Summary

    General Properties

    Date: - Range: 01-01-2023 to 12-31-2023. - Contains 1,000 unique values without missing data.

    Product_Category: - Categories: Electronics (21%), Sports (21%), Other (58%). - Most common category: Electronics (21%).

    Price: - Range: From 244 to 999. - Mean: 505, Standard Deviation: 290. - Most common price range: 14.59 - 113.07.

    Discount: - Range: From 0.01% to 49.92%. - Mean: 24.9%, Standard Deviation: 14.4%. - Most common discount range: 0.01 - 5.00%.

    Customer_Segment: - Segments: Regular (35%), Occasional (34%), Other (31%). - Most common segment: Regular.

    Marketing_Spend: - Range: From 2.41k to 10k. - Mean: 4.91k, Standard Deviation: 2.84k.

    Units_Sold: - Range: From 5 to 57. - Mean: 29.6, Standard Deviation: 7.26. - Most common range: 24 - 34 units sold.

    📈 Data Visualizations

    The dataset is suitable for creating the following visualizations: - 1. Price Distribution: Histogram to show the spread of prices. - 2. Discount Distribution: Histogram to analyze promotional offers. - 3. Marketing Spend Distribution: Histogram to understand marketing investment patterns. - 4. Customer Segment Distribution: Bar plot of customer segments. - 5. Price vs Units Sold: Scatter plot to show pricing effects on sales. - 6. Discount vs Units Sold: Scatter plot to explore the impact of discounts. - 7. Marketing Spend vs Units Sold: Scatter plot for marketing effectiveness. - 8. Correlation Heatmap: Identify relationships between features. - 9. Pairplot: Visualize pairwise feature interactions.

    💡 How the Data Was Created

    The dataset is synthetically generated to mimic realistic e-commerce sales trends. Below are the steps taken for data generation:

    1. Feature Engineering:

      • Identified key attributes such as product category, price, discount, and marketing spend, typically observed in e-commerce data.
      • Generated dependent features like units sold based on logical relationships.
    2. Data Simulation:

      • Python Libraries: Used NumPy and Pandas to generate and distribute values.
      • Statistical Modeling: Ensured feature distributions aligned with real-world sales data patterns.
    3. Validation:

      • Verified data consistency with no missing or invalid values.
      • Ensured logical correlations (e.g., higher discounts → increased units sold).

    Note: The dataset is synthetic and not sourced from any real-world e-commerce platform.

    🛠 Example Usage: Sales Prediction Model

    Here’s an example of building a predictive model using Linear Regression:

    Written in python

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score
    
    # Load the dataset
    df = pd.read_csv('ecommerce_sales.csv')
    
    # Feature selection
    X = df[['Price', 'Discount', 'Marketing_Spend']]
    y = df['Units_Sold']
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Model training
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    
    # Evaluation
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    print(f'Mean Squared Error: {mse:.2f}')
    print(f'R-squared: {r2:.2f}')
    
  18. t

    Rainfall Prediction: Comparison of 7 Popular Models

    • test.researchdata.tuwien.ac.at
    bin, png +1
    Updated Apr 28, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kaya Ali Kus; Kaya Ali Kus (2025). Rainfall Prediction: Comparison of 7 Popular Models [Dataset]. http://doi.org/10.70124/p7rh4-0g783
    Explore at:
    png, text/markdown, binAvailable download formats
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Kaya Ali Kus; Kaya Ali Kus
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Rainfall Prediction using 7 Popular Models

    Context and Methodology

    Research Domain/Project:

    This dataset is part of a machine learning project focused on predicting rainfall, a critical task for sectors like agriculture, water resource management, and disaster prevention. The project employs machine learning algorithms to forecast rainfall occurrences based on historical weather data, including features like temperature, humidity, and pressure.

    Purpose:

    The primary goal of the dataset is to train multiple machine learning models to predict rainfall and compare their performances. The insights gained will help identify the most accurate models for real-world predictions of rainfall events.

    Creation Process:

    The dataset is derived from various historical weather observations, including temperature, humidity, wind speed, and pressure, collected by weather stations across Australia. These observations are used as inputs for training machine learning models. The dataset is publicly available on platforms like Kaggle and is often used in competitions and research to advance predictive analytics in meteorology.

    Technical Details


    Dataset Structure:

    The dataset consists of weather data from multiple Australian weather stations, spanning various time periods. Key features include:

    Temperature
    Humidity
    Wind Speed
    Pressure
    Rainfall (target variable)
    These features are tracked for each weather station over different times, with the goal of predicting rainfall.

    Software Requirements:

    Python: The primary programming language for data analysis and machine learning.
    scikit-learn: For implementing machine learning models.
    XGBoost, LightGBM, and CatBoost: Popular libraries for building more advanced ensemble models.
    Matplotlib/Seaborn: For data visualization.
    These libraries and tools help in data manipulation, modeling, evaluation, and visualization of results.
    DBRepo Authorization: Required to access datasets via the DBRepo API for dataset retrieval.

    Additional Resources

    Model Comparison Charts: The project includes output charts comparing the performance of seven popular machine learning models.
    Trained Models (.pkl files): Pre-trained models are saved as .pkl files for reuse without retraining.
    Documentation and Code: A Jupyter notebook guides through the process of data analysis, model training, and evaluation.

  19. Additional file 4 of seqQscorer: automated quality control of...

    • springernature.figshare.com
    xlsx
    Updated Jun 11, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Steffen Albrecht; Maximilian Sprang; Miguel A. Andrade-Navarro; Jean-Fred Fontaine (2023). Additional file 4 of seqQscorer: automated quality control of next-generation sequencing data using machine learning [Dataset]. http://doi.org/10.6084/m9.figshare.14173809.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Steffen Albrecht; Maximilian Sprang; Miguel A. Andrade-Navarro; Jean-Fred Fontaine
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 4. Table of optimal models used by seqQscorer. Optimal models are the best models for each data subset (Dataset column). Dataset: species-subset-layout, generic: all data. Feature sets: RAW (raw data), MAP (genome mapping), LOC (genomic localization), TSS (transcription start sites profile). Feature Selection: method-percentage (percentage of retained features), chi-square (chi2), recursive feature elimination (RFE). Algorithm Parameters: relevant to a scikit-learn implementation.

  20. 4

    Supporting code and data for: Dynamic capacity balancing in urban airspace:...

    • data.4tu.nl
    zip
    Updated Oct 30, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Andres Morfin Veytia; Joost Ellerbroek; Jacco Hoekstra (2024). Supporting code and data for: Dynamic capacity balancing in urban airspace: comparing historical and real-time aggregate flow data [Dataset]. http://doi.org/10.4121/c3a4912b-b248-4ad2-8c73-15b505f4287f.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Oct 30, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Andres Morfin Veytia; Joost Ellerbroek; Jacco Hoekstra
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Rotterdam, the Netherlands
    Description

    This repository contains some supporting data and code for the 2024 SESAR Innovation Days Conference paper "Dynamic capacity balancing in urban airspace: comparing historical and real-time aggregate flow data."


    Note that this is a continuation of the work published here:


    Morfin Veytia, Andres; Ellerbroek, Joost; Hoekstra, Jacco (2024): Supporting data and code for Decentralised Traffic Management for Constrained Urban Airspace: Dynamically Generating and Acting Upon Aggregate Flow Data. Version 2. 4TU.ResearchData. dataset. https://doi.org/10.4121/54825f14-8743-447d-8346-3afa46d319d6.v2


    Therefore, much of the data and code is similar. However, this work provides some additional code and scenarios.


    The main components need to reproduce the results are:


    1. BlueSky simulator code to reproduce the simulations.
    2. Post processing code for the scenarios to generate the plots in the paper and logs.
    3. Voronoi creation code
    4. Python environment description.


    1. BlueSky Simulator code

    This includes the BlueSky code for simulating the scenarios. This is the bluesky.zip folder. Note that the code provided is a condensed version of the one in https://github.com/amorfinv/bluesky/tree/rotterdam. The plugins and scenarios are also provided in the simulator code. The plugins are based on those based in the following repository, https://github.com/amorfinv/bluesky_plugins.


    Refer to the README.md file provided to learn how to run the scenarios. Also, make sure to install a compatible python environment.


    2. Post-processing code, plots, and logs

    This includes the code to generate the plots seen in the paper and the logs of the simulations. It also includes some additional plots not shown in the paper. Read the README.md file for recreating the plots. This information can be found in main_experiment_results.zip. Some of the logs come from the previous work. The previous logs are labelled as real-time data labelling in this paper.


    3. Voronoi creation code

    The file called generate_voronois.zip includes the code to generate the voronoi code used for the historical data concept in the paper. Note that to generate the voronoi a more recent version of geopandas is necessary, so a different python environment is required. All you need is python=3.12, geopandas=1.0.1 and scikit-learn=1.5.1.


    6. Python environment description

    This includes the python environment used to simulate, post-process, and generate the plots. This work used conda environments. The main packages used are those required by BlueSky in addition to geopandas, osmnx, and seaborn. Note that the voronoi creation requires a different python environment.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Cahlen Humphreys (2025). cdg-AICourse-Level2-Sklearn [Dataset]. https://huggingface.co/datasets/cahlen/cdg-AICourse-Level2-Sklearn

cdg-AICourse-Level2-Sklearn

cahlen/cdg-AICourse-Level2-Sklearn

Explore at:
Dataset updated
Apr 16, 2025
Authors
Cahlen Humphreys
License

https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/

Description

Professor Enfuse & Learner: Comparing supervised and unsupervised learning strategies for real-world applications - Generated by Conversation Dataset Generator

This dataset was generated using the Conversation Dataset Generator script available at https://cahlen.github.io/conversation-dataset-generator/.

  Generation Parameters

Number of Conversations Requested: 500 Number of Conversations Successfully Generated: 500 Total Turns: 6659 Model ID:… See the full description on the dataset page: https://huggingface.co/datasets/cahlen/cdg-AICourse-Level2-Sklearn.

Search
Clear search
Close search
Google apps
Main menu