55 datasets found
  1. Distributed Training with Kubeflow

    • kaggle.com
    Updated Jul 30, 2021
    Cite
    Camelia Ben Laamari (2021). Distributed Training with Kubeflow [Dataset]. https://www.kaggle.com/cameliabenlaamari/distributed-training-with-kubeflow
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 30, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Camelia Ben Laamari
    Description

    Dataset

    This dataset was created by Camelia Ben Laamari

    Contents

  2. NVIDIA Apex

    • kaggle.com
    zip
    Updated Apr 14, 2020
    Cite
    Kiran Kunapuli (2020). NVIDIA Apex [Dataset]. https://www.kaggle.com/kirankunapuli/nvidia-apex
    Explore at:
    Available download formats: zip (548658 bytes)
    Dataset updated
    Apr 14, 2020
    Authors
    Kiran Kunapuli
    Description

    How to use

    Add this dataset to your notebook, then execute the following command in a new cell:

    !cd ../input/nvidia-apex && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

    Context

    A PyTorch extension: tools for easy mixed precision and distributed training in PyTorch.

    Content

    https://github.com/NVIDIA/apex/blob/master/README.md, as of 14th April 2020.

    Acknowledgements

    NVIDIA Apex Photo by Cas Magee on Unsplash License

  3. CYBRIA - Federated Learning Network Security - IoT

    • kaggle.com
    zip
    Updated Apr 22, 2024
    Cite
    ptdevsecops (2024). CYBRIA - Federated Learning Network Security - IoT [Dataset]. https://www.kaggle.com/datasets/ptdevsecops/cybria-federated-learning-network-security-iot
    Explore at:
    Available download formats: zip (6873653 bytes)
    Dataset updated
    Apr 22, 2024
    Authors
    ptdevsecops
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    CYBRIA - Pioneering Federated Learning for Privacy-Aware Cybersecurity with Brilliance

    This research studies a federated learning framework for collaborative cyber threat detection without compromising confidential data. The decentralized approach trains models on local data distributed across clients and shares only intermediate model updates to generate an integrated global model.

    If you use this dataset and code, or any modified part of it, in any publication, please cite this paper:

    P. Thantharate and A. T, "CYBRIA - Pioneering Federated Learning for Privacy-Aware Cybersecurity with Brilliance," 2023 IEEE 20th International Conference on Smart Communities: Improving Quality of Life using AI, Robotics and IoT (HONET), Boca Raton, FL, USA, 2023, pp. 56-61, doi: 10.1109/HONET59747.2023.10374608.

    For any questions and research queries, please reach out via email.

    Key Objectives

    • Develop a federated learning framework called Cybria for collaborative cyber threat detection without compromising confidential data
    • Evaluate model performance for intrusion detection using the Bot-IoT dataset

    Proposed Solutions

    • Designed a privacy-preserving federated learning architecture tailored for cybersecurity applications
    • Implemented the Cybria model using TensorFlow Federated and Flower libraries
    • Employed a decentralized approach where models are trained locally on clients and only model updates are shared
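
The "only model updates are shared" step can be illustrated with a sample-weighted average of client weight vectors (the widely used FedAvg rule). This is a minimal sketch under stated assumptions: the listing does not say which aggregation rule Cybria uses, and the function name `federated_average` and all numbers below are hypothetical.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Combine client weight vectors, weighting each by its local sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    global_weights = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            global_weights[i] += w * size / total
    return global_weights


# Hypothetical updates from three clients; raw data never leaves a client.
updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [100, 100, 200]
global_model = federated_average(updates, sizes)
print(global_model)
```

Clients with more local samples pull the global model further toward their own weights, which is the usual reason for weighting by sample count.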

    Simulated Results

    • Cybria's federated model achieves 89.6% accuracy for intrusion detection compared to 81.4% for a centralized DNN
    • The federated approach shows 8-10% better performance, demonstrating benefits of collaborative yet decentralized learning
    • Local models allow specialized learning tuned to each client's data characteristics
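
The "8-10%" range can be read as the absolute and the relative gain between the two reported accuracies. A quick check, using only the two figures quoted above (the variable names are illustrative):

```python
federated_acc = 89.6    # reported accuracy of the federated model, in percent
centralized_acc = 81.4  # reported accuracy of the centralized DNN, in percent

# Absolute gain in percentage points, and gain relative to the baseline.
absolute_gain = federated_acc - centralized_acc
relative_gain = absolute_gain / centralized_acc * 100

print(round(absolute_gain, 1), round(relative_gain, 1))
```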

    Conclusion

    • Preliminary results validate the potential of federated learning to enhance cyber threat detection accuracy in a privacy-preserving manner
    • Detailed studies are needed to optimize model architectures, hyperparameters, and federation strategies for large real-world deployments
    • The approach helps enable an ecosystem for collective security knowledge without increasing data centralization risks

    References

    The implementation would follow the details provided in the original research paper: P. Thantharate and A. T, "CYBRIA - Pioneering Federated Learning for Privacy-Aware Cybersecurity with Brilliance," 2023 IEEE 20th International Conference on Smart Communities: Improving Quality of Life using AI, Robotics and IoT (HONET), Boca Raton, FL, USA, 2023, pp. 56-61, doi: 10.1109/HONET59747.2023.10374608.

    Any additional external libraries or sources used would be properly cited.

    Tags - Federated learning, privacy-preserving machine learning, collaborative cyber threat detection, decentralized model training, intermediate model updates, integrated global model, cybersecurity, data privacy, distributed computing, secure aggregation, model personalization, adversarial attacks, anomaly detection, network traffic analysis, malware classification, intrusion prevention, threat intelligence, edge computing, data minimization, differential privacy.

  4. Distributed peer review anonymized dataset

    • kaggle.com
    zip
    Updated May 5, 2021
    Cite
    Sirisha Siri (2021). Distributed peer review anonymized dataset [Dataset]. https://www.kaggle.com/ishadss/distributed-peer-review-anonymized-dataset
    Explore at:
    Available download formats: zip (30930 bytes)
    Dataset updated
    May 5, 2021
    Authors
    Sirisha Siri
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Context

    While ancient scientists often had patrons to fund their work, peer review of proposals for the allocation of resources is a foundation of modern science.

    Content

    This is the anonymized dataset obtained from the DPR Experiment run at ESO in Fall 2018.

    Acknowledgements

    Previous work is available at 10.1038/s41550-020-1038-y.

  5. Datasets for federated learning

    • kaggle.com
    zip
    Updated Dec 29, 2022
    Cite
    wonghoitin (2022). Datasets for federated learning [Dataset]. https://www.kaggle.com/wonghoitin/datasets-for-federated-learning
    Explore at:
    Available download formats: zip (30618359 bytes)
    Dataset updated
    Dec 29, 2022
    Authors
    wonghoitin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Federated learning builds machine learning models from data sets that are distributed across multiple devices while preventing data leakage (Q. Yang et al. 2019).

    source:

    1. smoking https://www.kaggle.com/datasets/kukuroo3/body-signal-of-smoking license = CC0: Public Domain

    2. heart https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset license = CC0: Public Domain

    3. water https://www.kaggle.com/datasets/adityakadiwal/water-potability license = CC0: Public Domain

    4. customer https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis license = CC0: Public Domain

    5. insurance https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data license = CC0: Public Domain

    6. credit https://www.kaggle.com/datasets/ajay1735/hmeq-data license = CC0: Public Domain

    7. income https://www.kaggle.com/datasets/mastmustu/income license = CC0: Public Domain

    8. machine https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification license = CC0: Public Domain

    9. skin https://www.kaggle.com/datasets/saurabhshahane/lumpy-skin-disease-dataset license = Attribution 4.0 International (CC BY 4.0)

    10. score https://www.kaggle.com/datasets/parisrohan/credit-score-classification?select=train.csv license = CC0: Public Domain

  6. Edelweiss Image Dataset

    • kaggle.com
    zip
    Updated Jun 19, 2022
    Cite
    Fransiscus Rolanda Malau (2022). Edelweiss Image Dataset [Dataset]. https://www.kaggle.com/datasets/ndomalau/edelweis-flower
    Explore at:
    Available download formats: zip (12912266177 bytes)
    Dataset updated
    Jun 19, 2022
    Authors
    Fransiscus Rolanda Malau
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Context

    Image classification is one of the fundamental tasks in computer vision and machine learning. High-quality datasets are crucial for training robust models that can accurately identify different species. This dataset focuses on three distinct species commonly found in mountainous regions, providing a balanced collection of images for both training and evaluation purposes.

    Content

    This dataset contains 4,550 high-quality images distributed across three categories:
    • Training set: 3,500 images (approximately 1,167 images per class)
    • Test set: 1,050 images (350 images per class)

    The dataset is organized in a structured format with separate directories for:
    1. Anaphalis Javanica
    2. Leontopodium Alpinum
    3. Leucogenes Grandiceps

    Each image in the dataset has been carefully prepared to ensure consistency and quality for machine learning applications. The balanced distribution between classes helps prevent bias during model training.

    Applications

    • Species classification and identification
    • Computer vision model development
    • Educational purposes in botany and biodiversity studies
    • Benchmarking machine learning algorithms

    The dataset's clean split between training and test sets makes it ideal for developing and evaluating classification models while following machine learning best practices.

  7. SpaceNet: A Comprehensive Astronomical Dataset

    • kaggle.com
    zip
    Updated Aug 30, 2024
    Cite
    Raza Imam (2024). SpaceNet: A Comprehensive Astronomical Dataset [Dataset]. https://www.kaggle.com/datasets/razaimam45/spacenet-an-optimally-distributed-astronomy-data
    Explore at:
    Available download formats: zip (56552989870 bytes)
    Dataset updated
    Aug 30, 2024
    Authors
    Raza Imam
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    SpaceNet, obtained via a novel double-stage augmentation framework, FLARE (https://arxiv.org/pdf/2405.13267), is a hierarchically structured, high-quality astronomical image dataset designed for fine-grained and macro classification tasks. Comprising approximately 12,900 samples, SpaceNet integrates lower-resolution (LR) to higher-resolution (HR) conversion with standard augmentations and a diffusion approach for synthetic sample generation. This dataset enables superior generalization on various recognition tasks such as classification.

    Dataset Structure

    • Fine-Grained Classes: 8 classes including planets, galaxies, asteroids, nebulae, comets, black holes, stars, and constellations.

    Dataset Composition

    Total Samples: Approximately 12,900 images.

    Fine-Grained Class Distribution:
    • Asteroid: 283 files
    • Black Hole: 656 files
    • Comet: 416 files
    • Constellation: 1,552 files
    • Galaxy: 3,984 files
    • Nebula: 1,192 files
    • Planet: 1,472 files
    • Star: 3,269 files
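
Because the class sizes above differ by more than a factor of ten, a classifier trained on them may benefit from per-class loss weights. The sketch below applies a common inverse-frequency heuristic, `total / (n_classes * count)`, to the listed file counts. The heuristic is an assumption added for illustration and is not described in the dataset card.

```python
# File counts copied from the class distribution listed above.
class_counts = {
    "Asteroid": 283,
    "Black Hole": 656,
    "Comet": 416,
    "Constellation": 1552,
    "Galaxy": 3984,
    "Nebula": 1192,
    "Planet": 1472,
    "Star": 3269,
}

total = sum(class_counts.values())
n_classes = len(class_counts)

# Inverse-frequency weights: rarer classes receive larger weights.
class_weights = {
    name: total / (n_classes * count) for name, count in class_counts.items()
}

for name, weight in sorted(class_weights.items(), key=lambda kv: -kv[1]):
    print(f"{name:13s} {weight:.2f}")
```

The listed counts sum to 12,824, which is consistent with the "approximately 12,900" total stated in the card.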

    Usage

    SpaceNet is suitable for:

    • Training and evaluating machine learning models on fine-grained and macro astronomical classification tasks.
    • Research on hierarchical classification approaches in the astronomy domain.
    • Developing robust models that generalize well across in-domain and out-of-domain datasets.

    Citation

    If you use SpaceNet in your research, please cite it as follows:

    @misc{alamimam2024flare,
      title={FLARE up your data: Diffusion-based Augmentation Method in Astronomical Imaging},
      author={Mohammed Talha Alam and Raza Imam and Mohsen Guizani and Fakhri Karray},
      year={2024},
      eprint={2405.13267},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }

  8. Distributed Digital Learning Student Dataset

    • kaggle.com
    zip
    Updated Nov 10, 2025
    Cite
    zyan1999 (2025). Distributed Digital Learning Student Dataset [Dataset]. https://www.kaggle.com/datasets/zyan1999/distributed-digital-learning-student-dataset
    Explore at:
    Available download formats: zip (48163 bytes)
    Dataset updated
    Nov 10, 2025
    Authors
    zyan1999
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset consists of 2,500 student records collected from multiple institutions, capturing demographic information, learning habits, and engagement metrics. Each record includes features such as age, gender, study hours per week, attendance rate, assignment and quiz scores, participation score, internet access quality, and frequency of resource usage. The target column, final_grade, categorizes student performance as High, Medium, or Low. Designed to support research on distributed digital learning systems, this dataset enables analysis of multi-institutional collaboration, personalized learning, and performance prediction while preserving student and institutional privacy.

    Column Description:

    student_id: A unique identifier for each student.

    institution_id: The institution or organization to which the student belongs.

    age: The student’s age in years.

    gender: The student’s gender (Male, Female, or Other).

    study_hours_per_week: Average number of hours the student spends studying weekly.

    attendance_rate: Percentage of classes attended by the student.

    assignment_score: Average score obtained by the student on assignments (0–100).

    quiz_score: Average score obtained by the student on quizzes (0–100).

    participation_score: Level of engagement in class discussions or activities (0–100).

    internet_access_quality: Rating of the student’s internet connection quality (1–5).

    resource_access_frequency: Number of times the student accesses learning resources per week.

    final_grade: Overall performance category of the student (High, Medium, or Low).
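
The ranges in the column descriptions can be turned into a small sanity check before modelling. This is a minimal sketch: the column names and ranges come from the list above, while the helper `validate_record` and the sample record values are hypothetical.

```python
VALID_GRADES = {"High", "Medium", "Low"}
SCORE_COLUMNS = ("assignment_score", "quiz_score", "participation_score")


def validate_record(record: dict) -> list:
    """Return a list of range violations for one student record."""
    problems = []
    for column in SCORE_COLUMNS:
        if not 0 <= record[column] <= 100:
            problems.append(f"{column} outside 0-100")
    if not 1 <= record["internet_access_quality"] <= 5:
        problems.append("internet_access_quality outside 1-5")
    if record["final_grade"] not in VALID_GRADES:
        problems.append("final_grade not one of High/Medium/Low")
    return problems


# Hypothetical record; the values are made up for illustration.
sample = {
    "student_id": "S0001",
    "institution_id": "I01",
    "assignment_score": 82.5,
    "quiz_score": 74.0,
    "participation_score": 110.0,  # deliberately invalid
    "internet_access_quality": 4,
    "final_grade": "High",
}
print(validate_record(sample))
```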

  9. Student Performance Dataset

    • kaggle.com
    Updated Aug 27, 2025
    Cite
    Ghulam Muhammad Nabeel (2025). Student Performance Dataset [Dataset]. https://www.kaggle.com/datasets/nabeelqureshitiii/student-performance-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 27, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ghulam Muhammad Nabeel
    Description

    📊 Student Performance Dataset (Synthetic, Realistic)

    Overview

    This dataset contains 1,000,000 rows of realistic student performance data, designed for beginners in Machine Learning to practice Linear Regression, model training, and evaluation techniques.

    Each row represents one student with features like study hours, attendance, class participation, and final score.
    The dataset is small, clean, and structured to be beginner-friendly.

    🔑 Columns Description

    • student_id → Unique identifier for each student.
    • weekly_self_study_hours → Average weekly self-study hours (0–40). Generated using a normal distribution centered around 15 hours.
    • attendance_percentage → Attendance percentage (50–100). Simulated with a normal distribution around 85%.
    • class_participation → Score between 0–10 indicating how actively the student participates in class. Generated from a normal distribution centered around 6.
    • total_score → Final performance score (0–100). Calculated as a function of study hours + random noise, then clipped between 0–100. Stronger correlation with study hours.
    • grade → Categorical label (A, B, C, D, F) derived from total_score.

    📐 Data Generation Logic

    1. Weekly Study Hours: Modeled using a normal distribution (mean ≈ 15, std ≈ 7), capped between 0 and 40 hours.
    2. Scores: More study hours → higher score. Formula:

    Random noise simulates differences in learning ability, motivation, etc.

    3. Attendance & Participation: Independent but realistic variations added.
    4. Grades: Assigned from scores using thresholds:
    • A: ≥ 85
    • B: ≥ 70
    • C: ≥ 55
    • D: ≥ 40
    • F: < 40
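
The grade thresholds above map directly onto a small function. A minimal sketch, where the function name `assign_grade` is illustrative and not taken from the dataset's own code:

```python
def assign_grade(total_score: float) -> str:
    """Map a 0-100 score to a letter grade using the listed thresholds."""
    if total_score >= 85:
        return "A"
    if total_score >= 70:
        return "B"
    if total_score >= 55:
        return "C"
    if total_score >= 40:
        return "D"
    return "F"


for score in (92, 85, 69.9, 55, 40, 12):
    print(score, assign_grade(score))
```

Checking the thresholds from highest to lowest means each score falls into the first band it satisfies, so no upper bounds are needed.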

    🎯 How to Use This Dataset

    Regression Tasks

    • Predict total_score from weekly_self_study_hours.
    • Train and evaluate Linear Regression models.
    • Extend to multiple regression using attendance_percentage and class_participation.

    Classification Tasks

    • Predict grade (A–F) using study hours, attendance, and participation.

    Model Evaluation Practice

    • Apply train-test split and cross-validation.
    • Evaluate with MAE, RMSE, R².
    • Compare simple vs. multiple regression.
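
As a rough illustration of the evaluation practice listed above, the following sketch fits a one-variable least-squares line and computes MAE, RMSE, and R² in plain Python. The five data points are made up, and the metrics are computed on the same points used for fitting only to keep the example short. With the real data, a train-test split as suggested above would be the better choice.

```python
import math


def fit_simple_linear(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return mae, rmse, 1 - ss_res / ss_tot


# Hypothetical (weekly_self_study_hours, total_score) pairs.
hours = [5, 10, 15, 20, 25]
scores = [40, 52, 61, 75, 82]

slope, intercept = fit_simple_linear(hours, scores)
predictions = [slope * h + intercept for h in hours]
mae, rmse, r2 = regression_metrics(scores, predictions)
print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```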

    ✅ This dataset is intentionally kept simple, so that new ML learners can clearly see the relationship between input features (study, attendance, participation) and output (score/grade).

  10. Garbage Dataset

    • kaggle.com
    zip
    Updated Dec 12, 2024
    + more versions
    Cite
    Suman Kunwar (2024). Garbage Dataset [Dataset]. https://www.kaggle.com/datasets/sumn2u/garbage-classification-v2
    Explore at:
    Available download formats: zip (780289207 bytes)
    Dataset updated
    Dec 12, 2024
    Authors
    Suman Kunwar
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset contains images of garbage items categorized into 10 classes, designed for machine learning and computer vision projects focusing on recycling and waste management. It is ideal for building classification or object detection models or developing AI-powered solutions for sustainable waste disposal.

    Dataset Summary

    The dataset features 10 distinct classes of garbage with a total of 19,762 images, distributed as follows:

    • Metal: 1020
    • Glass: 3061
    • Biological: 997
    • Paper: 1680
    • Battery: 944
    • Trash: 947
    • Cardboard: 1825
    • Shoes: 1977
    • Clothes: 5327
    • Plastic: 1984
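
The per-class counts can be checked against the stated total, and the spread between the largest and smallest class quantified, with a few lines of Python. This is only a sketch over the numbers listed above.

```python
# Image counts copied from the list above.
class_counts = {
    "Metal": 1020, "Glass": 3061, "Biological": 997, "Paper": 1680,
    "Battery": 944, "Trash": 947, "Cardboard": 1825, "Shoes": 1977,
    "Clothes": 5327, "Plastic": 1984,
}

total = sum(class_counts.values())
largest = max(class_counts, key=class_counts.get)
smallest = min(class_counts, key=class_counts.get)
imbalance_ratio = class_counts[largest] / class_counts[smallest]

print(f"total images: {total}")
print(f"largest class: {largest}, smallest class: {smallest}")
print(f"largest/smallest ratio: {imbalance_ratio:.2f}")
```

The counts do sum to the stated 19,762, and the largest class is more than five times the size of the smallest, so stratified splits or class weighting may be worth considering.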

    Key Features

    • Diverse Categories: Covers common household waste items for a wide range of applications.
    • Balanced Distribution: Each class is sufficiently populated, ensuring robust model training.
    • High-Quality Images: Clear and well-annotated images for better performance in computer vision tasks.
    • Real-World Applications: Ideal for building recycling solutions, waste segregation apps, and educational tools.

    Academic Reference

    The dataset was featured in the research paper "Managing Household Waste Through Transfer Learning", showcasing its utility in real-world applications. Researchers and developers can replicate or extend the experiments for further studies.

    Applications

    • AI for Sustainability: Train AI models to classify garbage and promote automated waste management.
    • Recycling Programs: Build systems to sort garbage into recyclable and non-recyclable materials.
    • Environmental Education: Develop tools to teach kids and adults about proper waste disposal.

    Feedback

    Thank you for your interest in our waste dataset. Whether you have used the dataset or are considering its use, your feedback is crucial to help us understand your needs and improve the dataset. Please take a few minutes to share your thoughts and experiences through this feedback form. Your input is greatly appreciated.

    We also welcome feedback and contributions to our project on GitHub. Your suggestions and collaboration can help us enhance the dataset and improve the model's performance. Let's work together to make a positive difference!

  11. avila_dataset

    • kaggle.com
    zip
    Updated May 10, 2022
    Cite
    HRITABAN GHOSH (2022). avila_dataset [Dataset]. https://www.kaggle.com/datasets/hritaban02/avila-dataset
    Explore at:
    Available download formats: zip (604026 bytes)
    Dataset updated
    May 10, 2022
    Authors
    HRITABAN GHOSH
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is made from the Avila dataset obtained from the UCI Machine Learning Repository. Here is the description of the data from the above source:

    Data Set Information:

    Data have been normalized using the Z-normalization method and divided into two data sets: a training set containing 10430 samples, and a test set containing 10437 samples.
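
Z-normalization rescales each feature to zero mean and unit standard deviation. The sketch below shows the idea for a single column. It assumes the statistics are estimated on the training column and reused for the test column, which the source does not specify, and the sample values are made up.

```python
import statistics


def z_normalize(train_column, test_column):
    """Scale both columns using the mean and std of the training column."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)  # population standard deviation
    train_z = [(v - mean) / std for v in train_column]
    test_z = [(v - mean) / std for v in test_column]
    return train_z, test_z


# Hypothetical raw values for one feature.
train_raw = [2, 4, 4, 4, 5, 5, 7, 9]
test_raw = [5, 9]
train_z, test_z = z_normalize(train_raw, test_raw)
print(train_z)
print(test_z)
```

Since the published files are already normalized, this step is only needed when adding new, unnormalized samples.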

    CLASS DISTRIBUTION (training set)

    A: 4286, B: 5, C: 103, D: 352, E: 1095, F: 1961, G: 446, H: 519, I: 831, W: 44, X: 522, Y: 266

    Attribute Information:

    • F1: intercolumnar distance
    • F2: upper margin
    • F3: lower margin
    • F4: exploitation
    • F5: row number
    • F6: modular ratio
    • F7: interlinear spacing
    • F8: weight
    • F9: peak number
    • F10: modular ratio / interlinear spacing
    • Class: A, B, C, D, E, F, G, H, I, W, X, Y

  12. Dog vs Cat

    • kaggle.com
    Updated Sep 28, 2024
    Cite
    AnthonyTherrien (2024). Dog vs Cat [Dataset]. http://doi.org/10.34740/kaggle/dsv/9498291
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 28, 2024
    Dataset provided by
    Kaggle
    Authors
    AnthonyTherrien
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Overview

    This dataset contains a total of 1000 images, split equally between 500 images of dogs and 500 images of cats. The images are standardized to a resolution of 512x512 pixels.

    Details

    • Total Images: 1000
      • Dog: 500 images
      • Cat: 500 images
    • Image Resolution: 512x512 pixels
    • File Format: .png
    • Source: Images generated using Stable Diffusion 1.5

    Usage

    This dataset is ideal for tasks such as:
    • Binary classification
    • Image recognition and processing
    • Machine learning and deep learning model training

  13. Car vs Bike Classification Dataset

    • kaggle.com
    • gts.ai
    zip
    Updated Oct 28, 2022
    + more versions
    Cite
    DeepNets (2022). Car vs Bike Classification Dataset [Dataset]. https://www.kaggle.com/datasets/utkarshsaxenadn/car-vs-bike-classification-dataset/code
    Explore at:
    Available download formats: zip (107824115 bytes)
    Dataset updated
    Oct 28, 2022
    Authors
    DeepNets
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This data set is a collection of 2,000 bike and car images. The collection was assembled to include all types of bikes and cars, because both classes have high intra-class variety. That variety makes the task a little tougher, since the model must also cope with the many different kinds of bikes and cars. But if your model is able to understand the basic structure of a car and a bike, it will be able to distinguish between both classes.

    The data is not preprocessed. This is done intentionally so that you can apply the augmentations you want to use. Almost all the 2000 images are unique. So after applying some data augmentation, you can increase the size of the data set.

    The data is not split into training and validation subsets, but you can easily do so with an image data generator from Keras. The preprocessing steps are available in my notebook associated with this data set. You can practice your computer vision skills using this data set. This is a binary classification task.
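
Since the images ship without a split, one has to be created before training. The description suggests Keras for this. The sketch below swaps in a framework-free alternative that shuffles file paths with a fixed seed and holds out a fraction for validation. The file names, the 1,000/1,000 class counts, and the 80/20 ratio are hypothetical.

```python
import random


def split_train_val(paths, val_fraction=0.2, seed=42):
    """Shuffle paths reproducibly and hold out a validation fraction."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical file names standing in for the 2,000 images.
paths = [f"Car/img_{i}.jpg" for i in range(1000)]
paths += [f"Bike/img_{i}.jpg" for i in range(1000)]

train_paths, val_paths = split_train_val(paths)
print(len(train_paths), len(val_paths))
```

Fixing the seed keeps the split identical across runs, so validation scores from different experiments stay comparable.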

  14. Phishing URL Content Dataset

    • kaggle.com
    zip
    Updated Nov 25, 2024
    Cite
    Aaditey Pillai (2024). Phishing URL Content Dataset [Dataset]. https://www.kaggle.com/datasets/aaditeypillai/phishing-website-content-dataset
    Explore at:
    Available download formats: zip (62701 bytes)
    Dataset updated
    Nov 25, 2024
    Authors
    Aaditey Pillai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Phishing URL Content Dataset

    Executive Summary

    Motivation:
    Phishing attacks are one of the most significant cyber threats in today’s digital era, tricking users into divulging sensitive information like passwords, credit card numbers, and personal details. This dataset aims to support research and development of machine learning models that can classify URLs as phishing or benign.

    Applications:
    - Building robust phishing detection systems.
    - Enhancing security measures in email filtering and web browsing.
    - Training cybersecurity practitioners in identifying malicious URLs.

    The dataset contains diverse features extracted from URL structures, HTML content, and website metadata, enabling deep insights into phishing behavior patterns.

    Description of Data

    This dataset comprises two types of URLs:
    1. Phishing URLs: Malicious URLs designed to deceive users.
    2. Benign URLs: Legitimate URLs posing no harm to users.

    Key Features:
    - URL-based features: Domain, protocol type (HTTP/HTTPS), and IP-based links.
    - Content-based features: Link density, iframe presence, external/internal links, and metadata.
    - Certificate-based features: SSL/TLS details like validity period and organization.
    - WHOIS data: Registration details like creation and expiration dates.
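
The URL-based features listed above (domain, protocol type, IP-based links) can be derived with the Python standard library. This is a minimal sketch, not the repository's extraction code: the function name `url_features`, the dictionary keys, and the example URLs are all hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Extract a few URL-structure features from a single URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        # Succeeds only when the host is a literal IPv4 or IPv6 address.
        ipaddress.ip_address(host)
        is_ip_based = True
    except ValueError:
        is_ip_based = False
    return {
        "protocol": parsed.scheme,
        "domain": host,
        "is_ip_based": is_ip_based,
        "url_length": len(url),
    }


print(url_features("http://192.168.0.1/login"))
print(url_features("https://example.com/account"))
```

Content-based, certificate-based, and WHOIS features need network access and are left out of this sketch.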

    Statistics:
    - Total Samples: 800 (400 phishing, 400 benign).
    - Features: 22 including URL, domain, link density, and SSL attributes.

    Power Analysis

    To ensure statistical reliability, a power analysis was conducted to determine the minimum sample size required for binary classification with 22 features. Using a medium effect size (0.15), alpha = 0.05, and power = 0.80, the analysis indicated a minimum sample size of ~325 per class. Our dataset exceeds this requirement with 400 examples per class, ensuring robust model training.

    Exploratory Data Analysis (EDA)

    Insights from EDA:
    - Distribution Plots: Histograms and density plots for numerical features like link density, URL length, and iframe counts.
    - Bar Plots: Class distribution and protocol usage trends.
    - Correlation Heatmap: Highlights relationships between numerical features to identify multicollinearity or strong patterns.
    - Box Plots: For SSL certificate validity and URL lengths, comparing phishing versus benign URLs.

    EDA visualizations are provided in the repository.

    Link to Publicly Available Data and Code

    The repository contains the Python code used to extract features, conduct EDA, and build the dataset.

    Ethics Statement

    Phishing detection datasets must balance the need for security research with the risk of misuse. This dataset:
    1. Protects User Privacy: No personally identifiable information is included.
    2. Promotes Ethical Use: Intended solely for academic and research purposes.
    3. Avoids Reinforcement of Bias: Balanced class distribution ensures fairness in training models.

    Risks:
    - Misuse of the dataset for creating more deceptive phishing attacks.
    - Over-reliance on outdated features as phishing tactics evolve.

    Researchers are encouraged to pair this dataset with continuous updates and contextual studies of real-world phishing.

    Open Source License

    This dataset is shared under the MIT License, allowing free use, modification, and distribution for academic and non-commercial purposes. License details can be found here.

  15. Federated Health Records Dataset

    • kaggle.com
    zip
    Updated May 15, 2025
    Cite
    Ziya (2025). Federated Health Records Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/federated-health-records-dataset
    Explore at:
    Available download formats: zip (67310 bytes)
    Dataset updated
    May 15, 2025
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset, titled "Federated Health Records for Privacy-Preserving AI Research," is a healthcare dataset designed to support research and experimentation in Federated Learning (FL) and Homomorphic Encryption (HE) for secure artificial intelligence applications.

    Each record represents a simulated patient's health profile, including key features such as age, BMI, blood pressure, glucose and insulin levels, physical activity, and diet quality. The dataset is partitioned by client_id, simulating data distributed across multiple hospitals or mobile devices, where direct data sharing is restricted due to privacy concerns.
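
Partitioning by client_id amounts to grouping rows so that each simulated hospital or device only ever sees its own shard. A minimal sketch follows: `client_id` and `risk_of_diabetes` are column names from the description, while the other field names, the client labels, and all values are hypothetical.

```python
from collections import defaultdict


def partition_by_client(records):
    """Group records into per-client shards keyed by client_id."""
    shards = defaultdict(list)
    for record in records:
        shards[record["client_id"]].append(record)
    return dict(shards)


# Hypothetical rows; real column names other than client_id and
# risk_of_diabetes may differ.
records = [
    {"client_id": "hospital_a", "age": 54, "bmi": 31.2, "risk_of_diabetes": 1},
    {"client_id": "hospital_b", "age": 37, "bmi": 24.8, "risk_of_diabetes": 0},
    {"client_id": "hospital_a", "age": 61, "bmi": 28.5, "risk_of_diabetes": 1},
]

shards = partition_by_client(records)
for client, rows in shards.items():
    print(client, len(rows))
```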

    The target variable, risk_of_diabetes, is a binary indicator derived from a logistic function applied to health metrics, helping researchers model classification tasks in a privacy-aware environment.

    💡 Key Features

    • Federated-ready: Labeled by client_id to simulate decentralized data sources.
    • Privacy-focused: Supports homomorphic encryption-based model updates.
    • Flexible use: Suitable for classification, secure model aggregation, and robustness testing.

  16. dataset of SMC

    • kaggle.com
    zip
    Updated May 24, 2024
    Cite
    Tan jingwen999 (2024). dataset of SMC [Dataset]. https://www.kaggle.com/datasets/tanjingwen999/dataset-of-smc
    Explore at:
    Available download formats: zip (300604211 bytes)
    Dataset updated
    May 24, 2024
    Authors
    Tan jingwen999
    Description

    This is the dataset used in "An Adaptability-Enhanced Few-Shot Website Fingerprinting Attack Based on Collusion", which consists of the following four datasets.

    [1] V. Rimmer, D. Preuveneers, M. Juarez, T. Van Goethem, and W. Joosen, “Automated website fingerprinting through deep learning,” Network and Distributed System Security Symposium, 2017.
    [2] P. Sirinam, M. Imani, M. Juarez, and M. Wright, “Deep fingerprinting: Undermining website fingerprinting defenses with deep learning,” in Proceedings of the 2018 ACM Conference on Computer and Communications Security, 2018, pp. 1928–1943.
    [3] T. Wang, X. Cai, R. Nithyanand, R. Johnson, and I. Goldberg, “Effective attacks and provable defenses for website fingerprinting,” in 23rd USENIX Security Symposium, 2014, pp. 143–157.
    [4] J. Gong and T. Wang, “Zero-delay lightweight defenses against website fingerprinting,” in 29th USENIX Security Symposium, 2020, pp. 717–734.

  17. Software Engineering Interview Questions Dataset

    • kaggle.com
    Updated Dec 21, 2023
    Cite
    syedmharis (2023). Software Engineering Interview Questions Dataset [Dataset]. https://www.kaggle.com/datasets/syedmharis/software-engineering-interview-questions-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 21, 2023
    Dataset provided by
    Kaggle
    Authors
    syedmharis
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Comprehensive Software Engineering Interview Questions Dataset

    Overview

    This dataset is an extensive collection of software engineering interview questions, designed to mirror the complexity and depth of questions asked in interviews at top tech companies, including FAANG (Facebook, Amazon, Apple, Netflix, Google). It encompasses a wide range of topics, from algorithms and data structures to system design and machine learning. The dataset is curated to assist candidates in preparing for technical interviews and to provide educators and interviewers with a resource for assessing technical skills.

    Dataset Details

    Number of Questions: 250
    Categories Covered: Algorithms, System Design, Machine Learning, Data Structures, Distributed Systems, Networking, Low-level Systems, Security, Database Systems, Artificial Intelligence, Data Engineering.
    Difficulty Level: Primarily Hard.
    Format: The dataset is structured in a tabular format with columns for Question Number, Question, Brief Answer, Category, and Difficulty.
    Usage Scenarios: Interview preparation for candidates, educational resource for learning advanced software engineering concepts, tool for interviewers to structure technical assessments.

    Potential Analysis

    Users can perform various analyses, such as:

    Category-wise Distribution: Understand the focus areas in software engineering roles.
    Difficulty Analysis: Gauge the complexity level of questions typically asked in high-end tech interviews.
    Trend Analysis: Identify trends in technical questions over recent years, especially in rapidly evolving fields like Machine Learning and AI.

    Inspiration

    This dataset is intended to inspire:

    Job Candidates: To prepare comprehensively for technical interviews.
    Educators: To structure curriculum or coursework around practical, interview-oriented learning.
    Researchers: To analyze trends in technical interviews and skill requirements in the tech industry.
    Interviewers/Hiring Managers: To formulate effective interview strategies and questionnaires.
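    The category-wise distribution mentioned above could be computed along these lines. This is a sketch only: the column names follow the format described in the listing, but the sample rows are placeholders rather than real entries from the dataset, and the actual file name and header spelling are not confirmed here.

```python
import csv
import io
from collections import Counter

# Placeholder rows using the column names given in the description.
SAMPLE_CSV = """Question Number,Question,Brief Answer,Category,Difficulty
1,Placeholder question A,Placeholder answer,Algorithms,Hard
2,Placeholder question B,Placeholder answer,System Design,Hard
3,Placeholder question C,Placeholder answer,Algorithms,Medium
"""


def category_counts(csv_text):
    """Count how many questions fall into each Category value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["Category"] for row in reader)


counts = category_counts(SAMPLE_CSV)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

    Swapping "Category" for "Difficulty" in the same function gives the difficulty breakdown.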

  18. Student Learning Methods: A Survey

    • kaggle.com
    zip
    Updated Apr 3, 2025
    Cite
    Praneth P (2025). Student Learning Methods: A Survey [Dataset]. https://www.kaggle.com/datasets/pranethp/student-learning-methods-a-survey
    Explore at:
    zip (20219 bytes)
    Dataset updated
    Apr 3, 2025
    Authors
    Praneth P
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Student Learning Methods: A Survey - Dataset Description

    The Student Learning Methods: A Survey dataset comprises responses from 100 university students, with 10 participants surveyed from each of 10 different universities. This dataset explores students' preferences and evaluations of various learning methods based on effectiveness and engagement.

    Key Features of the Dataset:

    Survey Scope:

    Responses collected from 100 students.

    Participants evenly distributed across 10 different universities (10 students per university).

    Learning Methods Evaluated:

    The dataset includes ratings for various learning techniques, such as:

    Lectures – Traditional classroom-based teaching.

    Case Studies – Analyzing real-world scenarios to understand concepts.

    Group Projects – Collaborative assignments involving multiple students.

    Experiments – Hands-on practical work in labs or controlled settings.

    Online Tutorials – Digital or video-based instructional materials.

    Evaluation Criteria:

    Each learning method is rated on a numerical scale based on:

    Effectiveness – How well students believe the method helps in learning.

    Engagement – How interesting or interactive the method is perceived to be.

    Secondary Evaluations:

    The dataset includes repeated columns for learning methods, potentially representing:

    Post-survey reflections where students reassessed their initial responses.

    Comparative evaluations of different methods after exposure to multiple approaches.

    Overall Effectiveness and Engagement Scores:

    Each student provides aggregate scores summarizing how useful and engaging they found different learning methods overall.

    Potential Use Cases:

    Educational Research – Understanding which teaching techniques are most effective across universities.

    Curriculum Development – Helping educators refine teaching strategies.

    Student-Centric Learning Models – Identifying preferred methods to enhance student engagement.

    Comparative Analysis – Examining how student preferences vary across universities.

    The surveyed universities include:

    Delhi University (DU) – A large central university in Delhi.

    Jawaharlal Nehru University (JNU) – A well-known research-focused university in Delhi.

    Banaras Hindu University (BHU) – A prestigious university in Varanasi, Uttar Pradesh.

    Aligarh Muslim University (AMU) – A renowned university in Aligarh, Uttar Pradesh.

    Chandigarh University – A fast-growing private university in Punjab.

    Kurukshetra University – A public university in Haryana.

    Himachal Pradesh University (HPU) – A state university in Shimla, Himachal Pradesh.

    Guru Gobind Singh Indraprastha University (GGSIPU) – A Delhi-based state university.

    Dr. B. R. Ambedkar University, Agra – A public university in Uttar Pradesh.

    Uttarakhand Technical University (UTU) – A state technical university in Uttarakhand.

    This dataset offers valuable insights into student learning preferences, enabling researchers and educators to tailor teaching methods for maximum impact.

    Recently Updated Version

  19. ImageNet-R

    • kaggle.com
    Updated May 5, 2025
    Cite
    my1nonly (2025). ImageNet-R [Dataset]. https://www.kaggle.com/datasets/my1nonly/imagenet-r
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 5, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    my1nonly
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    ImageNet-R (ImageNet-Renditions) is a variant of the original ImageNet dataset designed to evaluate the robustness and generalization of image classification models to out-of-distribution (OOD) data. It contains 30,000 images corresponding to 200 classes from ImageNet, but instead of natural photographs, the images are renditions—such as sketches, paintings, cartoons, embroidery, and clay sculptures—that significantly differ in texture and appearance from the original training distribution.

    ImageNet-R serves as a benchmark for assessing how well models trained on standard ImageNet data perform when exposed to domain-shifted inputs, especially those involving non-naturalistic visual styles. It highlights the tendency of many deep learning models to rely heavily on texture rather than shape cues, thus revealing potential brittleness in real-world deployment scenarios.

  20. Sin_captcha_images

    • kaggle.com
    zip
    Updated May 7, 2025
    Cite
    _SindiK_ (2025). Sin_captcha_images [Dataset]. https://www.kaggle.com/datasets/sindik/sin-captcha-images/code
    Explore at:
    zip (12334116 bytes)
    Dataset updated
    May 7, 2025
    Authors
    _SindiK_
    License

    GPL-2 (http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)

    Description

    Dataset Description

    This dataset is intended for training and testing machine learning models for CAPTCHA recognition. It contains CAPTCHA images along with corresponding filenames, which represent the text displayed on the image.

    Dataset Content

    The dataset includes more than 1,500 CAPTCHA images, each stored under a filename that matches the text shown in the image. The images are provided in standard formats.

    Purpose

    The goal of this dataset is to provide data for training machine learning models, including neural networks, for solving CAPTCHA recognition tasks. This dataset can be used for classification tasks and optical character recognition (OCR) challenges.

    Dataset Structure

    • Images: Each file is a CAPTCHA image.
    • Filenames: Each image has a corresponding filename that represents the correct answer (the text shown in the CAPTCHA).

    Example

    • Image: abc123.png
    • Answer: abc123
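    Because the answer is encoded in the filename, a label can be recovered by dropping the extension. The sketch below assumes that convention holds for every file; the directory name in the second example is hypothetical.

```python
from pathlib import PurePath


def label_from_filename(path):
    """Return the CAPTCHA text encoded in an image filename (name without extension)."""
    return PurePath(path).stem


# Pair each image path with its label, e.g. for building a training list.
paths = ["abc123.png", "captchas/x9k2.jpg"]
pairs = [(p, label_from_filename(p)) for p in paths]
```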

    License

    This dataset is distributed under the GPL-2 license. The GPL-2 license allows for the use, distribution, and modification of the dataset, with the condition that derivative works must also be distributed under the GPL-2 license. Users must also provide access to the source code if modifications are used to create new projects.

    Data Sources and Credits

    This dataset is based in part on CAPTCHA Data by alizahidraja. The original images from that dataset were used as a foundation, and additional custom CAPTCHA images have been added to expand and diversify the dataset.

    This combination aims to provide a richer and more varied training set for machine learning models focused on CAPTCHA recognition.

    Important Notes

    • When using CAPTCHA images generated by third-party services, ensure that you are not infringing on any copyrights and comply with all legal requirements.
    • This dataset is suitable for research and educational purposes. It can also be used for solving challenges in artificial intelligence and machine learning related to text recognition in images.
