55 datasets found

Distributed Training with Kubeflow
kaggle.com
Updated Jul 30, 2021
Cite
Camelia Ben Laamari (2021). Distributed Training with Kubeflow [Dataset]. https://www.kaggle.com/cameliabenlaamari/distributed-training-with-kubeflow
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Jul 30, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Camelia Ben Laamari
Description
Dataset

This dataset was created by Camelia Ben Laamari

Contents
NVIDIA Apex
kaggle.com
zip
Updated Apr 14, 2020
Cite
Kiran Kunapuli (2020). NVIDIA Apex [Dataset]. https://www.kaggle.com/kirankunapuli/nvidia-apex
Explore at:
Available download formats: zip (548658 bytes)
Dataset updated
Apr 14, 2020
Authors
Kiran Kunapuli
Description
How to use

Add this dataset to your notebook, then execute the following command in a new cell:

!cd ../input/nvidia-apex && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Context

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch

Content

https://github.com/NVIDIA/apex/blob/master/README.md, as of 14 April 2020.

Acknowledgements

NVIDIA Apex Photo by Cas Magee on Unsplash License
CYBRIA - Federated Learning Network Security - IoT
kaggle.com
zip
Updated Apr 22, 2024
Cite
ptdevsecops (2024). CYBRIA - Federated Learning Network Security - IoT [Dataset]. https://www.kaggle.com/datasets/ptdevsecops/cybria-federated-learning-network-security-iot
Explore at:
Available download formats: zip (6873653 bytes)
Dataset updated
Apr 22, 2024
Authors
ptdevsecops
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
**CYBRIA - Pioneering Federated Learning for Privacy-Aware Cybersecurity with Brilliance**

This research studies a federated learning framework for collaborative cyber threat detection without compromising confidential data. The decentralized approach trains models on local data distributed across clients and shares only intermediate model updates to generate an integrated global model.

**If you use this dataset and code, or any modified part of it, in any publication, please cite this paper:**

P. Thantharate and A. T, "CYBRIA - Pioneering Federated Learning for Privacy-Aware Cybersecurity with Brilliance," 2023 IEEE 20th International Conference on Smart Communities: Improving Quality of Life using AI, Robotics and IoT (HONET), Boca Raton, FL, USA, 2023, pp. 56-61, doi: 10.1109/HONET59747.2023.10374608.

For any questions or research queries, please reach out via email.

Key Objectives
- Develop a federated learning framework called Cybria for collaborative cyber threat detection without compromising confidential data
- Evaluate model performance for intrusion detection using the Bot-IoT dataset

Proposed Solutions
- Designed a privacy-preserving federated learning architecture tailored for cybersecurity applications
- Implemented the Cybria model using TensorFlow Federated and Flower libraries
- Employed a decentralized approach where models are trained locally on clients and only model updates are shared
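The idea of training locally and sharing only model updates can be sketched with a FedAvg-style weighted average. This is a minimal illustration only, not the Cybria implementation: the function name `federated_average`, the flat list representation of model weights, and weighting by local sample count are all assumptions made for the example.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Combine client weight vectors, weighting each by its local sample count.

    Hypothetical helper: only the weight vectors and sample counts leave
    the clients; the raw training data never does.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share one model shape")
        # Each client contributes in proportion to how much data it trained on.
        for i, w in enumerate(weights):
            averaged[i] += w * size / total
    return averaged
```

A server would call this once per round with the vectors returned by the participating clients, then send the averaged vector back as the new global model.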

Simulated Results
- Cybria's federated model achieves 89.6% accuracy for intrusion detection compared to 81.4% for a centralized DNN
- The federated approach shows 8-10% better performance, demonstrating benefits of collaborative yet decentralized learning
- Local models allow specialized learning tuned to each client's data characteristics

Conclusion
- Preliminary results validate the potential of federated learning to enhance cyber threat detection accuracy in a privacy-preserving manner
- Detailed studies are needed to optimize model architectures, hyperparameters, and federation strategies for large real-world deployments
- The approach helps enable an ecosystem for collective security knowledge without increasing data centralization risks

References

The implementation follows the details provided in the original research paper: P. Thantharate and A. T, "CYBRIA - Pioneering Federated Learning for Privacy-Aware Cybersecurity with Brilliance," 2023 IEEE 20th International Conference on Smart Communities: Improving Quality of Life using AI, Robotics and IoT (HONET), Boca Raton, FL, USA, 2023, pp. 56-61, doi: 10.1109/HONET59747.2023.10374608.

Any additional external libraries or sources used would be properly cited.

Tags - Federated learning, privacy-preserving machine learning, collaborative cyber threat detection, decentralized model training, intermediate model updates, integrated global model, cybersecurity, data privacy, distributed computing, secure aggregation, model personalization, adversarial attacks, anomaly detection, network traffic analysis, malware classification, intrusion prevention, threat intelligence, edge computing, data minimization, differential privacy.
Distributed peer review anonymized dataset
kaggle.com
zip
Updated May 5, 2021
Cite
Sirisha Siri (2021). Distributed peer review anonymized dataset [Dataset]. https://www.kaggle.com/ishadss/distributed-peer-review-anonymized-dataset
Explore at:
Available download formats: zip (30930 bytes)
Dataset updated
May 5, 2021
Authors
Sirisha Siri
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Context

While ancient scientists often had patrons to fund their work, peer review of proposals for the allocation of resources is a foundation of modern science.

Content

This is the anonymized dataset obtained from the DPR Experiment run at ESO in Fall 2018.

Acknowledgements

Previous work is available at doi:10.1038/s41550-020-1038-y
Datasets for federated learning
kaggle.com
zip
Updated Dec 29, 2022
Cite
wonghoitin (2022). Datasets for federated learning [Dataset]. https://www.kaggle.com/wonghoitin/datasets-for-federated-learning
Explore at:
Available download formats: zip (30618359 bytes)
Dataset updated
Dec 29, 2022
Authors
wonghoitin
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Federated learning aims to build machine learning models from data sets that are distributed across multiple devices while preventing data leakage (Q. Yang et al., 2019).

source:

smoking https://www.kaggle.com/datasets/kukuroo3/body-signal-of-smoking license = CC0: Public Domain

heart https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset license = CC0: Public Domain

water https://www.kaggle.com/datasets/adityakadiwal/water-potability license = CC0: Public Domain

customer https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis license = CC0: Public Domain

insurance https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data license = CC0: Public Domain

credit https://www.kaggle.com/datasets/ajay1735/hmeq-data license = CC0: Public Domain

income https://www.kaggle.com/datasets/mastmustu/income license = CC0: Public Domain

machine https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification license = CC0: Public Domain

skin https://www.kaggle.com/datasets/saurabhshahane/lumpy-skin-disease-dataset license = Attribution 4.0 International (CC BY 4.0)

score https://www.kaggle.com/datasets/parisrohan/credit-score-classification?select=train.csv license = CC0: Public Domain
Edelweiss Image Dataset
kaggle.com
zip
Updated Jun 19, 2022
Cite
Fransiscus Rolanda Malau (2022). Edelweiss Image Dataset [Dataset]. https://www.kaggle.com/datasets/ndomalau/edelweis-flower
Explore at:
Available download formats: zip (12912266177 bytes)
Dataset updated
Jun 19, 2022
Authors
Fransiscus Rolanda Malau
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Context

Image classification is one of the fundamental tasks in computer vision and machine learning. High-quality datasets are crucial for training robust models that can accurately identify different species. This dataset focuses on three distinct species commonly found in mountainous regions, providing a balanced collection of images for both training and evaluation purposes.

Content

This dataset contains 4,550 high-quality images distributed across three categories:
- Training set: 3,500 images (approximately 1,167 images per class)
- Test set: 1,050 images (350 images per class)

The dataset is organized in a structured format with separate directories for:
1. Anaphalis Javanica
2. Leontopodium Alpinum
3. Leucogenes Grandiceps

Each image in the dataset has been carefully prepared to ensure consistency and quality for machine learning applications. The balanced distribution between classes helps prevent bias during model training.

Applications

Species classification and identification

Computer vision model development

Educational purposes in botany and biodiversity studies

Benchmarking machine learning algorithms

The dataset's clean split between training and test sets makes it ideal for developing and evaluating classification models while following machine learning best practices.
SpaceNet: A Comprehensive Astronomical Dataset
kaggle.com
zip
Updated Aug 30, 2024
Cite
Raza Imam (2024). SpaceNet: A Comprehensive Astronomical Dataset [Dataset]. https://www.kaggle.com/datasets/razaimam45/spacenet-an-optimally-distributed-astronomy-data
Explore at:
Available download formats: zip (56552989870 bytes)
Dataset updated
Aug 30, 2024
Authors
Raza Imam
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
Description:

SpaceNet, obtained via FLARE, a novel double-stage augmentation framework (https://arxiv.org/pdf/2405.13267), is a hierarchically structured, high-quality astronomical image dataset designed for fine-grained and macro classification tasks. Comprising approximately 12,900 samples, SpaceNet integrates lower-resolution (LR) to higher-resolution (HR) conversion with standard augmentations and a diffusion approach for synthetic sample generation. This dataset enables superior generalization on various recognition tasks like classification.

Dataset Structure

Fine-Grained Classes: 8 classes including planets, galaxies, asteroids, nebulae, comets, black holes, stars, and constellations.

Dataset Composition

Total Samples: Approximately 12,900 images.

Fine-Grained Class Distribution:
- Asteroid: 283 files
- Black Hole: 656 files
- Comet: 416 files
- Constellation: 1,552 files
- Galaxy: 3,984 files
- Nebula: 1,192 files
- Planet: 1,472 files
- Star: 3,269 files

Usage

SpaceNet is suitable for:

Training and evaluating machine learning models on fine-grained and macro astronomical classification tasks.

Research on hierarchical classification approaches in the astronomy domain.

Developing robust models that generalize well across in-domain and out-of-domain datasets.

Citation

If you use SpaceNet in your research, please cite it as follows:

@misc{alamimam2024flare,
  title={FLARE up your data: Diffusion-based Augmentation Method in Astronomical Imaging},
  author={Mohammed Talha Alam and Raza Imam and Mohsen Guizani and Fakhri Karray},
  year={2024},
  eprint={2405.13267},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
Distributed Digital Learning Student Dataset
kaggle.com
zip
Updated Nov 10, 2025
Cite
zyan1999 (2025). Distributed Digital Learning Student Dataset [Dataset]. https://www.kaggle.com/datasets/zyan1999/distributed-digital-learning-student-dataset
Explore at:
Available download formats: zip (48163 bytes)
Dataset updated
Nov 10, 2025
Authors
zyan1999
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset consists of 2,500 student records collected from multiple institutions, capturing demographic information, learning habits, and engagement metrics. Each record includes features such as age, gender, study hours per week, attendance rate, assignment and quiz scores, participation score, internet access quality, and frequency of resource usage. The target column, final_grade, categorizes student performance as High, Medium, or Low. Designed to support research on distributed digital learning systems, this dataset enables analysis of multi-institutional collaboration, personalized learning, and performance prediction while preserving student and institutional privacy.

Column Description:

student_id: A unique identifier for each student.

institution_id: The institution or organization to which the student belongs.

age: The student’s age in years.

gender: The student’s gender (Male, Female, or Other).

study_hours_per_week: Average number of hours the student spends studying weekly.

attendance_rate: Percentage of classes attended by the student.

assignment_score: Average score obtained by the student on assignments (0–100).

quiz_score: Average score obtained by the student on quizzes (0–100).

participation_score: Level of engagement in class discussions or activities (0–100).

internet_access_quality: Rating of the student’s internet connection quality (1–5).

resource_access_frequency: Number of times the student accesses learning resources per week.

final_grade: Overall performance category of the student (High, Medium, or Low).
Student Performance Dataset
kaggle.com
Updated Aug 27, 2025
Cite
Ghulam Muhammad Nabeel (2025). Student Performance Dataset [Dataset]. https://www.kaggle.com/datasets/nabeelqureshitiii/student-performance-dataset
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Aug 27, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ghulam Muhammad Nabeel
Description
📊 Student Performance Dataset (Synthetic, Realistic)

Overview

This dataset contains 1,000,000 rows of realistic student performance data, designed for beginners in Machine Learning to practice Linear Regression, model training, and evaluation techniques.

Each row represents one student with features like study hours, attendance, class participation, and final score.
The dataset is small, clean, and structured to be beginner-friendly.

🔑 Columns Description

student_id → Unique identifier for each student.

weekly_self_study_hours → Average weekly self-study hours (0–40). Generated using a normal distribution centered around 15 hours.

attendance_percentage → Attendance percentage (50–100). Simulated with a normal distribution around 85%.

class_participation → Score between 0–10 indicating how actively the student participates in class. Generated from a normal distribution centered around 6.

total_score → Final performance score (0–100). Calculated as a function of study hours + random noise, then clipped between 0–100. Stronger correlation with study hours.

grade → Categorical label (A, B, C, D, F) derived from total_score.

📐 Data Generation Logic

Weekly Study Hours: Modeled using a normal distribution (mean ≈ 15, std ≈ 7), capped between 0 and 40 hours.

Scores: More study hours → higher score. Formula:

Random noise simulates differences in learning ability, motivation, etc.

Attendance & Participation: Independent but realistic variations added.

Grades: Assigned from scores using thresholds:

A: ≥ 85

B: ≥ 70

C: ≥ 55

D: ≥ 40

F: < 40
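The threshold rule above can be written as a short function. This is a minimal sketch; the name `grade_from_score` is hypothetical, and only the cutoffs (85, 70, 55, 40) come from the description.

```python
def grade_from_score(total_score: float) -> str:
    """Map a total_score in [0, 100] to a letter grade using the listed cutoffs."""
    # Cutoffs are checked from highest to lowest, so the first match wins.
    thresholds = [(85, "A"), (70, "B"), (55, "C"), (40, "D")]
    for cutoff, letter in thresholds:
        if total_score >= cutoff:
            return letter
    return "F"  # anything below 40
```

Applying it to the total_score column should reproduce the grade column if the dataset follows the stated thresholds exactly.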

🎯 How to Use This Dataset

Regression Tasks

Predict total_score from weekly_self_study_hours.

Train and evaluate Linear Regression models.

Extend to multiple regression using attendance_percentage and class_participation.

Classification Tasks

Predict grade (A–F) using study hours, attendance, and participation.

Model Evaluation Practice

Apply train-test split and cross-validation.

Evaluate with MAE, RMSE, R².

Compare simple vs. multiple regression.
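For reference, the three evaluation metrics named above can be computed directly from their definitions. The sketch below uses only the standard library; the function names are illustrative, and a library such as scikit-learn would normally be used instead.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average of |y - y_hat|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: square root of the average squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot
```

MAE and RMSE are in the same units as total_score, while R² is unitless, which makes it the easier of the three to compare between the simple and multiple regression models.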

✅ This dataset is intentionally kept simple, so that new ML learners can clearly see the relationship between input features (study, attendance, participation) and output (score/grade).
Garbage Dataset
kaggle.com
zip
Updated Dec 12, 2024
Cite
Suman Kunwar (2024). Garbage Dataset [Dataset]. https://www.kaggle.com/datasets/sumn2u/garbage-classification-v2
Explore at:
Available download formats: zip (780289207 bytes)
Dataset updated
Dec 12, 2024
Authors
Suman Kunwar
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
This dataset contains images of garbage items categorized into 10 classes, designed for machine learning and computer vision projects focusing on recycling and waste management. It is ideal for building classification or object detection models or developing AI-powered solutions for sustainable waste disposal.

Dataset Summary

The dataset features 10 distinct classes of garbage with a total of 19,762 images, distributed as follows:

Metal: 1020

Glass: 3061

Biological: 997

Paper: 1680

Battery: 944

Trash: 947

Cardboard: 1825

Shoes: 1977

Clothes: 5327

Plastic: 1984

Key Features
- Diverse Categories: Covers common household waste items for a wide range of applications.
- Balanced Distribution: Each class is sufficiently populated, ensuring robust model training.
- High-Quality Images: Clear and well-annotated images for better performance in computer vision tasks.
- Real-World Applications: Ideal for building recycling solutions, waste segregation apps, and educational tools.

Academic Reference

The dataset was featured in the research paper, "Managing Household Waste Through Transfer Learning", showcasing its utility in real-world applications. Researchers and developers can replicate or extend the experiments for further studies.

Applications
- AI for Sustainability: Train AI models to classify garbage and promote automated waste management.
- Recycling Programs: Build systems to sort garbage into recyclable and non-recyclable materials.
- Environmental Education: Develop tools to teach kids and adults about proper waste disposal.

Feedback

Thank you for your interest in our waste dataset. Whether you have used the dataset or are considering its use, your feedback is crucial to help us understand your needs and improve the dataset. Please take a few minutes to share your thoughts and experiences through this feedback form. Your input is greatly appreciated.

We also welcome feedback and contributions to our project on GitHub. Your suggestions and collaboration can help us enhance the dataset and improve the model's performance. Let's work together to make a positive difference!
avila_dataset
kaggle.com
zip
Updated May 10, 2022
Cite
HRITABAN GHOSH (2022). avila_dataset [Dataset]. https://www.kaggle.com/datasets/hritaban02/avila-dataset
Explore at:
Available download formats: zip (604026 bytes)
Dataset updated
May 10, 2022
Authors
HRITABAN GHOSH
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is made from the Avila dataset obtained from the UCI Machine Learning Repository. Here is the description of the data from the above source:

Data Set Information:

Data have been normalized by using the Z-normalization method and divided into two data sets: a training set containing 10430 samples, and a test set containing 10437 samples.
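Z-normalization rescales each feature to zero mean and unit standard deviation, z = (x - mean) / std. The sketch below shows this for one feature column. The function name is hypothetical, and the use of the population standard deviation is an assumption, since the description does not say which variant was applied.

```python
import statistics
from typing import List, Sequence


def z_normalize(values: Sequence[float]) -> List[float]:
    """Return (x - mean) / std for each x, using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population, not sample, std
    if std == 0:
        raise ValueError("cannot z-normalize a constant column")
    return [(v - mean) / std for v in values]
```

When normalizing new data yourself, the usual practice is to compute the mean and standard deviation on the training set only and reuse them for the test set.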

CLASS DISTRIBUTION (training set)
- A: 4286
- B: 5
- C: 103
- D: 352
- E: 1095
- F: 1961
- G: 446
- H: 519
- I: 831
- W: 44
- X: 522
- Y: 266

Attribute Information:

- F1: intercolumnar distance
- F2: upper margin
- F3: lower margin
- F4: exploitation
- F5: row number
- F6: modular ratio
- F7: interlinear spacing
- F8: weight
- F9: peak number
- F10: modular ratio / interlinear spacing
- Class: A, B, C, D, E, F, G, H, I, W, X, Y
Dog vs Cat
kaggle.com
Updated Sep 28, 2024
Cite
AnthonyTherrien (2024). Dog vs Cat [Dataset]. http://doi.org/10.34740/kaggle/dsv/9498291
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/9498291
Dataset updated
Sep 28, 2024
Dataset provided by
Kaggle
Authors
AnthonyTherrien
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
Overview

This dataset contains a total of 1000 images, split equally between 500 images of dogs and 500 images of cats. The images are standardized to a resolution of 512x512 pixels.

Details

Total Images: 1000

Dog: 500 images

Cat: 500 images

Image Resolution: 512x512 pixels

File Format: .png

Source: Images generated using Stable Diffusion 1.5

Usage

This dataset is ideal for tasks such as:
- Binary classification
- Image recognition and processing
- Machine learning and deep learning model training
Car vs Bike Classification Dataset
kaggle.com
gts.ai
zip
Updated Oct 28, 2022
Cite
DeepNets (2022). Car vs Bike Classification Dataset [Dataset]. https://www.kaggle.com/datasets/utkarshsaxenadn/car-vs-bike-classification-dataset/code
Explore at:
Available download formats: zip (107824115 bytes)
Dataset updated
Oct 28, 2022
Authors
DeepNets
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This data set is a collection of 2,000 bike and car images. The collection was assembled to include all types of bikes and cars, because both classes have high intra-class variety. That variety makes the task somewhat harder, since the model must also learn the many different kinds of bikes and cars. But if your model can learn the basic structure of a car and a bike, it will be able to distinguish between the two classes.

The data is not preprocessed. This is done intentionally so that you can apply the augmentations you want to use. Almost all the 2000 images are unique. So after applying some data augmentation, you can increase the size of the data set.

The data is not split into training and validation subsets, but you can easily do so using an image data generator from Keras. The preprocessing steps are available in my notebook associated with this data set. You can practice your computer vision skills using this data set. This is a binary classification task.
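As an alternative to the Keras route mentioned above, a training/validation split can be made on the list of image paths before any loading. The sketch below is a standard-library stand-in, not the author's notebook code; the function name, the 20% validation fraction, and the fixed seed are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def train_val_split(
    items: Sequence[str],
    val_fraction: float = 0.2,  # assumed default, not from the dataset page
    seed: int = 42,
) -> Tuple[List[str], List[str]]:
    """Shuffle items reproducibly and return (train, validation) lists."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(items)
    # A private Random instance keeps the split stable across runs.
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

To keep the car/bike ratio identical in both subsets, the same function can be applied to each class folder separately and the results concatenated.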
Phishing URL Content Dataset
kaggle.com
zip
Updated Nov 25, 2024
Cite
Aaditey Pillai (2024). Phishing URL Content Dataset [Dataset]. https://www.kaggle.com/datasets/aaditeypillai/phishing-website-content-dataset
Explore at:
Available download formats: zip (62701 bytes)
Dataset updated
Nov 25, 2024
Authors
Aaditey Pillai
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Phishing URL Content Dataset

Executive Summary

Motivation:
Phishing attacks are one of the most significant cyber threats in today’s digital era, tricking users into divulging sensitive information like passwords, credit card numbers, and personal details. This dataset aims to support research and development of machine learning models that can classify URLs as phishing or benign.

Applications:
- Building robust phishing detection systems.
- Enhancing security measures in email filtering and web browsing.
- Training cybersecurity practitioners in identifying malicious URLs.

The dataset contains diverse features extracted from URL structures, HTML content, and website metadata, enabling deep insights into phishing behavior patterns.

Description of Data

This dataset comprises two types of URLs:
1. Phishing URLs: Malicious URLs designed to deceive users.
2. Benign URLs: Legitimate URLs posing no harm to users.

Key Features:
- URL-based features: Domain, protocol type (HTTP/HTTPS), and IP-based links.
- Content-based features: Link density, iframe presence, external/internal links, and metadata.
- Certificate-based features: SSL/TLS details like validity period and organization.
- WHOIS data: Registration details like creation and expiration dates.

Statistics:
- Total Samples: 800 (400 phishing, 400 benign).
- Features: 22 including URL, domain, link density, and SSL attributes.

Power Analysis

To ensure statistical reliability, a power analysis was conducted to determine the minimum sample size required for binary classification with 22 features. Using a medium effect size (0.15), alpha = 0.05, and power = 0.80, the analysis indicated a minimum sample size of ~325 per class. Our dataset exceeds this requirement with 400 examples per class, ensuring robust model training.

Exploratory Data Analysis (EDA)

Insights from EDA:
- Distribution Plots: Histograms and density plots for numerical features like link density, URL length, and iframe counts.
- Bar Plots: Class distribution and protocol usage trends.
- Correlation Heatmap: Highlights relationships between numerical features to identify multicollinearity or strong patterns.
- Box Plots: For SSL certificate validity and URL lengths, comparing phishing versus benign URLs.

EDA visualizations are provided in the repository.

Link to Publicly Available Data and Code

Dataset: Phishing URL Dataset

Code Repository: GitHub - Phishing Detection

The repository contains the Python code used to extract features, conduct EDA, and build the dataset.

Ethics Statement

Phishing detection datasets must balance the need for security research with the risk of misuse. This dataset:
1. Protects User Privacy: No personally identifiable information is included.
2. Promotes Ethical Use: Intended solely for academic and research purposes.
3. Avoids Reinforcement of Bias: Balanced class distribution ensures fairness in training models.

Risks:
- Misuse of the dataset for creating more deceptive phishing attacks.
- Over-reliance on outdated features as phishing tactics evolve.

Researchers are encouraged to pair this dataset with continuous updates and contextual studies of real-world phishing.

Open Source License

This dataset is shared under the MIT License, allowing free use, modification, and distribution for academic and non-commercial purposes. License details can be found here.
Federated Health Records Dataset
kaggle.com
zip
Updated May 15, 2025
Cite
Ziya (2025). Federated Health Records Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/federated-health-records-dataset
Explore at:
Available download formats: zip (67310 bytes)
Dataset updated
May 15, 2025
Authors
Ziya
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset, titled "Federated Health Records for Privacy-Preserving AI Research," is a healthcare dataset designed to support research and experimentation in Federated Learning (FL) and Homomorphic Encryption (HE) for secure artificial intelligence applications.

Each record represents a simulated patient's health profile, including key features such as age, BMI, blood pressure, glucose and insulin levels, physical activity, and diet quality. The dataset is partitioned by client_id, simulating data distributed across multiple hospitals or mobile devices, where direct data sharing is restricted due to privacy concerns.

The target variable, risk_of_diabetes, is a binary indicator derived from a logistic function applied to health metrics, helping researchers model classification tasks in a privacy-aware environment.

💡 Key Features

Federated-ready: Labeled by client_id to simulate decentralized data sources.

Privacy-focused: Supports homomorphic encryption-based model updates.

Flexible use: Suitable for classification, secure model aggregation, and robustness testing.
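Because records are labeled by client_id, a federated simulation typically starts by grouping rows into one shard per client. A minimal sketch follows, assuming each record is a dict-like row; `client_id` and `risk_of_diabetes` are named in the description, while the function name and the exact spelling of other columns such as `age` are illustrative.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def partition_by_client(
    records: Iterable[Mapping[str, Any]],
    key: str = "client_id",
) -> Dict[Any, List[Mapping[str, Any]]]:
    """Group rows into per-client shards so each simulated site sees only its own data."""
    shards: Dict[Any, List[Mapping[str, Any]]] = defaultdict(list)
    for record in records:
        shards[record[key]].append(record)
    return dict(shards)


# Hypothetical rows for illustration; real values come from the dataset file.
example_rows = [
    {"client_id": 1, "age": 54, "risk_of_diabetes": 1},
    {"client_id": 2, "age": 31, "risk_of_diabetes": 0},
    {"client_id": 1, "age": 47, "risk_of_diabetes": 0},
]
example_shards = partition_by_client(example_rows)
```

Each shard can then be handed to a separate local training routine, with only model updates returned for aggregation.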
dataset of SMC
kaggle.com
zip
Updated May 24, 2024
Cite
Tan jingwen999 (2024). dataset of SMC [Dataset]. https://www.kaggle.com/datasets/tanjingwen999/dataset-of-smc
Explore at:
Available download formats: zip (300604211 bytes)
Dataset updated
May 24, 2024
Authors
Tan jingwen999
Description
This is the dataset used in "An Adaptability-Enhanced Few-Shot Website Fingerprinting Attack Based on Collusion", which consists of the following four datasets.

[1] V. Rimmer, D. Preuveneers, M. Juarez, T. Van Goethem, and W. Joosen, “Automated website fingerprinting through deep learning,” Network and Distributed System Security Symposium, 2017.
[2] P. Sirinam, M. Imani, M. Juarez, and M. Wright, “Deep fingerprinting: Undermining website fingerprinting defenses with deep learning,” in Proceedings of the 2018 ACM Conference on Computer and Communications Security, 2018, pp. 1928–1943.
[3] T. Wang, X. Cai, R. Nithyanand, R. Johnson, and I. Goldberg, “Effective attacks and provable defenses for website fingerprinting,” in 23rd USENIX Security Symposium, 2014, pp. 143–157.
[4] J. Gong and T. Wang, “Zero-delay lightweight defenses against website fingerprinting,” in 29th USENIX Security Symposium, 2020, pp. 717–734.
Software Engineering Interview Questions Dataset
kaggle.com
Updated Dec 21, 2023
Cite
syedmharis (2023). Software Engineering Interview Questions Dataset [Dataset]. https://www.kaggle.com/datasets/syedmharis/software-engineering-interview-questions-dataset
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Dec 21, 2023
Dataset provided by
Kaggle
Authors
syedmharis
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Comprehensive Software Engineering Interview Questions Dataset

Overview

This dataset is an extensive collection of software engineering interview questions, designed to mirror the complexity and depth of questions asked in interviews at top tech companies, including FAANG (Facebook, Amazon, Apple, Netflix, Google). It covers a wide range of topics, from algorithms and data structures to system design and machine learning. The dataset is curated to help candidates prepare for technical interviews and to give educators and interviewers a resource for assessing technical skills.

Dataset Details

Number of Questions: 250

Categories Covered: Algorithms, System Design, Machine Learning, Data Structures, Distributed Systems, Networking, Low-level Systems, Security, Database Systems, Artificial Intelligence, Data Engineering.

Difficulty Level: Primarily Hard.

Format: The dataset is structured in a tabular format with columns for Question Number, Question, Brief Answer, Category, and Difficulty.

Usage Scenarios: Interview preparation for candidates, educational resource for learning advanced software engineering concepts, tool for interviewers to structure technical assessments.

Potential Analysis

Users can perform various analyses, such as:

Category-wise Distribution: Understand the focus areas in software engineering roles.

Difficulty Analysis: Gauge the complexity level of questions typically asked in high-end tech interviews.

Trend Analysis: Identify trends in technical questions over recent years, especially in rapidly evolving fields like Machine Learning and AI.

Inspiration

This dataset is intended to inspire:

Job Candidates: To prepare comprehensively for technical interviews.

Educators: To structure curriculum or coursework around practical, interview-oriented learning.

Researchers: To analyze trends in technical interviews and skill requirements in the tech industry.

Interviewers/Hiring Managers: To formulate effective interview strategies and questionnaires.
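
The category-wise distribution described above could be computed along the following lines. This is only a minimal sketch: it assumes the table is available as CSV with the column names listed under Format, and the sample rows are placeholders written for illustration, not taken from the dataset. The actual file name and exact header spelling are not stated in the description.

```python
import csv
import io
from collections import Counter

# Placeholder rows for illustration only; the header spelling is an assumption
# based on the column names given in the dataset description.
SAMPLE_CSV = """Question Number,Question,Brief Answer,Category,Difficulty
1,Placeholder question A,Placeholder answer,Algorithms,Hard
2,Placeholder question B,Placeholder answer,System Design,Hard
3,Placeholder question C,Placeholder answer,Algorithms,Medium
"""


def category_counts(csv_file):
    """Count how many questions fall under each value of the Category column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Category"].strip() for row in reader)


counts = category_counts(io.StringIO(SAMPLE_CSV))
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

To run this against the real data, one would pass an open file handle for the downloaded CSV instead of the in-memory sample; the same function works for the Difficulty column if the key is changed.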
Student Learning Methods: A Survey
kaggle.com
zip
Updated Apr 3, 2025
Cite
Praneth P (2025). Student Learning Methods: A Survey [Dataset]. https://www.kaggle.com/datasets/pranethp/student-learning-methods-a-survey
Explore at:
Available download formats: zip (20219 bytes)
Dataset updated
Apr 3, 2025
Authors
Praneth P
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Student Learning Methods: A Survey - Dataset Description

The Student Learning Methods: A Survey dataset comprises responses from 100 university students, with 10 participants surveyed from each of 10 different universities. This dataset explores students' preferences and evaluations of various learning methods based on effectiveness and engagement.

Key Features of the Dataset:

Survey Scope:

Responses collected from 100 students.

Participants evenly distributed across 10 different universities (10 students per university).

Learning Methods Evaluated:

The dataset includes ratings for various learning techniques, such as:

Lectures – Traditional classroom-based teaching.

Case Studies – Analyzing real-world scenarios to understand concepts.

Group Projects – Collaborative assignments involving multiple students.

Experiments – Hands-on practical work in labs or controlled settings.

Online Tutorials – Digital or video-based instructional materials.

Evaluation Criteria:

Each learning method is rated on a numerical scale based on:

Effectiveness – How well students believe the method helps in learning.

Engagement – How interesting or interactive the method is perceived to be.

Secondary Evaluations:

The dataset includes repeated columns for learning methods, potentially representing:

Post-survey reflections where students reassessed their initial responses.

Comparative evaluations of different methods after exposure to multiple approaches.

Overall Effectiveness and Engagement Scores:

Each student provides aggregate scores summarizing how useful and engaging they found different learning methods overall.

Potential Use Cases:

Educational Research – Understanding which teaching techniques are most effective across universities.

Curriculum Development – Helping educators refine teaching strategies.

Student-Centric Learning Models – Identifying preferred methods to enhance student engagement.

Comparative Analysis – Examining how student preferences vary across universities.

The surveyed universities include:

Delhi University (DU) – A large central university in Delhi.

Jawaharlal Nehru University (JNU) – A well-known research-focused university in Delhi.

Banaras Hindu University (BHU) – A prestigious university in Varanasi, Uttar Pradesh.

Aligarh Muslim University (AMU) – A renowned university in Aligarh, Uttar Pradesh.

Chandigarh University – A fast-growing private university in Punjab.

Kurukshetra University – A public university in Haryana.

Himachal Pradesh University (HPU) – A state university in Shimla, Himachal Pradesh.

Guru Gobind Singh Indraprastha University (GGSIPU) – A Delhi-based state university.

Dr. B. R. Ambedkar University, Agra – A public university in Uttar Pradesh.

Uttarakhand Technical University (UTU) – A state technical university in Uttarakhand.

This dataset offers valuable insights into student learning preferences, enabling researchers and educators to tailor teaching methods for maximum impact.

Recently Updated Version
ImageNet-R
kaggle.com
Updated May 5, 2025
Cite
my1nonly (2025). ImageNet-R [Dataset]. https://www.kaggle.com/datasets/my1nonly/imagenet-r
Explore at:
Croissant
Dataset updated
May 5, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
my1nonly
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
ImageNet-R (ImageNet-Renditions) is a variant of the original ImageNet dataset designed to evaluate the robustness and generalization of image classification models to out-of-distribution (OOD) data. It contains 30,000 images corresponding to 200 classes from ImageNet, but instead of natural photographs, the images are renditions—such as sketches, paintings, cartoons, embroidery, and clay sculptures—that significantly differ in texture and appearance from the original training distribution.

ImageNet-R serves as a benchmark for assessing how well models trained on standard ImageNet data perform when exposed to domain-shifted inputs, especially those involving non-naturalistic visual styles. It highlights the tendency of many deep learning models to rely heavily on texture rather than shape cues, thus revealing potential brittleness in real-world deployment scenarios.
Sin_captcha_images
kaggle.com
zip
Updated May 7, 2025
Cite
_SindiK_ (2025). Sin_captcha_images [Dataset]. https://www.kaggle.com/datasets/sindik/sin-captcha-images/code
Explore at:
Available download formats: zip (12334116 bytes)
Dataset updated
May 7, 2025
Authors
_SindiK_
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
Dataset Description

This dataset is intended for training and testing machine learning models for CAPTCHA recognition. It contains CAPTCHA images along with corresponding filenames, which represent the text displayed on the image.

Dataset Content

The dataset includes more than 1,500 CAPTCHA images, each stored under a filename that matches the text shown in the image. The images are provided in standard formats.

Purpose

The goal of this dataset is to provide data for training machine learning models, including neural networks, for solving CAPTCHA recognition tasks. This dataset can be used for classification tasks and optical character recognition (OCR) challenges.

Dataset Structure

Images: Each file is a CAPTCHA image.

Filenames: Each image has a corresponding filename that represents the correct answer (the text shown in the CAPTCHA).

Example

Image: abc123.png

Answer: abc123
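
The filename-to-label convention shown in the example can be sketched in a few lines. This is an illustrative sketch, not code shipped with the dataset: the helper names and the list of accepted image suffixes are assumptions, since the description only says the images come in standard formats.

```python
from pathlib import Path


def label_from_filename(path):
    """Return the CAPTCHA text encoded in an image file name (abc123.png -> abc123)."""
    return Path(path).stem


def load_labels(image_dir, suffixes=(".png", ".jpg", ".jpeg")):
    """Map each image path in image_dir to its label.

    The suffix list is an assumption; adjust it to the formats actually present.
    """
    return {
        p: label_from_filename(p)
        for p in sorted(Path(image_dir).iterdir())
        if p.suffix.lower() in suffixes
    }


print(label_from_filename("abc123.png"))  # abc123
```

Calling load_labels on the extracted image folder would give the (image, answer) pairs needed for a classification or OCR training loop.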

License

This dataset is distributed under the GPL-2 license. The GPL-2 license allows for the use, distribution, and modification of the dataset, on the condition that derivative works are also distributed under the GPL-2 license. Anyone who distributes a modified version must also make the corresponding source available.

Data Sources and Credits

This dataset is based in part on CAPTCHA Data by alizahidraja. The original images from that dataset were used as a foundation, and additional custom CAPTCHA images have been added to expand and diversify the dataset.

This combination aims to provide a richer and more varied training set for machine learning models focused on CAPTCHA recognition.

Important Notes

When using CAPTCHA images generated by third-party services, ensure that you are not infringing on any copyrights and comply with all legal requirements.

This dataset is suitable for research and educational purposes. It can also be used for solving challenges in artificial intelligence and machine learning related to text recognition in images.
