This dataset was created by Camelia Ben Laamari.
Add this dataset to your notebook, then execute the following command in a new cell:
!cd ../input/nvidia-apex && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch.
See https://github.com/NVIDIA/apex/blob/master/README.md (as of 14 April 2020).
NVIDIA Apex. Photo by Cas Magee on Unsplash.
License: MIT (https://opensource.org/licenses/MIT). License information was derived automatically.
**CYBRIA - Pioneering Federated Learning for Privacy-Aware Cybersecurity with Brilliance** This research studies a federated learning framework for collaborative cyber threat detection without compromising confidential data. The decentralized approach trains models on local data distributed across clients and shares only intermediate model updates to generate an integrated global model.
**If you use this dataset and code, or any modified part of it, in any publication, please cite this paper:** P. Thantharate and A. T, "CYBRIA - Pioneering Federated Learning for Privacy-Aware Cybersecurity with Brilliance," 2023 IEEE 20th International Conference on Smart Communities: Improving Quality of Life using AI, Robotics and IoT (HONET), Boca Raton, FL, USA, 2023, pp. 56-61, doi: 10.1109/HONET59747.2023.10374608.
For any questions and research queries - please reach out via Email.
Key Objectives
- Develop a federated learning framework called CYBRIA for collaborative cyber threat detection without compromising confidential data
- Evaluate model performance for intrusion detection using the Bot-IoT dataset
Proposed Solutions
- Designed a privacy-preserving federated learning architecture tailored for cybersecurity applications
- Implemented the CYBRIA model using the TensorFlow Federated and Flower libraries
- Employed a decentralized approach where models are trained locally on clients and only model updates are shared
Simulated Results
- CYBRIA's federated model achieves 89.6% accuracy for intrusion detection, compared to 81.4% for a centralized DNN
- The federated approach shows 8-10% better performance, demonstrating the benefits of collaborative yet decentralized learning
- Local models allow specialized learning tuned to each client's data characteristics
Conclusion
- Preliminary results validate the potential of federated learning to enhance cyber threat detection accuracy in a privacy-preserving manner
- Detailed studies are needed to optimize model architectures, hyperparameters, and federation strategies for large real-world deployments
- The approach helps enable an ecosystem for collective security knowledge without increasing data centralization risks
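The decentralized training loop described above can be sketched with a minimal federated-averaging (FedAvg) step. This is an illustrative sketch in plain Python, not the paper's actual TensorFlow Federated/Flower implementation; the client weights below are made up:

```python
# Minimal federated-averaging (FedAvg) sketch: each client trains locally
# and shares only its model weights; the server averages them into a
# global model. Weights are represented here as flat lists of floats.

def fedavg(client_weights, client_sizes):
    """Weighted average of client model updates, weighted by local data size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    global_weights = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            global_weights[i] += w * (size / total)
    return global_weights

# Three hypothetical clients report their locally trained weights.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [100, 100, 200]  # local dataset sizes
print(fedavg(clients, sizes))  # pulled toward the larger client's weights
```

Only the aggregated weights leave the server loop; the raw local data never does, which is the privacy property the paper relies on.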
References
The implementation follows the details provided in the original research paper: P. Thantharate and A. T,
"CYBRIA - Pioneering Federated Learning for Privacy-Aware Cybersecurity with Brilliance," 2023 IEEE 20th International Conference on Smart Communities: Improving Quality of Life using AI, Robotics and IoT (HONET), Boca Raton, FL, USA, 2023, pp. 56-61, doi: 10.1109/HONET59747.2023.10374608.
Any additional external libraries or sources used would be properly cited.
Tags - Federated learning, privacy-preserving machine learning, collaborative cyber threat detection, decentralized model training, intermediate model updates, integrated global model, cybersecurity, data privacy, distributed computing, secure aggregation, model personalization, adversarial attacks, anomaly detection, network traffic analysis, malware classification, intrusion prevention, threat intelligence, edge computing, data minimization, differential privacy.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
While ancient scientists often had patrons to fund their work, peer review of proposals for the allocation of resources is a foundation of modern science.
This is the anonymized dataset obtained from the DPR Experiment run at ESO in Fall 2018.
Previous work is available at doi:10.1038/s41550-020-1038-y.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Federated learning builds machine learning models on data sets that are distributed across multiple devices while preventing data leakage (Q. Yang et al., 2019).
source:
smoking https://www.kaggle.com/datasets/kukuroo3/body-signal-of-smoking license = CC0: Public Domain
heart https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset license = CC0: Public Domain
water https://www.kaggle.com/datasets/adityakadiwal/water-potability license = CC0: Public Domain
customer https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis license = CC0: Public Domain
insurance https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data license = CC0: Public Domain
credit https://www.kaggle.com/datasets/ajay1735/hmeq-data license = CC0: Public Domain
income https://www.kaggle.com/datasets/mastmustu/income license = CC0: Public Domain
machine https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification license = CC0: Public Domain
skin https://www.kaggle.com/datasets/saurabhshahane/lumpy-skin-disease-dataset license = Attribution 4.0 International (CC BY 4.0)
score https://www.kaggle.com/datasets/parisrohan/credit-score-classification?select=train.csv license = CC0: Public Domain
License: MIT (https://opensource.org/licenses/MIT)
Image classification is one of the fundamental tasks in computer vision and machine learning. High-quality datasets are crucial for training robust models that can accurately identify different species. This dataset focuses on three distinct species commonly found in mountainous regions, providing a balanced collection of images for both training and evaluation purposes.
This dataset contains 4,550 high-quality images distributed across three categories: - Training set: 3,500 images (approximately 1,167 images per class) - Test set: 1,050 images (350 images per class)
The dataset is organized in a structured format with separate directories for: 1. Anaphalis Javanica 2. Leontopodium Alpinum 3. Leucogenes Grandiceps
Each image in the dataset has been carefully prepared to ensure consistency and quality for machine learning applications. The balanced distribution between classes helps prevent bias during model training.
The dataset's clean split between training and test sets makes it ideal for developing and evaluating classification models while following machine learning best practices.
License: CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)
SpaceNet, obtained via a novel double-stage augmentation framework, FLARE (https://arxiv.org/pdf/2405.13267), is a hierarchically structured and high-quality astronomical image dataset designed for fine-grained and macro classification tasks. Comprising approximately 12,900 samples, SpaceNet integrates lower-resolution (LR) to higher-resolution (HR) conversion with standard augmentations and a diffusion approach for synthetic sample generation. This dataset enables superior generalization on various recognition tasks like classification.
Total Samples: Approximately 12,900 images.
Fine-Grained Class Distribution:
- Asteroid: 283 files
- Black Hole: 656 files
- Comet: 416 files
- Constellation: 1,552 files
- Galaxy: 3,984 files
- Nebula: 1,192 files
- Planet: 1,472 files
- Star: 3,269 files
SpaceNet is suitable for fine-grained and macro astronomical classification tasks.
If you use SpaceNet in your research, please cite it as follows:
```bibtex
@misc{alamimam2024flare,
  title={FLARE up your data: Diffusion-based Augmentation Method in Astronomical Imaging},
  author={Mohammed Talha Alam and Raza Imam and Mohsen Guizani and Fakhri Karray},
  year={2024},
  eprint={2405.13267},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset consists of 2,500 student records collected from multiple institutions, capturing demographic information, learning habits, and engagement metrics. Each record includes features such as age, gender, study hours per week, attendance rate, assignment and quiz scores, participation score, internet access quality, and frequency of resource usage. The target column, final_grade, categorizes student performance as High, Medium, or Low. Designed to support research on distributed digital learning systems, this dataset enables analysis of multi-institutional collaboration, personalized learning, and performance prediction while preserving student and institutional privacy.
Column Description:
student_id: A unique identifier for each student.
institution_id: The institution or organization to which the student belongs.
age: The student’s age in years.
gender: The student’s gender (Male, Female, or Other).
study_hours_per_week: Average number of hours the student spends studying weekly.
attendance_rate: Percentage of classes attended by the student.
assignment_score: Average score obtained by the student on assignments (0–100).
quiz_score: Average score obtained by the student on quizzes (0–100).
participation_score: Level of engagement in class discussions or activities (0–100).
internet_access_quality: Rating of the student’s internet connection quality (1–5).
resource_access_frequency: Number of times the student accesses learning resources per week.
final_grade: Overall performance category of the student (High, Medium, or Low).
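Because every record carries an institution_id, the dataset can be split into per-client shards for the federated experiments it is designed for. A minimal sketch in plain Python; the field names follow the column list above, and the sample rows are hypothetical:

```python
from collections import defaultdict

def partition_by_institution(records):
    """Group student records into per-institution shards, as a federated
    learning client would see them (no cross-institution data sharing)."""
    shards = defaultdict(list)
    for rec in records:
        shards[rec["institution_id"]].append(rec)
    return dict(shards)

records = [
    {"student_id": 1, "institution_id": "inst_a", "final_grade": "High"},
    {"student_id": 2, "institution_id": "inst_b", "final_grade": "Low"},
    {"student_id": 3, "institution_id": "inst_a", "final_grade": "Medium"},
]
shards = partition_by_institution(records)
print({k: len(v) for k, v in shards.items()})  # {'inst_a': 2, 'inst_b': 1}
```

Each shard can then be handed to a separate simulated client so that only model updates, never the rows themselves, cross institution boundaries.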
This dataset contains 1,000,000 rows of realistic student performance data, designed for beginners in Machine Learning to practice Linear Regression, model training, and evaluation techniques.
Each row represents one student with features like study hours, attendance, class participation, and final score.
The dataset is clean and structured to be beginner-friendly.
Random noise simulates differences in learning ability, motivation, etc.
Regression Tasks
- Predict total_score from weekly_self_study_hours, attendance_percentage, and class_participation.
Classification Tasks
- Predict grade (A–F) using study hours, attendance, and participation.
Model Evaluation Practice
✅ This dataset is intentionally kept simple, so that new ML learners can clearly see the relationship between input features (study, attendance, participation) and output (score/grade).
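For the single-feature regression task above (predicting total_score from weekly_self_study_hours), the least-squares line can be computed by hand before reaching for a library. The column names follow the description above, and the sample data is made up:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var           # slope
    b = mean_y - a * mean_x  # intercept
    return a, b

# Hypothetical (weekly_self_study_hours, total_score) pairs with a linear trend.
hours = [2, 4, 6, 8]
scores = [50, 60, 70, 80]
a, b = fit_line(hours, scores)
print(a, b)  # slope 5.0, intercept 40.0
```

Comparing these hand-computed coefficients against a fitted scikit-learn model is itself a useful evaluation exercise for new learners.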
License: MIT (https://opensource.org/licenses/MIT)
This dataset contains images of garbage items categorized into 10 classes, designed for machine learning and computer vision projects focusing on recycling and waste management. It is ideal for building classification or object detection models or developing AI-powered solutions for sustainable waste disposal.
Dataset Summary
The dataset features 10 distinct classes of garbage with a total of 19,762 images.
Key Features - Diverse Categories: Covers common household waste items for a wide range of applications. - Balanced Distribution: Each class is sufficiently populated, ensuring robust model training. - High-Quality Images: Clear and well-annotated images for better performance in computer vision tasks. - Real-World Applications: Ideal for building recycling solutions, waste segregation apps, and educational tools.
Academic Reference The dataset was featured in the research paper, "Managing Household Waste Through Transfer Learning", showcasing its utility in real-world applications. Researchers and developers can replicate or extend the experiments for further studies.
Applications - AI for Sustainability: Train AI models to classify garbage and promote automated waste management. - Recycling Programs: Build systems to sort garbage into recyclable and non-recyclable materials. - Environmental Education: Develop tools to teach kids and adults about proper waste disposal.
Thank you for your interest in our waste dataset. Whether you have used the dataset or are considering its use, your feedback is crucial to help us understand your needs and improve the dataset. Please take a few minutes to share your thoughts and experiences through this feedback form. Your input is greatly appreciated.
We also welcome feedback and contributions to our project on GitHub. Your suggestions and collaboration can help us enhance the dataset and improve the model's performance. Let's work together to make a positive difference!
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset is made from the Avila dataset obtained from the UCI Machine Learning Repository. Here is the description of the data from the above source:
Data Set Information:
Data have been normalized using the Z-normalization method and divided into two data sets: a training set containing 10,430 samples and a test set containing 10,437 samples.
CLASS DISTRIBUTION (training set)
A: 4286, B: 5, C: 103, D: 352, E: 1095, F: 1961, G: 446, H: 519, I: 831, W: 44, X: 522, Y: 266
Attribute Information:
- F1: intercolumnar distance
- F2: upper margin
- F3: lower margin
- F4: exploitation
- F5: row number
- F6: modular ratio
- F7: interlinear spacing
- F8: weight
- F9: peak number
- F10: modular ratio / interlinear spacing
- Class: A, B, C, D, E, F, G, H, I, W, X, Y
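The Z-normalization mentioned above standardizes each feature column to zero mean and unit variance. A minimal sketch; the sample values are illustrative, not taken from Avila:

```python
from statistics import mean, pstdev

def z_normalize(values):
    """Standardize a feature column: subtract the mean, divide by the std dev."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

col = [10.0, 12.0, 14.0, 16.0]
z = z_normalize(col)
print(z)                                          # symmetric around 0
print(round(mean(z), 10), round(pstdev(z), 10))   # ~0.0 and ~1.0
```

Note that in practice the mean and standard deviation should be computed on the training set only and then reused to transform the test set, to avoid leakage.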
License: CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)
This dataset contains a total of 1,000 images, with an equal distribution of 500 images of dogs and 500 images of cats. The images are standardized to a resolution of 512x512 pixels.
This dataset is ideal for tasks such as: - Binary classification - Image recognition and processing - Machine learning and deep learning model training
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This data set is a collection of 2,000 bike and car images. While collecting these images, care was taken to include all types of bikes and cars, because of the high intra-class variety: there are many different kinds of cars and bikes, which makes the task a little tougher for the model, since it must also learn that variety. But if your model can understand the basic structure of a car and a bike, it will be able to distinguish between the two classes.
The data is not preprocessed. This is intentional, so that you can apply whichever augmentations you want to use. Almost all of the 2,000 images are unique, so after applying some data augmentation you can increase the size of the data set.
The data is not split into training and validation subsets, but you can easily do so using an ImageDataGenerator from Keras. The preprocessing steps are available in the notebook associated with this data set. You can practice your computer vision skills using this data set; this is a binary classification task.
License: MIT (https://opensource.org/licenses/MIT)
Motivation:
Phishing attacks are one of the most significant cyber threats in today’s digital era, tricking users into divulging sensitive information like passwords, credit card numbers, and personal details. This dataset aims to support research and development of machine learning models that can classify URLs as phishing or benign.
Applications:
- Building robust phishing detection systems.
- Enhancing security measures in email filtering and web browsing.
- Training cybersecurity practitioners in identifying malicious URLs.
The dataset contains diverse features extracted from URL structures, HTML content, and website metadata, enabling deep insights into phishing behavior patterns.
This dataset comprises two types of URLs:
1. Phishing URLs: Malicious URLs designed to deceive users.
2. Benign URLs: Legitimate URLs posing no harm to users.
Key Features:
- URL-based features: Domain, protocol type (HTTP/HTTPS), and IP-based links.
- Content-based features: Link density, iframe presence, external/internal links, and metadata.
- Certificate-based features: SSL/TLS details like validity period and organization.
- WHOIS data: Registration details like creation and expiration dates.
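URL-based features like those listed can be extracted with the standard library alone. This is a simplified sketch of the idea, not the repository's actual extraction code, and covers only a few stand-ins for the dataset's 22 features:

```python
from urllib.parse import urlparse
import re

def url_features(url):
    """Extract a few simple URL-based features of the kind listed above."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "domain": host,
        "uses_https": parsed.scheme == "https",
        # IP-based links: a raw IPv4 address used instead of a domain name
        "is_ip_address": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "url_length": len(url),
    }

feats = url_features("http://192.168.0.1/login")
print(feats)  # flags the raw-IP host and the plain-HTTP scheme
```

Raw-IP hosts and missing HTTPS are classic phishing indicators, which is why they appear among the URL-based features described above.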
Statistics:
- Total Samples: 800 (400 phishing, 400 benign).
- Features: 22 including URL, domain, link density, and SSL attributes.
To ensure statistical reliability, a power analysis was conducted to determine the minimum sample size required for binary classification with 22 features. Using a medium effect size (0.15), alpha = 0.05, and power = 0.80, the analysis indicated a minimum sample size of ~325 per class. Our dataset exceeds this requirement with 400 examples per class, ensuring robust model training.
Insights from EDA:
- Distribution Plots: Histograms and density plots for numerical features like link density, URL length, and iframe counts.
- Bar Plots: Class distribution and protocol usage trends.
- Correlation Heatmap: Highlights relationships between numerical features to identify multicollinearity or strong patterns.
- Box Plots: For SSL certificate validity and URL lengths, comparing phishing versus benign URLs.
EDA visualizations are provided in the repository.
The repository contains the Python code used to extract features, conduct EDA, and build the dataset.
Phishing detection datasets must balance the need for security research with the risk of misuse. This dataset:
1. Protects User Privacy: No personally identifiable information is included.
2. Promotes Ethical Use: Intended solely for academic and research purposes.
3. Avoids Reinforcement of Bias: Balanced class distribution ensures fairness in training models.
Risks:
- Misuse of the dataset for creating more deceptive phishing attacks.
- Over-reliance on outdated features as phishing tactics evolve.
Researchers are encouraged to pair this dataset with continuous updates and contextual studies of real-world phishing.
This dataset is shared under the MIT License, allowing free use, modification, and distribution for academic and non-commercial purposes. License details can be found here.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset, titled "Federated Health Records for Privacy-Preserving AI Research," is a healthcare dataset designed to support research and experimentation in Federated Learning (FL) and Homomorphic Encryption (HE) for secure artificial intelligence applications.
Each record represents a simulated patient's health profile, including key features such as age, BMI, blood pressure, glucose and insulin levels, physical activity, and diet quality. The dataset is partitioned by client_id, simulating data distributed across multiple hospitals or mobile devices, where direct data sharing is restricted due to privacy concerns.
The target variable, risk_of_diabetes, is a binary indicator derived from a logistic function applied to health metrics, helping researchers model classification tasks in a privacy-aware environment.
💡 Key Features Federated-ready: Labeled by client_id to simulate decentralized data sources.
Privacy-focused: Supports homomorphic encryption-based model updates.
Flexible use: Suitable for classification, secure model aggregation, and robustness testing.
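The description above says risk_of_diabetes is derived from a logistic function applied to the health metrics. A sketch of that kind of construction; the coefficients and baselines here are made up for illustration, not the dataset's actual generator:

```python
import math

def risk_of_diabetes(glucose, bmi, age, threshold=0.5):
    """Hypothetical logistic scoring: higher glucose/BMI/age raise the risk.
    The weights below are illustrative, not the dataset's real parameters."""
    z = 0.05 * (glucose - 100) + 0.1 * (bmi - 25) + 0.02 * (age - 40)
    prob = 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function
    return int(prob >= threshold), prob

label, prob = risk_of_diabetes(glucose=100, bmi=25, age=40)
print(label, prob)  # z = 0 at the baselines, so probability is exactly 0.5
```

A thresholded logistic score like this yields the binary target while keeping a smooth underlying probability, which is convenient for classification experiments.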
This is the dataset used in "An Adaptability-Enhanced Few-Shot Website Fingerprinting Attack Based on Collusion", which consists of the four following datasets.
[1] V. Rimmer, D. Preuveneers, M. Juarez, T. Van Goethem, and W. Joosen, "Automated website fingerprinting through deep learning," Network and Distributed System Security Symposium, 2017.
[2] P. Sirinam, M. Imani, M. Juarez, and M. Wright, "Deep fingerprinting: Undermining website fingerprinting defenses with deep learning," in Proceedings of the 2018 ACM Conference on Computer and Communications Security, 2018, pp. 1928–1943.
[3] T. Wang, X. Cai, R. Nithyanand, R. Johnson, and I. Goldberg, "Effective attacks and provable defenses for website fingerprinting," in 23rd USENIX Security Symposium, 2014, pp. 143–157.
[4] J. Gong and T. Wang, "Zero-delay lightweight defenses against website fingerprinting," in 29th USENIX Security Symposium, 2020, pp. 717–734.
License: MIT (https://opensource.org/licenses/MIT)
Comprehensive Software Engineering Interview Questions Dataset
Description: Overview This dataset is an extensive collection of software engineering interview questions, designed to mirror the complexity and depth of questions asked in interviews at top tech companies, including FAANG (Facebook, Amazon, Apple, Netflix, Google). It encompasses a wide range of topics, from algorithms and data structures to system design and machine learning. The dataset is curated to assist candidates in preparing for technical interviews and to provide educators and interviewers with a resource for assessing technical skills.
Dataset Details
- Number of Questions: 250
- Categories Covered: Algorithms, System Design, Machine Learning, Data Structures, Distributed Systems, Networking, Low-level Systems, Security, Database Systems, Artificial Intelligence, Data Engineering.
- Difficulty Level: Primarily Hard.
- Format: The dataset is structured in a tabular format with columns for Question Number, Question, Brief Answer, Category, and Difficulty.
- Usage Scenarios: Interview preparation for candidates, educational resource for learning advanced software engineering concepts, tool for interviewers to structure technical assessments.
Potential Analysis
Users can perform various analyses, such as:
- Category-wise Distribution: Understand the focus areas in software engineering roles.
- Difficulty Analysis: Gauge the complexity level of questions typically asked in high-end tech interviews.
- Trend Analysis: Identify trends in technical questions over recent years, especially in rapidly evolving fields like Machine Learning and AI.
Inspiration
This dataset is intended to inspire:
- Job Candidates: To prepare comprehensively for technical interviews.
- Educators: To structure curriculum or coursework around practical, interview-oriented learning.
- Researchers: To analyze trends in technical interviews and skill requirements in the tech industry.
- Interviewers/Hiring Managers: To formulate effective interview strategies and questionnaires.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
Student Learning Methods: A Survey - Dataset Description The Student Learning Methods: A Survey dataset comprises responses from 100 university students, with 10 participants surveyed from each of 10 different universities. This dataset explores students' preferences and evaluations of various learning methods based on effectiveness and engagement.
Key Features of the Dataset: Survey Scope:
Responses collected from 100 students.
Participants evenly distributed across 10 different universities (10 students per university).
Learning Methods Evaluated:
The dataset includes ratings for various learning techniques, such as:
Lectures – Traditional classroom-based teaching.
Case Studies – Analyzing real-world scenarios to understand concepts.
Group Projects – Collaborative assignments involving multiple students.
Experiments – Hands-on practical work in labs or controlled settings.
Online Tutorials – Digital or video-based instructional materials.
Evaluation Criteria:
Each learning method is rated on a numerical scale based on:
Effectiveness – How well students believe the method helps in learning.
Engagement – How interesting or interactive the method is perceived to be.
Secondary Evaluations:
The dataset includes repeated columns for learning methods, potentially representing:
Post-survey reflections where students reassessed their initial responses.
Comparative evaluations of different methods after exposure to multiple approaches.
Overall Effectiveness and Engagement Scores:
Each student provides aggregate scores summarizing how useful and engaging they found different learning methods overall.
Potential Use Cases:
Educational Research – Understanding which teaching techniques are most effective across universities.
Curriculum Development – Helping educators refine teaching strategies.
Student-Centric Learning Models – Identifying preferred methods to enhance student engagement.
Comparative Analysis – Examining how student preferences vary across universities.
The surveyed universities include:
Delhi University (DU) – A large central university in Delhi.
Jawaharlal Nehru University (JNU) – A well-known research-focused university in Delhi.
Banaras Hindu University (BHU) – A prestigious university in Varanasi, Uttar Pradesh.
Aligarh Muslim University (AMU) – A renowned university in Aligarh, Uttar Pradesh.
Chandigarh University – A fast-growing private university in Punjab.
Kurukshetra University – A public university in Haryana.
Himachal Pradesh University (HPU) – A state university in Shimla, Himachal Pradesh.
Guru Gobind Singh Indraprastha University (GGSIPU) – A Delhi-based state university.
Dr. B. R. Ambedkar University, Agra – A public university in Uttar Pradesh.
Uttarakhand Technical University (UTU) – A state technical university in Uttarakhand.
This dataset offers valuable insights into student learning preferences, enabling researchers and educators to tailor teaching methods for maximum impact.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
ImageNet-R (ImageNet-Renditions) is a variant of the original ImageNet dataset designed to evaluate the robustness and generalization of image classification models to out-of-distribution (OOD) data. It contains 30,000 images corresponding to 200 classes from ImageNet, but instead of natural photographs, the images are renditions—such as sketches, paintings, cartoons, embroidery, and clay sculptures—that significantly differ in texture and appearance from the original training distribution.
ImageNet-R serves as a benchmark for assessing how well models trained on standard ImageNet data perform when exposed to domain-shifted inputs, especially those involving non-naturalistic visual styles. It highlights the tendency of many deep learning models to rely heavily on texture rather than shape cues, thus revealing potential brittleness in real-world deployment scenarios.
License: GNU GPL v2.0 (http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)
This dataset is intended for training and testing machine learning models for CAPTCHA recognition. It contains CAPTCHA images along with corresponding filenames, which represent the text displayed on the image.
The dataset includes over 1,500 CAPTCHA images, each of which is associated with a filename that corresponds to the text on the image. The images are provided in standard formats.
The goal of this dataset is to provide data for training machine learning models, including neural networks, for solving CAPTCHA recognition tasks. This dataset can be used for classification tasks and optical character recognition (OCR) challenges.
For example, the file abc123.png contains the text abc123.
This dataset is distributed under the GPL-2 license. The GPL-2 license allows for the use, distribution, and modification of the dataset, with the condition that derivative works must also be distributed under the GPL-2 license. Users must also provide access to the source code if modifications are used to create new projects.
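Since each filename encodes the ground-truth text, labels can be recovered directly from the file list when building a training set. A minimal sketch:

```python
import os

def label_from_filename(path):
    """The CAPTCHA text is the filename without its extension,
    e.g. abc123.png -> abc123."""
    return os.path.splitext(os.path.basename(path))[0]

print(label_from_filename("captchas/abc123.png"))  # abc123
```

Pairing each image with the label extracted this way gives the (image, text) examples needed for OCR-style model training.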
This dataset is based in part on CAPTCHA Data by alizahidraja. The original images from that dataset were used as a foundation, and additional custom CAPTCHA images have been added to expand and diversify the dataset.
This combination aims to provide a richer and more varied training set for machine learning models focused on CAPTCHA recognition.