https://choosealicense.com/licenses/unknown/
Professor Enfuse & Learner: Comparing supervised and unsupervised learning strategies for real-world applications - Generated by Conversation Dataset Generator
This dataset was generated using the Conversation Dataset Generator script available at https://cahlen.github.io/conversation-dataset-generator/.
Generation Parameters
Number of Conversations Requested: 500
Number of Conversations Successfully Generated: 500
Total Turns: 6659
Model ID: … See the full description on the dataset page: https://huggingface.co/datasets/cahlen/cdg-AICourse-Level2-Sklearn.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://choosealicense.com/licenses/cc0-1.0/
Iris Species Dataset
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository. It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other. The dataset is taken from UCI Machine Learning Repository's… See the full description on the dataset page: https://huggingface.co/datasets/scikit-learn/iris.
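Since this card is hosted under the scikit-learn namespace, a minimal sketch of loading the same data from scikit-learn's bundled copy (an alternative access path, not the Hugging Face page above) might look like this:

from sklearn.datasets import load_iris

# Load the 150-sample, 3-class Iris data shipped with scikit-learn
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
print(X.shape)            # (150, 4)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']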
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 2. Table of grid search parameters. Parameters are relevant to a scikit-learn implementation.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Flowchart: https://qiangli.de/imgs/flowchart2%20(1).png
An Explainable Visual Benchmark Dataset for Robustness Evaluation. A Dataset for Image Background Exploration!
Blur Background, Segmented Background, AI-generated Background, Bias of Tools During Annotation, Color in Background, Random Background with Real Environment
+⭐ Follow Authors for project updates.
Website: XimageNet-12
Here, we try to understand how image backgrounds affect computer vision models on tasks such as detection and classification. Building on the baseline work by Li et al. at ICLR 2022, Explainable AI: Object Recognition With Help From Background, we are now enlarging the dataset and analyzing the following topics: Blur Background / Segmented Background / AI-generated Background / Bias of Tools During Annotation / Color in Background / Dependent Factor in Background / Latent-Space Distance of Foreground / Random Background with Real Environment! Ultimately, we also define a mathematical equation for the Robustness Score! If you are interested in how we built it or would like to join this research project, please feel free to collaborate with us!
In this paper, we propose an explainable visual dataset, XIMAGENET-12, to evaluate the robustness of visual models. XIMAGENET-12 consists of over 200K images with 15,410 manual semantic annotations. Specifically, we deliberately selected 12 categories from ImageNet, representing objects commonly encountered in practical life. To simulate real-world situations, we incorporated six diverse scenarios, such as overexposure, blurring, and color changes. We further develop a quantitative criterion for robustness assessment, allowing for a nuanced understanding of how visual models perform under varying conditions, notably in relation to the background.
We employed a combination of tools and methodologies to generate the images in this dataset, ensuring both efficiency and quality in the annotation and synthesis processes.
For a detailed breakdown of our prompt engineering and hyperparameters, we invite you to consult our upcoming paper. This publication will provide comprehensive insights into our methodologies, enabling a deeper understanding of the image generation process.
This dataset has been/could be downloaded via Kaggl...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 3. Table of tuned models used by seqQscorer. Dataset: species-subset-layout, generic: all data. Feature sets: RAW (raw data), MAP (genome mapping), LOC (genomic localization), TSS (transcription start sites profile). Feature Selection: method-percentage (percentage of retained features), chi-square (chi2), recursive feature elimination (RFE). Algorithm Parameters: relevant to a scikit-learn implementation.
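As a rough illustration of the kind of scikit-learn setup these tables describe (a generic sketch: the estimator, percentages, and grid values here are placeholders rather than values from the tables; RFE is the other selector the tables mention):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in data; chi2 requires non-negative feature values
X, y = make_classification(n_samples=200, n_features=40, random_state=0)
X = np.abs(X)

pipe = Pipeline([
    ("select", SelectPercentile(chi2, percentile=50)),  # percentage of retained features
    ("clf", RandomForestClassifier(random_state=0)),
])
param_grid = {"select__percentile": [25, 50, 75], "clf__n_estimators": [100, 300]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)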
Data Description
The CADDI dataset is designed to support research in in-class activity recognition using IMU data from low-cost sensors. It provides multimodal data capturing 19 different activities performed by 12 participants in a classroom environment, utilizing both IMU sensors from a Samsung Galaxy Watch 5 and synchronized stereo camera images. This dataset enables the development and validation of activity recognition models using sensor fusion techniques.

Data Generation Procedures
The data collection process involved recording both continuous and instantaneous activities that typically occur in a classroom setting. The activities were captured using a custom setup, which included:
A Samsung Galaxy Watch 5 to collect accelerometer, gyroscope, and rotation vector data at 100Hz.
A ZED stereo camera capturing 1080p images at 25-30 fps.
A synchronized computer acting as a data hub, receiving IMU data and storing images in real time.
A D-Link DSR-1000AC router for wireless communication between the smartwatch and the computer.
Participants were instructed to arrange their workspace as they would in a real classroom, including a laptop, notebook, pens, and a backpack. Data collection was performed under realistic conditions, ensuring that activities were captured naturally.

Temporal and Spatial Scope
The dataset contains a total of 472.03 minutes of recorded data.
The IMU sensors operate at 100Hz, while the stereo camera captures images at 25-30Hz.
Data was collected from 12 participants, each performing all 19 activities multiple times.
The geographical scope of data collection was Alicante, Spain, under controlled indoor conditions.

Dataset Components
The dataset is organized into JSON and PNG files, structured hierarchically:
IMU Data: stored in JSON files, containing:
Samsung Linear Acceleration Sensor (X, Y, Z values, 100Hz)
LSM6DSO Gyroscope (X, Y, Z values, 100Hz)
Samsung Rotation Vector (X, Y, Z, W quaternion values, 100Hz)
Samsung HR Sensor (heart rate, 1Hz)
OPT3007 Light Sensor (ambient light levels, 5Hz)
Stereo Camera Images: high-resolution 1920×1080 PNG files from left and right cameras.
Synchronization: each IMU data record and image is timestamped for precise alignment.

Data Structure
The dataset is divided into continuous and instantaneous activities:
Continuous activities (e.g., typing, writing, drawing) were recorded for 210 seconds, with the central 200 seconds retained.
Instantaneous activities (e.g., raising a hand, drinking) were repeated 20 times per participant, with data captured only during execution.
The dataset is structured as:
/continuous/subject_id/activity_name/
  /camera_a/ → Left camera images
  /camera_b/ → Right camera images
  /sensors/ → JSON files with IMU data
/instantaneous/subject_id/activity_name/repetition_id/
  /camera_a/
  /camera_b/
  /sensors/

Data Quality & Missing Data
The smartwatch buffers 100 readings per second before sending them, ensuring minimal data loss.
Synchronization latency between the smartwatch and the computer is negligible.
Not all IMU samples have corresponding images due to the differing recording rates.
Outliers and anomalies were handled by discarding incomplete sequences at the start and end of continuous activities.

Error Ranges & Limitations
Sensor data may contain noise due to minor hand movements.
The heart rate sensor operates at 1Hz, limiting its temporal resolution.
Camera exposure settings were automatically adjusted, which may introduce slight variations in lighting.

File Formats & Software Compatibility
IMU data is stored in JSON format, readable with Python's json library.
Images are in PNG format, compatible with all standard image processing tools.
Recommended libraries for data analysis:
Python: numpy, pandas, scikit-learn, tensorflow, pytorch
Visualization: matplotlib, seaborn
Deep Learning: Keras, PyTorch

Potential Applications
Development of activity recognition models in educational settings.
Study of student engagement based on movement patterns.
Investigation of sensor fusion techniques combining visual and IMU data.
This dataset represents a unique contribution to activity recognition research, providing rich multimodal data for developing robust models in real-world educational environments.

Citation
If you find this project helpful for your research, please cite our work using the following bibtex entry:
@misc{marquezcarpintero2025caddiinclassactivitydetection, title={CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors}, author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Monica Pina-Navarro and Miguel Cazorla and Francisco Gomez-Donoso}, year={2025}, eprint={2503.02853}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2503.02853}, }
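A minimal sketch for reading one sensor JSON file into pandas (the path follows the documented hierarchy, but the exact file and field names are illustrative assumptions, not taken from the dataset documentation):

import json
import pandas as pd

# Hypothetical path following the documented /continuous/... hierarchy
path = "continuous/subject_01/typing/sensors/imu.json"
with open(path) as f:
    records = json.load(f)

# Assumes each record is a dict holding a timestamp plus sensor values
df = pd.DataFrame(records)
print(df.head())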
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
Software Requirements:
To open and work with this dataset, you need a Python environment such as VS Code or Jupyter, with libraries such as pandas, numpy, scikit-learn, and matplotlib.
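A minimal sketch for loading the three splits with pandas (file names taken from the conventions above; treating the last column as the label is an assumption, since the schema is not documented here):

import pandas as pd

train = pd.read_csv("train_data.csv")
val = pd.read_csv("validation_data.csv")
test = pd.read_csv("test_data.csv")

# Assumption: the last column holds the label, the rest are features
X_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]
print(X_train.shape, y_train.shape)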
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
This is a detailed description of the dataset, a datasheet for the dataset as proposed by Gebru et al.
Motivation for Dataset Creation Why was the dataset created? Embrapa ADD 256 (Apples by Drones Detection Dataset — 256 × 256) was created to provide images and annotation for research on apple detection in orchards for UAV-based monitoring in apple production.
What (other) tasks could the dataset be used for? Apple detection in low-resolution scenarios, similar to the aerial images employed here.
Who funded the creation of the dataset? The building of the ADD256 dataset was supported by the Embrapa SEG Project 01.14.09.001.05.04, Image-based metrology for Precision Agriculture and Phenotyping, and by FAPESP under grant 2017/19282-7.
Dataset Composition What are the instances? Each instance consists of an RGB image and an annotation describing apple locations as circular markers (i.e., giving center and radius).
How many instances of each type are there? The dataset consists of 1,139 images containing 2,471 apples.
What data does each instance consist of? Each instance contains an 8-bit RGB image. Its corresponding annotation is found in the JSON files: each apple marker is composed of its center (cx, cy) and its radius (in pixels), as seen below:
"gebler-003-06.jpg": [ { "cx": 116, "cy": 117, "r": 10 }, { "cx": 134, "cy": 113, "r": 10 }, { "cx": 221, "cy": 95, "r": 11 }, { "cx": 206, "cy": 61, "r": 11 }, { "cx": 92, "cy": 1, "r": 10 } ],
Dataset.ipynb is a Jupyter Notebook presenting a code example for reading the data as a PyTorch Dataset (it should be straightforward to adapt the code for other frameworks such as Keras/TensorFlow, fastai/PyTorch, scikit-learn, etc.).
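The notebook is the authoritative example; as a rough sketch of what such a Dataset can look like, given the JSON marker format above (the directory layout and file names here are assumptions):

import json
from PIL import Image
from torch.utils.data import Dataset

class Add256Dataset(Dataset):
    def __init__(self, image_dir, annotation_file):
        # annotation_file maps image names to lists of circular apple markers
        with open(annotation_file) as f:
            self.annotations = json.load(f)
        self.image_dir = image_dir
        self.names = sorted(self.annotations)

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        image = Image.open(f"{self.image_dir}/{name}").convert("RGB")
        markers = self.annotations[name]  # [{"cx": ..., "cy": ..., "r": ...}, ...]
        return image, markers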
Is everything included or does the data rely on external resources? Everything is included in the dataset.
Are there recommended data splits or evaluation measures? The dataset comes with specified train/test splits. The splits are found in lists stored as JSON files.
| | Number of images | Number of annotated apples |
| --- | --- | --- |
| Training | 1,025 | 2,204 |
| Test | 114 | 267 |
| Total | 1,139 | 2,471 |
Recommended dataset split.
Standard measures from the information retrieval and computer vision literature should be employed: precision and recall, F1-score and average precision as seen in COCO and Pascal VOC.
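These measures are directly available in scikit-learn; as a small illustration with toy labels (a real evaluation would first match detections to the annotated markers):

from sklearn.metrics import precision_score, recall_score, f1_score

# Toy binary labels: 1 = apple detected, 0 = no apple
y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))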
What experiments were initially run on this dataset? The first experiments run on this dataset are described in A methodology for detection and location of fruits in apples orchards from aerial images by Santos & Gebler (2021).
Data Collection Process How was the data collected? The data employed in the development of the methodology came from two plots located at the Embrapa's Temperate Climate Fruit Growing Experimental Station at Vacaria-RS (28°30'58.2"S, 50°52'52.2"W). Plants of the varieties Fuji and Gala are present in the dataset, in equal proportions. The images were taken on December 13, 2018, by a UAV (DJI Phantom 4 Pro) that flew over the rows of the field at a height of 12 m. The images mix nadir and non-nadir views, allowing a more extensive view of the canopies. A subset of the images was randomly selected, and 256 × 256 pixel patches were extracted.
Who was involved in the data collection process? T. T. Santos and L. Gebler captured the images in the field. T. T. Santos performed the annotation.
How was the data associated with each instance acquired? The circular markers were annotated using the VGG Image Annotator (VIA).
WARNING: Finding non-ripe apples in low-resolution images of orchards is a challenging task even for humans. ADD256 was annotated by a single annotator, so users of this dataset should consider it a noisy dataset.
Data Preprocessing What preprocessing/cleaning was done? No preprocessing was applied.
Dataset Distribution How is the dataset distributed? The dataset is available at GitHub.
When will the dataset be released/first distributed? The dataset was released in October 2021.
What license (if any) is it distributed under? The data is released under Creative Commons BY-NC 4.0 (Attribution-NonCommercial 4.0 International license). There is a request to cite the corresponding paper if the dataset is used. For commercial use, contact Embrapa Agricultural Informatics business office.
Are there any fees or access/export restrictions? There are no fees or restrictions. For commercial use, contact Embrapa Agricultural Informatics business office.
Dataset Maintenance Who is supporting/hosting/maintaining the dataset? The dataset is hosted at Embrapa Agricultural Informatics and all comments or requests can be sent to Thiago T. Santos (maintainer).
Will the dataset be updated? There are no scheduled updates.
If others want to extend/augment/build on this dataset, is there a mechanism for them to do so? Contributors should contact the maintainer by e-mail.
No warranty The maintainers and their institutions are exempt from any liability, judicial or extrajudicial, for any losses or damages arising from the use of the data contained in the image database.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.
Purpose:
The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.
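A minimal sketch of this setup with scikit-learn (using only the two documented columns; treating the identifier as a plain numeric feature is a simplifying assumption for illustration):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("user_ratings_dataset.csv")
X = df[["place_or_event_id"]]  # assumes numeric identifiers
y = df["rating"]               # target: rating from 1 to 5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)
print(model.predict(X_test[:5]))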
Creation Methodology:
The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validating, and testing the model.
Structure of the Dataset:
The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:
place_or_event_id: Unique identifier for each tourist place or event.
rating: Rating given by the user, ranging from 1 to 5.
The data is split into three subsets:
Training Set: 80% of the dataset used to train the model.
Validation Set: A small portion used for hyperparameter tuning.
Test Set: 20% used to evaluate model performance.
Folder and File Naming Conventions:
The dataset files are stored in the following structure:
user_ratings_dataset.csv: The original dataset file containing user ratings.
tour_recommendation_model.pkl: The saved model after training.
actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.
Software Requirements:
To open and work with this dataset, the following software and libraries are required:
Python 3.x
Pandas for data manipulation
Scikit-learn for training and evaluating machine learning models
Matplotlib for chart generation
Joblib for saving and loading the trained model
The dataset can be opened and processed using any Python environment that supports these libraries.
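As an illustration, the trained model referenced above can be saved and reloaded with Joblib (standard Joblib calls; the file name follows the conventions listed earlier):

import joblib

# Reload the trained Decision Tree Regressor saved during training
model = joblib.load("tour_recommendation_model.pkl")

# ... and persist it again after any retraining
joblib.dump(model, "tour_recommendation_model.pkl")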
Additional Resources:
The model training code, README file, and performance chart are available in the project repository.
For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).
Dataset Reusability:
The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:
Train other types of models (e.g., regression, classification).
Experiment with different features or add more metadata to enrich the dataset.
Data Integrity:
The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.
Licensing:
The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 5. Table of models tuned using the grid search. Dataset: species-subset-layout, generic: all data. Feature sets: RAW (raw data), MAP (genome mapping), LOC (genomic localization), TSS (transcription start sites profile). Feature Selection: method-percentage (percentage of retained features), chi-square (chi2), recursive feature elimination (RFE). Algorithm Parameters: relevant to a scikit-learn implementation.
https://dataintelo.com/privacy-and-policy
The global market size for Data Science and ML Platforms was estimated to be approximately USD 78.9 billion in 2023, and it is projected to reach around USD 307.6 billion by 2032, growing at a Compound Annual Growth Rate (CAGR) of 16.4% during the forecast period. This remarkable growth can be largely attributed to the increasing adoption of artificial intelligence (AI) and machine learning (ML) across various industries to enhance operational efficiency, predictive analytics, and decision-making processes.
The surge in big data and the necessity to make sense of unstructured data is a substantial growth driver for the Data Science and ML Platforms market. Organizations are increasingly leveraging data science and machine learning to gain insights that can help them stay competitive. This is especially true in sectors like retail and e-commerce where customer behavior analytics can lead to more targeted marketing strategies, personalized shopping experiences, and improved customer retention rates. Additionally, the proliferation of IoT devices is generating massive amounts of data, which further fuels the need for advanced data analytics platforms.
Another significant growth factor is the increasing adoption of cloud-based solutions. Cloud platforms offer scalable resources, flexibility, and substantial cost savings, making them attractive for enterprises of all sizes. Cloud-based data science and machine learning platforms also facilitate collaboration among distributed teams, enabling more efficient workflows and faster time-to-market for new products and services. Furthermore, advancements in cloud technologies, such as serverless computing and containerization, are making it easier for organizations to deploy and manage their data science models.
Investment in AI and ML by key industry players also plays a crucial role in market growth. Tech giants like Google, Amazon, Microsoft, and IBM are making substantial investments in developing advanced AI and ML tools and platforms. These investments are not only driving innovation but also making these technologies more accessible to smaller enterprises. Additionally, mergers and acquisitions in this space are leading to more integrated and comprehensive solutions, which are further accelerating market growth.
Machine Learning Tools are at the heart of this technological evolution, providing the necessary frameworks and libraries that empower developers and data scientists to create sophisticated models and algorithms. These tools, such as TensorFlow, PyTorch, and Scikit-learn, offer a range of functionalities from data preprocessing to model deployment, catering to both beginners and experts. The accessibility and versatility of these tools have democratized machine learning, enabling a wider audience to harness the power of AI. As organizations continue to embrace digital transformation, the demand for robust machine learning tools is expected to grow, driving further innovation and development in this space.
From a regional perspective, North America is expected to hold the largest market share due to the early adoption of advanced technologies and the presence of major market players. However, the Asia Pacific region is anticipated to exhibit the highest growth rate during the forecast period. This is driven by increasing investments in AI and ML, a burgeoning start-up ecosystem, and supportive government policies aimed at digital transformation. Countries like China, India, and Japan are at the forefront of this growth, making significant strides in AI research and application.
When analyzing the Data Science and ML Platforms market by component, it's essential to differentiate between software and services. The software segment includes platforms and tools designed for data ingestion, processing, visualization, and model building. These software solutions are crucial for organizations looking to harness the power of big data and machine learning. They provide the necessary infrastructure for data scientists to develop, test, and deploy ML models. The software segment is expected to grow significantly due to ongoing advancements in AI algorithms and the increasing need for more sophisticated data analysis tools.
The services segment in the Data Science and ML Platforms market encompasses consulting, system integration, and support services. Consulting services help organizatio
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The provided scripts are designed to process hyperspectral images of construction and demolition waste (CDW) materials, extract relevant features, and train a machine-learning model for material classification.

Before running the scripts, ensure that you have the following:

Python libraries: numpy, matplotlib, scipy, pandas, scikit-learn, seaborn, rembg (for background removal), and Pillow (PIL).
Data in .mat format containing calibrated hyperspectral cubes and wavelength information.

hyperspectral_features_v2.py

This script processes individual hyperspectral image files to extract spectral features from a central subset of the image. It generates RGB images from the hyperspectral data, plots the mean reflectance spectra, and outputs a LaTeX-formatted table containing the extracted features. It reads .mat files containing hyperspectral data from a specified input directory.

Prepare Input Data: Place the .mat files containing the hyperspectral data in the appropriate input directory (e.g., input/mortar).

Run the Script: Update the materials list at the end of the script to include the materials you want to process (e.g., materials = ['mortar']).
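A minimal sketch of reading one such file with SciPy (the file name and the variable keys inside the .mat file are assumptions; inspect the file to find the real ones):

from scipy.io import loadmat

mat = loadmat("input/mortar/sample_01.mat")  # hypothetical file name
print(mat.keys())  # inspect the variables actually stored in the file

# Assumed keys for the calibrated cube and the wavelength axis
cube = mat.get("cube")
wavelengths = mat.get("wavelengths")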
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The dataset in this study was processed and visualized using the pandas, matplotlib, and matminer libraries within the scikit-learn framework. This dataset primarily includes data from the processes of machine learning model training, high-throughput material generation and screening, and first-principles calculation validation. The dataset consists of two main files: data_set and data of supplementary materials.

The data_set file contains the original data (janus.csv), the feature-engineered original data (janus_featured.csv), the element-substituted data (prd_janus.csv), the feature-engineered element-substituted data (prd_janus_featured.csv), and the model screening results (stable_high_magnetization.csv). The data of supplementary materials file includes Figure S1 and Table S1. Figure S1 presents an analysis of feature importance for lattice constants a = b, lattice constant c, formation energy, and magnetic moment categories during model training. Table S1 provides the formation energy and magnetic moment obtained from direct static self-consistent calculations for 13 unoptimized Janus structures.

(1) janus.csv is the original dataset obtained from the Materials Project database, containing information on the chemical composition (elements and stoichiometry), crystal space group, lattice constants, formation energy, and total magnetic moment of 1,179 two-dimensional hexagonal ABC-type Janus materials.
(2) janus_featured.csv is the dataset obtained by applying feature engineering based on elemental composition information to the original dataset.
(3) prd_janus.csv is a dataset of 82,018 ABC-type two-dimensional Janus materials, not yet experimentally synthesized, generated by random substitution of elements A, B, and C from the periodic table based on the two-dimensional hexagonal ABC-type Janus structures in the original dataset.
(4) prd_janus_featured.csv is the feature-engineered dataset of the element-substituted materials.
(5) stable_high_magnetization.csv is the dataset obtained by applying a trained machine learning model to the feature-engineered element-substituted data, containing 4,204 Janus structures with lattice information, thermal stability, and high magnetic moment.
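A minimal sketch for loading these files with pandas (only the file names above are documented; column names are not, so only generic operations are shown):

import pandas as pd

original = pd.read_csv("janus.csv")                       # 1,179 known Janus materials
candidates = pd.read_csv("prd_janus_featured.csv")        # 82,018 element-substituted candidates
screened = pd.read_csv("stable_high_magnetization.csv")   # 4,204 screened structures

print(len(original), len(candidates), len(screened))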
ABSTRACT In this project, we propose a new comprehensive realistic cyber security dataset of IoT and IIoT applications, called Edge-IIoTset, which can be used by machine learning-based intrusion detection systems in two different modes, namely, centralized and federated learning. Specifically, the proposed testbed is organized into seven layers, including the Cloud Computing Layer, Network Functions Virtualization Layer, Blockchain Network Layer, Fog Computing Layer, Software-Defined Networking Layer, Edge Computing Layer, and IoT and IIoT Perception Layer. In each layer, we propose new emerging technologies that satisfy the key requirements of IoT and IIoT applications, such as the ThingsBoard IoT platform, OPNFV platform, Hyperledger Sawtooth, Digital twin, ONOS SDN controller, Mosquitto MQTT brokers, Modbus TCP/IP, etc. The IoT data are generated from various IoT devices (more than 10 types), such as low-cost digital sensors for sensing temperature and humidity, an ultrasonic sensor, a water level detection sensor, a pH sensor meter, a soil moisture sensor, a heart rate sensor, a flame sensor, etc. We identify and analyze fourteen attacks related to IoT and IIoT connectivity protocols, which are categorized into five threats, including DoS/DDoS attacks, information gathering, man-in-the-middle attacks, injection attacks, and malware attacks. In addition, we extract features obtained from different sources, including alerts, system resources, logs, and network traffic, and propose 61 new features with high correlations from 1,176 found features. After processing and analyzing the proposed realistic cyber security dataset, we provide a primary exploratory data analysis and evaluate the performance of machine learning approaches (i.e., traditional machine learning as well as deep learning) in both centralized and federated learning modes.
Instructions:
Great news! The Edge-IIoT dataset has been featured as a "Document in the top 1% of Web of Science." This indicates that it is ranked within the top 1% of all publications indexed by the Web of Science (WoS) in terms of citations and impact.
Please visit the Kaggle link for updates: https://www.kaggle.com/datasets/mohamedamineferrag/edgeiiotset-cyber-sec...
Free use of the Edge-IIoTset dataset for academic research purposes is hereby granted in perpetuity. Use for commercial purposes is allowable after asking the lead author, Dr Mohamed Amine Ferrag, who has asserted his rights under copyright.
The details of the Edge-IIoT dataset were published in the following paper. For academic/public use of these datasets, the authors have to cite the following paper:
Mohamed Amine Ferrag, Othmane Friha, Djallel Hamouda, Leandros Maglaras, Helge Janicke, "Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT and IIoT Applications for Centralized and Federated Learning", IEEE Access, April 2022 (IF: 3.37), DOI: 10.1109/ACCESS.2022.3165809
Link to paper : https://ieeexplore.ieee.org/document/9751703
The directories of the Edge-IIoTset dataset include the following:
•File 1 (Normal traffic)
-File 1.1 (Distance): This file includes two documents, namely, Distance.csv and Distance.pcap. The IoT sensor (Ultrasonic sensor) is used to capture the IoT data.
-File 1.2 (Flame_Sensor): This file includes two documents, namely, Flame_Sensor.csv and Flame_Sensor.pcap. The IoT sensor (Flame Sensor) is used to capture the IoT data.
-File 1.3 (Heart_Rate): This file includes two documents, namely, Heart_Rate.csv and Heart_Rate.pcap. The IoT sensor (Heart Rate Sensor) is used to capture the IoT data.
-File 1.4 (IR_Receiver): This file includes two documents, namely, IR_Receiver.csv and IR_Receiver.pcap. The IoT sensor (IR (Infrared) Receiver Sensor) is used to capture the IoT data.
-File 1.5 (Modbus): This file includes two documents, namely, Modbus.csv and Modbus.pcap. The IoT sensor (Modbus Sensor) is used to capture the IoT data.
-File 1.6 (phValue): This file includes two documents, namely, phValue.csv and phValue.pcap. The IoT sensor (pH-sensor PH-4502C) is used to capture the IoT data.
-File 1.7 (Soil_Moisture): This file includes two documents, namely, Soil_Moisture.csv and Soil_Moisture.pcap. The IoT sensor (Soil Moisture Sensor v1.2) is used to capture the IoT data.
-File 1.8 (Sound_Sensor): This file includes two documents, namely, Sound_Sensor.csv and Sound_Sensor.pcap. The IoT sensor (LM393 Sound Detection Sensor) is used to capture the IoT data.
-File 1.9 (Temperature_and_Humidity): This file includes two documents, namely, Temperature_and_Humidity.csv and Temperature_and_Humidity.pcap. The IoT sensor (DHT11 Sensor) is used to capture the IoT data.
-File 1.10 (Water_Level): This file includes two documents, namely, Water_Level.csv and Water_Level.pcap. The IoT sensor (Water sensor) is used to capture the IoT data.
•File 2 (Attack traffic):
-File 2.1 (Attack traffic (CSV files)): This file includes 14 documents, namely, Backdoor_attack.csv, DDoS_HTTP_Flood_attack.csv, DDoS_ICMP_Flood_attack.csv, DDoS_TCP_SYN_Flood_attack.csv, DDoS_UDP_Flood_attack.csv, MITM_attack.csv, OS_Fingerprinting_attack.csv, Password_attack.csv, Port_Scanning_attack.csv, Ransomware_attack.csv, SQL_injection_attack.csv, Uploading_attack.csv, Vulnerability_scanner_attack.csv, XSS_attack.csv. Each document is specific to one attack.
-File 2.2 (Attack traffic (PCAP files)): This file includes 14 documents, namely, Backdoor_attack.pcap, DDoS_HTTP_Flood_attack.pcap, DDoS_ICMP_Flood_attack.pcap, DDoS_TCP_SYN_Flood_attack.pcap, DDoS_UDP_Flood_attack.pcap, MITM_attack.pcap, OS_Fingerprinting_attack.pcap, Password_attack.pcap, Port_Scanning_attack.pcap, Ransomware_attack.pcap, SQL_injection_attack.pcap, Uploading_attack.pcap, Vulnerability_scanner_attack.pcap, XSS_attack.pcap. Each document is specific to one attack.
•File 3 (Selected dataset for ML and DL):
-File 3.1 (DNN-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating deep learning-based intrusion detection systems.
-File 3.2 (ML-EdgeIIoT-dataset): This file contains a selected dataset for the use of evaluating traditional machine learning-based intrusion detection systems.
Step 1: Downloading the Edge-IIoTset dataset from the Kaggle platform
from google.colab import files
!pip install -q kaggle
files.upload()
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d mohamedamineferrag/edgeiiotset-cyber-security-dataset-of-iot-iiot -f "Edge-IIoTset dataset/Selected dataset for ML and DL/DNN-EdgeIIoT-dataset.csv"
!unzip DNN-EdgeIIoT-dataset.csv.zip
!rm DNN-EdgeIIoT-dataset.csv.zip
Step 2: Reading the dataset's CSV file into a Pandas DataFrame
import pandas as pd
import numpy as np
df = pd.read_csv('DNN-EdgeIIoT-dataset.csv', low_memory=False)
Step 3: Exploring some of the DataFrame's contents
df.head(5)
print(df['Attack_type'].value_counts())
Step 4: Dropping data (columns, duplicated rows, NaN, Null, ...)
from sklearn.utils import shuffle
drop_columns = ["frame.time", "ip.src_host", "ip.dst_host", "arp.src.proto_ipv4","arp.dst.proto_ipv4",
"http.file_data","http.request.full_uri","icmp.transmit_timestamp",
"http.request.uri.query", "tcp.options","tcp.payload","tcp.srcport",
"tcp.dstport", "udp.port", "mqtt.msg"]
df.drop(drop_columns, axis=1, inplace=True)
df.dropna(axis=0, how='any', inplace=True)
df.drop_duplicates(subset=None, keep="first", inplace=True)
df = shuffle(df)
df.isna().sum()
print(df['Attack_type'].value_counts())
Step 5: Categorical data encoding (Dummy Encoding)
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = f"{name}-{x}"
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)
encode_text_dummy(df,'http.request.method')
encode_text_dummy(df,'http.referer')
encode_text_dummy(df,"http.request.version")
encode_text_dummy(df,"dns.qry.name.len")
encode_text_dummy(df,"mqtt.conack.flags")
encode_text_dummy(df,"mqtt.protoname")
encode_text_dummy(df,"mqtt.topic")
Step 6: Creation of the preprocessed dataset
df.to_csv('preprocessed_DNN.csv', encoding='utf-8')
For more information about the dataset, please contact the lead author of this project, Dr Mohamed Amine Ferrag, on his email: mohamed.amine.ferrag@gmail.com
More information about Dr. Mohamed Amine Ferrag is available at:
https://www.linkedin.com/in/Mohamed-Amine-Ferrag
https://dblp.uni-trier.de/pid/142/9937.html
https://www.researchgate.net/profile/Mohamed_Amine_Ferrag
https://scholar.google.fr/citations?user=IkPeqxMAAAAJ&hl=fr&oi=ao
https://www.scopus.com/authid/detail.uri?authorId=56115001200
https://publons.com/researcher/1322865/mohamed-amine-ferrag/
https://orcid.org/0000-0002-0632-3172
Last Updated: 27 Mar. 2023
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home 🔒
The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.
The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.
{ "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }
The data fields are:
text: a string feature. The abbreviations of the speakers refer to the care worker (CW) and the care recipient (CR).
taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.
category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.
affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.
language: a string feature. Language code as defined by ISO 639.
locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.
data_type: a classification label, with possible values including real (0), synthetic (1).
uid: an int64 feature. A unique identifier within the dataset.
split: a string feature. Either train, validation, or test.

The dataset has 2 subsets:
split: with a total of 95 examples split into train, validation, and test (70%-15%-15%)
unsplit: with a total of 95 examples in a single train split

name | train | validation | test |
---|---|---|---|
split | 66 | 14 | 15 |
unsplit | 95 | n/a | n/a |
The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:
split-train-en.jsonl
split-validation-en.jsonl
split-test-en.jsonl
unsplit-train-en.jsonl
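A minimal sketch for reading one of these files (pandas parses JSON Lines directly; the field names follow the schema above):

import pandas as pd

df = pd.read_json("split-train-en.jsonl", lines=True)
print(df[["text", "taxonomy", "category"]].head())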
Recording audio of care workers and residents during care interactions, which include partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, a dataset was created that includes privacy-sensitive parts of conversations synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, in order to mask them and protect privacy.
The initial data was collected in the project Caring Robots of TU Wien in cooperation with Caritas Wien. One project track aims to facilitate Large Language Models (LLMs) to support documentation by care workers, with LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.
The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked using qualitative data analysis software by two experts. Subsequently, the sections were translated from German to U.S. English using the locally executed LLM icky/translate. In the next step, another model, llama3.1:70b, was used locally to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from the scikit-learn library (https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html).
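As an illustration, a 70%-15%-15% split like the one described can be produced with two chained train_test_split calls (the random seed and the use of the unsplit file are assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_json("unsplit-train-en.jsonl", lines=True)

# First carve off the 70% training portion ...
train_df, rest_df = train_test_split(df, test_size=0.30, random_state=42)
# ... then split the remaining 30% in half: 15% validation, 15% test
val_df, test_df = train_test_split(rest_df, test_size=0.50, random_state=42)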
https://creativecommons.org/publicdomain/zero/1.0/
This repository contains a comprehensive and clean dataset for predicting e-commerce sales, tailored for data scientists, machine learning enthusiasts, and researchers. The dataset is crafted to analyze sales trends, optimize pricing strategies, and develop predictive models for sales forecasting.
The dataset includes 1,000 records across the following features:
Column Name | Description |
---|---|
Date | The date of the sale (01-01-2023 onward). |
Product_Category | Category of the product (e.g., Electronics, Sports, Other). |
Price | Price of the product (numerical). |
Discount | Discount applied to the product (numerical). |
Customer_Segment | Buyer segment (e.g., Regular, Occasional, Other). |
Marketing_Spend | Marketing budget allocated for sales (numerical). |
Units_Sold | Number of units sold per transaction (numerical). |
Date: - Range: 01-01-2023 to 12-31-2023. - Contains 1,000 unique values without missing data.
Product_Category: - Categories: Electronics (21%), Sports (21%), Other (58%). - Most common category: Electronics (21%).
Price: - Range: From 244 to 999. - Mean: 505, Standard Deviation: 290. - Most common price range: 14.59 - 113.07.
Discount: - Range: From 0.01% to 49.92%. - Mean: 24.9%, Standard Deviation: 14.4%. - Most common discount range: 0.01 - 5.00%.
Customer_Segment: - Segments: Regular (35%), Occasional (34%), Other (31%). - Most common segment: Regular.
Marketing_Spend: - Range: From 2.41k to 10k. - Mean: 4.91k, Standard Deviation: 2.84k.
Units_Sold: - Range: From 5 to 57. - Mean: 29.6, Standard Deviation: 7.26. - Most common range: 24 - 34 units sold.
The dataset is suitable for creating the following visualizations:
1. Price Distribution: Histogram to show the spread of prices.
2. Discount Distribution: Histogram to analyze promotional offers.
3. Marketing Spend Distribution: Histogram to understand marketing investment patterns.
4. Customer Segment Distribution: Bar plot of customer segments.
5. Price vs Units Sold: Scatter plot to show pricing effects on sales.
6. Discount vs Units Sold: Scatter plot to explore the impact of discounts.
7. Marketing Spend vs Units Sold: Scatter plot for marketing effectiveness.
8. Correlation Heatmap: Identify relationships between features.
9. Pairplot: Visualize pairwise feature interactions.
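A minimal sketch for two of these visualizations with pandas and matplotlib (the CSV file name matches the regression example further below):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('ecommerce_sales.csv')

# 1. Price distribution
df['Price'].plot(kind='hist', bins=30, title='Price Distribution')
plt.show()

# 5. Price vs Units Sold
df.plot(kind='scatter', x='Price', y='Units_Sold', title='Price vs Units Sold')
plt.show()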
The dataset is synthetically generated to mimic realistic e-commerce sales trends. Below are the steps taken for data generation:
Feature Engineering:
Data Simulation:
Validation:
Note: The dataset is synthetic and not sourced from any real-world e-commerce platform.
Here’s an example of building a predictive model using Linear Regression:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the dataset
df = pd.read_csv('ecommerce_sales.csv')
# Feature selection
X = df[['Price', 'Discount', 'Marketing_Spend']]
y = df['Units_Sold']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model training
model = LinearRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is part of a machine learning project focused on predicting rainfall, a critical task for sectors like agriculture, water resource management, and disaster prevention. The project employs machine learning algorithms to forecast rainfall occurrences based on historical weather data, including features like temperature, humidity, and pressure.
The primary goal of the dataset is to train multiple machine learning models to predict rainfall and compare their performances. The insights gained will help identify the most accurate models for real-world predictions of rainfall events.
The dataset is derived from various historical weather observations, including temperature, humidity, wind speed, and pressure, collected by weather stations across Australia. These observations are used as inputs for training machine learning models. The dataset is publicly available on platforms like Kaggle and is often used in competitions and research to advance predictive analytics in meteorology.
The dataset consists of weather data from multiple Australian weather stations, spanning various time periods. Key features include:
Temperature
Humidity
Wind Speed
Pressure
Rainfall (target variable)
These features are tracked for each weather station over different times, with the goal of predicting rainfall.
Python: The primary programming language for data analysis and machine learning.
scikit-learn: For implementing machine learning models.
XGBoost, LightGBM, and CatBoost: Popular libraries for building more advanced ensemble models.
Matplotlib/Seaborn: For data visualization.
These libraries and tools help in data manipulation, modeling, evaluation, and visualization of results.
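As an illustrative sketch of such a comparison (the model pair, file name, and column names are assumptions; the project notebook defines the actual pipeline):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical weather features and binary rain target
df = pd.read_csv("weather.csv")
X = df[["Temperature", "Humidity", "WindSpeed", "Pressure"]]
y = df["RainTomorrow"]

# Cross-validated accuracy for each candidate model
for model in [LogisticRegression(max_iter=1000), RandomForestClassifier()]:
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())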
DBRepo Authorization: Required to access datasets via the DBRepo API for dataset retrieval.
Model Comparison Charts: The project includes output charts comparing the performance of seven popular machine learning models.
Trained Models (.pkl files): Pre-trained models are saved as .pkl files for reuse without retraining.
Documentation and Code: A Jupyter notebook guides through the process of data analysis, model training, and evaluation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 4. Table of optimal models used by seqQscorer. Optimal models are the best models for each data subset (Dataset column). Dataset: species-subset-layout, generic: all data. Feature sets: RAW (raw data), MAP (genome mapping), LOC (genomic localization), TSS (transcription start sites profile). Feature Selection: method-percentage (percentage of retained features), chi-square (chi2), recursive feature elimination (RFE). Algorithm Parameters: relevant to a scikit-learn implementation.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This repository contains some supporting data and code for the 2024 SESAR Innovation Days Conference paper "Dynamic capacity balancing in urban airspace: comparing historical and real-time aggregate flow data."
Note that this is a continuation of the work published here:
Morfin Veytia, Andres; Ellerbroek, Joost; Hoekstra, Jacco (2024): Supporting data and code for Decentralised Traffic Management for Constrained Urban Airspace: Dynamically Generating and Acting Upon Aggregate Flow Data. Version 2. 4TU.ResearchData. dataset. https://doi.org/10.4121/54825f14-8743-447d-8346-3afa46d319d6.v2
Therefore, much of the data and code is similar. However, this work provides some additional code and scenarios.
The main components needed to reproduce the results are:
1. BlueSky Simulator code
This includes the BlueSky code for simulating the scenarios. This is the bluesky.zip folder. Note that the code provided is a condensed version of the one in https://github.com/amorfinv/bluesky/tree/rotterdam. The plugins and scenarios are also provided in the simulator code. The plugins are based on those in the following repository: https://github.com/amorfinv/bluesky_plugins.
Refer to the README.md file provided to learn how to run the scenarios. Also, make sure to install a compatible python environment.
2. Post-processing code, plots, and logs
This includes the code to generate the plots seen in the paper and the logs of the simulations. It also includes some additional plots not shown in the paper. Read the README.md file for recreating the plots. This information can be found in main_experiment_results.zip. Some of the logs come from the previous work; those previous logs are labelled as real-time data in this paper.
3. Voronoi creation code
The file called generate_voronois.zip includes the code to generate the Voronoi regions used for the historical data concept in the paper. Note that generating the Voronoi regions requires a more recent version of geopandas, so a different Python environment is needed. All you need is python=3.12, geopandas=1.0.1, and scikit-learn=1.5.1.
4. Python environment description
This includes the python environment used to simulate, post-process, and generate the plots. This work used conda environments. The main packages used are those required by BlueSky in addition to geopandas, osmnx, and seaborn. Note that the voronoi creation requires a different python environment.