Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was acquired during Cyber Czech – a hands-on cyber defense exercise (Red Team/Blue Team) held in March 2019 at Masaryk University, Brno, Czech Republic. Network traffic flows and a wide variety of event logs were captured in an exercise network deployed in the KYPO Cyber Range Platform.
Contents
The dataset covers two distinct time intervals, which correspond to the official schedule of the exercise. The timestamps provided below are in the ISO 8601 date format.
The captured and collected data were normalized into three distinct event types and are stored as structured JSON. The data are sorted by timestamp, which represents the time at which they were observed. Each event type includes a raw payload ready for further processing and analysis. The description of the respective event types and the corresponding data files follows.
Each archive listed above includes a directory of the same name with the following four files, ready to be processed.
Finally, the exercise network topology is described in the machine-readable NetJSON format and is part of the auxiliary files archive (auxiliary-material.tgz), which includes the following.
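As a purely illustrative sketch of working with the event files, the following Python snippet reads one structured JSON event file and checks that its records are ordered by their observation timestamp; the file name and the field name "timestamp" are assumptions, so adjust them to the actual archives and schema listed above.

import json

# Hypothetical file name; substitute one of the JSON event files from the archives.
with open("events.json", encoding="utf-8") as f:
    events = json.load(f)  # expected: a list of event objects

# "timestamp" is an assumed field name holding the ISO 8601 observation time.
timestamps = [e["timestamp"] for e in events]
print("events loaded:", len(events))
print("sorted by timestamp:", timestamps == sorted(timestamps))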
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The European Repository of Cyber Incidents (EuRepoC) is releasing the Global Dataset of Cyber Incidents in Version 1.3 as an extract of our backend database. This official release contains fully consolidated cyber incident data reviewed by our interdisciplinary experts in the fields of politics, law and technology across all 60 variables covered by the European Repository. Version 1.3 covers the years 2000 – 2024 entirely. The Global Dataset is meant for reliable, evidence-based analysis. If you require real-time data, please refer to the download option in our TableView or contact us for special requirements (including API access).
The dataset now contains data on 3416 cyber incidents which started between 01.01.2000 and 31.12.2024. The European Repository of Cyber Incidents (EuRepoC) gathers, codes, and analyses publicly available information from over 220 sources and 600 Twitter accounts daily to report on dynamic trends in the global, and particularly the European, cyber threat environment.
For more information on the scope and data collection methodology see: https://eurepoc.eu/methodology
Full Codebook available here
Information about each file
Please scroll down this page to see all available files; Zenodo only displays the attribution dataset by default.
Global Database (csv or xlsx):
This file includes all variables coded for each incident, organised such that one row corresponds to one incident, our main unit of investigation. Where multiple codes are present for a single variable for a single incident, these are separated with semicolons within the same cell.
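For analysis, such multi-coded cells can be "unpacked" into one row per code. A minimal pandas sketch, where the file name and the column name receiver_country are illustrative assumptions:

import pandas as pd

df = pd.read_csv("eurepoc_global_database.csv")  # hypothetical file name

# Split a semicolon-separated cell into one row per code.
col = "receiver_country"  # assumed column name; use any multi-coded variable
unpacked = df.assign(**{col: df[col].str.split(";")}).explode(col)
unpacked[col] = unpacked[col].str.strip()
print(len(df), "incidents ->", len(unpacked), "rows after unpacking")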
Receiver Dataset (csv or xlsx):
In this file, the data of affected entities and individuals (receivers) is restructured to facilitate analysis. Each cell contains only a single code, with the data "unpacked" across multiple rows. Thus, a single incident can span several rows, identifiable through the unique identifier assigned to each incident (incident_id).
Attribution Dataset (csv or xlsx):
This file follows a similar approach to the receiver dataset. The attribution data is "unpacked" over several rows, allowing each cell to contain only one code. Here too, a single incident may occupy several rows, with the unique identifier enabling easy tracking of each incident (incident_id). In addition, some attributions may have multiple possible codes for one variable; these are also "unpacked" over several rows, with the attribution_id enabling each attribution to be tracked.
Dyadic Dataset (csv or xlsx):
The dyadic dataset places state dyads at the centre of analysis. Each row represents one cyber incident within a specific dyad. Because incidents may affect multiple receivers, a single incident can appear in several rows of this format when it affected multiple countries.
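Because the receiver and attribution files share the incident_id identifier with the Global Database, they can be joined back to the incident-level data. A minimal sketch (file names are assumptions):

import pandas as pd

incidents = pd.read_csv("global_database.csv")   # one row per incident (hypothetical name)
receivers = pd.read_csv("receiver_dataset.csv")  # several rows per incident (hypothetical name)

# Attach the incident-level variables to every unpacked receiver row.
merged = receivers.merge(incidents, on="incident_id", how="left",
                         suffixes=("_receiver", "_incident"))
print(merged.groupby("incident_id").size().head())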
https://whoisdatacenter.com/terms-of-use/
Strengthen your cyber defense with our extensive, daily-updated WHOIS database. Accessible in CSV, JSON, and XML, it's a crucial asset for any security strategy.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set was acquired using a survey which intends to measure:
• Participants' previous experience of cybersecurity training
• Participants' perception of ideal cybersecurity training
• Participants' perception of a specific cybersecurity training type called Context-Based Micro-Training
• Which usability aspects the participants find most important for security features
Data was acquired from Sweden, the UK, and Italy to allow for comparative analysis. Demographic data was collected to allow for further analysis along those lines. The files included in this data set are:
• Completesurvey: the full survey presented to the participants.
• Dataset: the variables and data for the different questions (available as .sav (SPSS) and .csv).
• Var_info: information about the variables in the dataset.
• Overview: frequency tables for the survey questions (for the complete data set).
• Sweden, UK, and Italy: frequency tables for the survey questions divided by national sample groups.
See attached description.
Kitsune Network Attack Dataset
This is a collection of nine network attack datasets captured from either an IP-based commercial surveillance system or a network full of IoT devices. Each dataset contains millions of network packets and a different cyber attack within it.
For each attack, you are supplied with:
- A preprocessed dataset in CSV format (ready for machine learning)
- The corresponding label vector in CSV format
- The original network capture in PCAP format (in case you want to engineer your own features)
We will now describe in detail what's in these datasets and how they were collected.
The Network Attacks
We have collected a wide variety of attacks which you would find in a real network intrusion. The following is a list of the available cyber attack datasets:
[Image: table of the available cyber attack datasets]
For more details on the attacks themselves, please refer to our NDSS paper (citation below).
The Data Collection
The following figure presents the network topologies which we used to collect the data, and the corresponding attack vectors at which the attacks were performed. The network capture took place at point 1 and point X at the router (where a network intrusion detection system could feasibly be placed). For each dataset, clean network traffic was captured for the first 1 million packets, then the cyber attack was performed.
The Dataset Format
Each preprocessed dataset CSV has m rows (packets) and 115 columns (features) with no header. The 115 features were extracted using our AfterImage feature extractor, described in our NDSS paper (see below) and available in Python here. In summary, the 115 features provide a statistical snapshot of the network (hosts and behaviors) in the context of the current packet traversing the network. The AfterImage feature extractor is unique in that it can efficiently process millions of streams (network channels) in real-time, incrementally, making it suitable for handling network traffic.
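A minimal loading sketch consistent with this format; the file names are placeholders for whichever attack you download, and the label file's exact layout should be checked before use.

import pandas as pd

# Feature matrix: m packets x 115 AfterImage features, no header row.
X = pd.read_csv("mirai_dataset.csv", header=None)  # hypothetical file name
assert X.shape[1] == 115

# Label vector shipped as a separate CSV (layout not specified above, so inspect it first).
y = pd.read_csv("mirai_labels.csv", header=None).squeeze("columns")  # hypothetical file name
print(X.shape, y.shape)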
Citation
If you use these datasets, please cite:
@inproceedings{mirsky2018kitsune,
  title={Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection},
  author={Mirsky, Yisroel and Doitshman, Tomer and Elovici, Yuval and Shabtai, Asaf},
  booktitle={The Network and Distributed System Security Symposium (NDSS) 2018},
  year={2018}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains supplementary materials for the following conference paper:
Valdemar Švábenský, Jan Vykopal, Pavel Čeleda.
What Are Cybersecurity Education Papers About? A Systematic Literature Review of SIGCSE and ITiCSE Conferences.
In Proceedings of the 51st ACM Technical Symposium on Computer Science Education (SIGCSE 2020).
https://doi.org/10.1145/3328778.3366816
Preprint available at: https://arxiv.org/abs/1911.11675
How to cite
If you use or build upon the materials, please use the BibTeX entry below to cite the original paper (not only this web link).
@inproceedings{Svabensky2020what,
author = {\v{S}v\'{a}bensk\'{y}, Valdemar and Vykopal, Jan and \v{C}eleda, Pavel},
title = {{What Are Cybersecurity Education Papers About? A Systematic Literature Review of SIGCSE and ITiCSE Conferences}},
booktitle = {Proceedings of the 51st ACM Technical Symposium on Computer Science Education},
series = {SIGCSE '20},
location = {Portland, OR, USA},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
month = {03},
year = {2020},
pages = {2--8},
numpages = {7},
isbn = {978-1-4503-6793-6},
url = {https://doi.org/10.1145/3328778.3366816},
doi = {10.1145/3328778.3366816},
}
Attached content
The file "SIGCSE 2020 Literature Review.xlsx" is an Excel spreadsheet with three sheets corresponding to 1) all papers found by automated search, 2) manually excluded papers, and 3) papers included in the literature review. There are also three CSV files that correspond to the three individual sheets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains data on 2889 cyber incidents between 01.01.2000 and 02.05.2024 using 60 variables, including the start date, names and categories of receivers along with names and categories of initiators. The database was compiled as part of the European Repository of Cyber Incidents (EuRepoC) project.
EuRepoC gathers, codes, and analyses publicly available information from over 200 sources and 600 Twitter accounts daily to report on dynamic trends in the global, and particularly the European, cyber threat environment.
For more information on the scope and data collection methodology see: https://eurepoc.eu/methodology
Codebook available here
Information about each file:
Global Database (csv or xlsx):
This file includes all variables coded for each incident, organised such that one row corresponds to one incident, our main unit of investigation. Where multiple codes are present for a single variable for a single incident, these are separated with semicolons within the same cell.
Receiver Dataset (csv):
In this file, the data of affected entities and individuals (receivers) is restructured to facilitate analysis. Each cell contains only a single code, with the data "unpacked" across multiple rows. Thus, a single incident can span several rows, identifiable through the unique identifier assigned to each incident (incident_id).
Attribution Dataset (csv):
This file follows a similar approach to the receiver dataset. The attribution data is "unpacked" over several rows, allowing each cell to contain only one code. Here too, a single incident may occupy several rows, with the unique identifier enabling easy tracking of each incident (incident_id). In addition, some attributions may have multiple possible codes for one variable; these are also "unpacked" over several rows, with the attribution_id enabling each attribution to be tracked.
eurepoc_global_database_1.2 (json):
This file contains the whole database in JSON format.
DomainIQ is a comprehensive global Domain Name dataset for organizations that want to build cyber security, data cleaning and email marketing applications. The dataset consists of the DNS records for over 267 million domains, updated daily, representing more than 90% of all public domains in the world.
The data is enriched by over thirty unique data points, including identifying the mailbox provider for each domain and using AI based predictive analytics to identify elevated risk domains from both a cyber security and email sending reputation perspective.
DomainIQ from Datazag offers layered intelligence through a highly flexible API and as a dataset, available for both cloud and on-premises applications. Standard formats include CSV, JSON, Parquet, and DuckDB.
Custom options are available for any other file or database format. With daily updates and constant research from Datazag, organizations can develop their own market-leading cyber security, data cleaning and email marketing applications supported by comprehensive and accurate data from Datazag. Data updates are available on a daily, weekly or monthly basis; API data is updated daily.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hornet 40 is a dataset of 40 days of network traffic attacks captured in cloud servers used as honeypots to help understand how geography may impact the inflow of network attacks. The honeypots are located in eight different cities: Amsterdam, London, Frankfurt, San Francisco, New York, Singapore, Toronto, Bangalore. The data was captured in April, May, and June 2021.
The eight cloud servers were created and configured simultaneously following identical instructions. The network capture was performed using the Argus network monitoring tool in each cloud server. The cloud servers had only one service running (SSH on a non-standard port) and were fully dedicated as a honeypot. No honeypot software was used in this dataset.
The dataset consists of eight scenarios, one for each geographically located cloud server. Each scenario contains bidirectional NetFlow files, distributed in the following archives:
- hornet40-biargus.tar.gz: all scenarios with bidirectional NetFlow files in Argus binary format
- hornet40-netflow-v5.tar.gz: all scenarios with bidirectional NetFlow v5 files in CSV format
- hornet40-netflow-extended.tar.gz: all scenarios with bidirectional NetFlow files in CSV format containing all features provided by Argus
- hornet40-full.tar.gz: all of the above (biargus, NetFlow v5, and extended NetFlows) in a single download
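As an illustration of how the CSV archives might be consumed, the sketch below concatenates every CSV member of the NetFlow v5 archive into one table; the member names inside the tarball are not documented here, so treat them as assumptions and list them first if needed.

import tarfile
import pandas as pd

frames = []
with tarfile.open("hornet40-netflow-v5.tar.gz", "r:gz") as tar:
    for member in tar.getmembers():
        if member.name.endswith(".csv"):
            df = pd.read_csv(tar.extractfile(member))
            df["scenario"] = member.name  # track which cloud server the flows came from
            frames.append(df)

netflows = pd.concat(frames, ignore_index=True)
print(netflows["scenario"].nunique(), "scenario files loaded")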
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This traffic dataset contains a balanced mix of encrypted malicious and legitimate traffic for encrypted malicious traffic detection and analysis. The dataset is secondary CSV feature data composed from six public traffic datasets.
Our dataset is curated based on two criteria. The first criterion is to combine public datasets that are widely used in existing works and contain enough encrypted malicious or encrypted legitimate traffic, such as the Malware Capture Facility Project datasets. The second criterion is to ensure that the final dataset is balanced between encrypted malicious and legitimate network traffic.
Based on these criteria, six public datasets were selected. After data pre-processing, details of each selected public dataset and the size of the different encrypted traffic are shown in the “Dataset Statistic Analysis Document”. The document summarizes the malicious and legitimate traffic size selected from each public dataset, the traffic size of each malicious traffic type, and the total traffic size of the composed dataset. From the table, we can observe that encrypted malicious and legitimate traffic each contribute approximately 50% of the final composed dataset.
The datasets made available here were prepared for encrypted malicious traffic detection. Since the dataset is intended for machine learning or deep learning model training, sample train and test sets are also provided, split in a 1:4 ratio. These sets can be used for model training and testing based on selected features or after further data pre-processing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For the home environment we have: 01 Wifi Modem Router, 03 Smartphones, 01 server, 01 desktop, 01 Multifunction Printer, 01 network extender, 01 SmartTV, 01 Cable TV decoder and 01 firewall. This environment is a local network. The server hosts the Monitoring Environment and has a network card that provides connectivity and receives all network traffic for analysis.
The results were obtained from Suricata and from Telegraf collections of the TICK stack. All evidence was gathered through queries in EveBox, which received data from Suricata, and through Grafana graphics built from information extracted from the InfluxDB (Grafana) and PostgreSQL (EveBox) databases.
events.csv.gz - Suricata / Evebox collections
net.csv.gz - Telegraf collections from the TICK stack
netstat.csv.gz - Telegraf collections from the TICK stack
For correlation purposes, use the events.csv.gz file as a basis. The correlation key is the 'timestamp' column in events.csv.gz, matched against the 'time' column in the net.csv.gz and netstat.csv.gz files.
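A hedged pandas sketch of that correlation: merge_asof pairs each Telegraf sample with the nearest Suricata event in time, since the two clocks will rarely match exactly; the 30-second tolerance is an illustrative choice, not part of the dataset.

import pandas as pd

events = pd.read_csv("events.csv.gz")
net = pd.read_csv("net.csv.gz")

events["timestamp"] = pd.to_datetime(events["timestamp"], utc=True)
net["time"] = pd.to_datetime(net["time"], utc=True)

# merge_asof requires both sides to be sorted on the join keys.
events = events.sort_values("timestamp")
net = net.sort_values("time")

correlated = pd.merge_asof(net, events,
                           left_on="time", right_on="timestamp",
                           direction="nearest",
                           tolerance=pd.Timedelta("30s"))  # illustrative tolerance
print(len(correlated), "correlated rows")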
The collections were taken at non-consecutive intervals from 2018-09-15 to 2019-02-04.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is rooted in a study aimed at unveiling the origins and motivations behind the creation of malware repositories on GitHub. Our research embarks on an innovative journey to dissect the profiles and intentions of GitHub users who have been involved in this dubious activity.
Employing a robust methodology, we meticulously identified 14,000 GitHub users linked to malware repositories. By leveraging advanced large language model (LLM) analytics, we classified these individuals into distinct categories based on their perceived intent: 3,339 were deemed Malicious, 3,354 Likely Malicious, and 7,574 Benign, offering a nuanced perspective on the community behind these repositories.
Our analysis penetrates the veil of anonymity and obscurity often associated with these GitHub profiles, revealing stark contrasts in their characteristics. Malicious authors were found to typically possess sparse profiles focused on nefarious activities, while Benign authors presented well-rounded profiles, actively contributing to cybersecurity education and research. Those labeled as Likely Malicious exhibited a spectrum of engagement levels, underlining the complexity and diversity within this digital ecosystem.
We are offering two datasets in this paper. First, a list of malware repositories: we collected and extended the malware repositories on GitHub in 2022, following the original papers. Second, a CSV file with the GitHub users' information and their maliciousness classification label.
malware_repos.txt
Each line lists one repository in the format username/reponame. This format allows for easy identification of, and access to, each repository on GitHub for further analysis or review.
obfuscated_github_user_dataset.csv
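A small, purely illustrative sketch for turning the repository list into GitHub URLs:

# Read the repository list (one username/reponame per line) and build GitHub URLs.
with open("malware_repos.txt", encoding="utf-8") as f:
    repos = [line.strip() for line in f if line.strip()]

urls = [f"https://github.com/{repo}" for repo in repos]
print(len(urls), "repositories")
print(urls[:3])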
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1. Introduction
In the digital era of the Industrial Internet of Things (IIoT), conventional Critical Infrastructures (CIs) are transformed into smart environments with multiple benefits, such as pervasive control, self-monitoring and self-healing. However, this evolution is accompanied by several cyberthreats due to the necessary presence of insecure technologies. DNP3 is an industrial communication protocol which is widely adopted in the CIs of the US. In particular, DNP3 allows the remote communication between Industrial Control Systems (ICS) and Supervisory Control and Data Acquisition (SCADA) systems. It can support various topologies, such as Master-Slave, Multi-Drop, Hierarchical and Multiple-Server. Initially, the architectural model of DNP3 consisted of three layers: (a) Application Layer, (b) Transport Layer and (c) Data Link Layer. However, DNP3 can now be incorporated into the Transmission Control Protocol/Internet Protocol (TCP/IP) stack as an application-layer protocol. Similarly to other industrial protocols (e.g., Modbus and IEC 60870-5-104), DNP3 is characterised by severe security issues since it does not include any authentication or authorisation mechanisms. More information about the DNP3 security issues is provided in [1-3].
This dataset contains labelled Transmission Control Protocol (TCP)/Internet Protocol (IP) network flow statistics (Comma-Separated Values, CSV format) and DNP3 flow statistics (CSV format) related to 9 DNP3 cyberattacks. These cyberattacks focus on DNP3 unauthorised commands and Denial of Service (DoS). The network traffic data are provided as Packet Capture (PCAP) files. Consequently, this dataset can be used to implement Artificial Intelligence (AI)-powered Intrusion Detection and Prevention Systems (IDPS) that rely on Machine Learning (ML) and Deep Learning (DL) techniques.
2. Instructions
This DNP3 Intrusion Detection Dataset was implemented following the methodological frameworks of A. Gharib et al. in [4] and S. Dadkhah et al. in [5], including eleven features: (a) Complete Network Configuration, (b) Complete Traffic, (c) Labelled Dataset, (d) Complete Interaction, (e) Complete Capture, (f) Available Protocols, (g) Attack Diversity, (h) Heterogeneity, (i) Feature Set and (j) Metadata.
A network topology consisting of (a) eight industrial entities, (b) one Human Machine Interface (HMI) and (c) three cyberattackers was used to implement this DNP3 Intrusion Detection Dataset. In particular, the following cyberattacks were implemented.
The aforementioned DNP3 cyberattacks were executed utilising penetration testing tools such as Nmap and Scapy. For each attack, a relevant folder is provided, including the network traffic and the network flow statistics for each entity. In particular, for each cyberattack, a folder is given, providing (a) the PCAP files for each entity, (b) the Transmission Control Protocol (TCP)/Internet Protocol (IP) network flow statistics for 120 seconds in CSV format and (c) the DNP3 flow statistics for each entity (using different timeout values in seconds: 45, 60, 75, 90, 120 and 240). The TCP/IP network flow statistics were produced using CICFlowMeter, while the DNP3 flow statistics were generated by a Custom DNP3 Python Parser, taking full advantage of Scapy.
3. Dataset Structure
The dataset consists of the following folders:
Each folder includes respective subfolders related to the entities/devices (described in the following section) participating in each attack. In particular, for each entity/device, there is a folder including (a) the DNP3 network traffic (pcap file) related to this entity/device during each attack, (b) the TCP/IP network flow statistics (CSV file) generated by CICFlowMeter for the timeout value of 120 seconds and finally (c) the DNP3 flow statistics (CSV file) from the Custom DNP3 Python Parser. Finally, it is noteworthy that the network flows from both CICFlowMeter and Custom DNP3 Python Parser in each CSV file are labelled based on the DNP3 cyberattacks executed for the generation of this dataset. The description of these attacks is provided in the following section, while the various features from CICFlowMeter and Custom DNP3 Python Parser are presented in Section 5.
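A minimal pandas sketch for inspecting one of the labelled flow files; the path below is hypothetical, and the label column name is an assumption (CICFlowMeter output commonly carries a "Label" column, but verify against the actual CSV header).

import pandas as pd

# Hypothetical path: one entity's CICFlowMeter output inside one attack folder.
flows = pd.read_csv("20200515_DNP3_Cold_Restart_Attack/entity1/flows_120s.csv")

label_col = "Label"  # assumed column name; check the header of the real file
print(flows[label_col].value_counts())  # distribution of benign vs. attack flows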
4. Testbed & DNP3 Attacks
The following figure shows the testbed utilised for the generation of this dataset. It is composed of eight industrial entities that play the role of the DNP3 outstations/slaves, such as Remote Terminal Units (RTUs) and Intelligent Electronic Devices (IEDs). Moreover, there is another workstation which plays the role of the master station, i.e., a Master Terminal Unit (MTU). For the communication between the DNP3 outstations/slaves and the master station, opendnp3 was used.
Table 1: DNP3 Attacks Description
DNP3 Attack | Description | Dataset Folder
DNP3 Disable Unsolicited Message Attack | This attack targets a DNP3 outstation/slave, establishing a connection with it while acting as a master station. The false master then transmits a packet with the DNP3 Function Code 21, which requests to disable all the unsolicited messages on the target. | 20200514_DNP3_Disable_Unsolicited_Messages_Attack
DNP3 Cold Restart Attack | The malicious entity acts as a master station and sends a DNP3 packet that includes the “Cold Restart” function code. When the target receives this message, it initiates a complete restart and sends back a reply with the time window before the restart process. | 20200515_DNP3_Cold_Restart_Attack
DNP3 Warm Restart Attack | This attack is quite similar to the “Cold Restart” attack, but aims to trigger a partial restart, re-initiating a DNP3 service on the target outstation. | 20200515_DNP3_Warm_Restart_Attack
DNP3 Enumerate Attack | This reconnaissance attack aims to discover which DNP3 services and function codes are used by the target system. | 20200516_DNP3_Enumerate
DNP3 Info Attack | This attack constitutes another reconnaissance attempt, aggregating various DNP3 diagnostic information related to the DNP3 usage. | 20200516_DNP3_Info
Data Initialisation Attack | This cyberattack is related to Function Code 15 (Initialize Data). It is an unauthorised access attack which demands that the slave re-initialise possible configurations to their initial values, thus changing potential values defined by legitimate masters. | 20200518_Initialize_Data_Attack
MITM-DoS Attack | In
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Network traffic datasets created by Single Flow Time Series Analysis
Datasets were created for the paper "Network Traffic Classification Based on Single Flow Time Series Analysis" by Josef Koumar, Karel Hynek, and Tomáš Čejka, published at the 19th International Conference on Network and Service Management (CNSM) 2023. Please cite usage of our datasets as:
J. Koumar, K. Hynek and T. Čejka, "Network Traffic Classification Based on Single Flow Time Series Analysis," 2023 19th International Conference on Network and Service Management (CNSM), Niagara Falls, ON, Canada, 2023, pp. 1-7, doi: 10.23919/CNSM59352.2023.10327876.
This Zenodo repository contains 23 datasets created from 15 well-known published datasets which are cited in the table below. Each dataset contains 69 features created by Time Series Analysis of Single Flow Time Series. The detailed description of features from datasets is in the file: feature_description.pdf
The following table describes each dataset file (a loading sketch follows the table):
File name | Detection problem | Citation of original raw dataset |
botnet_binary.csv | Binary detection of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
botnet_multiclass.csv | Multi-class classification of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
cryptomining_design.csv | Binary detection of cryptomining; the design part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
cryptomining_evaluation.csv | Binary detection of cryptomining; the evaluation part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
dns_malware.csv | Binary detection of malware DNS | Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021. |
doh_cic.csv | Binary detection of DoH | Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020 |
doh_real_world.csv | Binary detection of DoH | Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022 |
dos.csv | Binary detection of DoS | Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019. |
edge_iiot_binary.csv | Binary detection of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
edge_iiot_multiclass.csv | Multi-class classification of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
https_brute_force.csv | Binary detection of HTTPS Brute Force | Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020 |
ids_cic_binary.csv | Binary detection of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018. |
ids_cic_multiclass.csv | Multi-class classification of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018. |
ids_unsw_nb_15_binary.csv | Binary detection of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015. |
ids_unsw_nb_15_multiclass.csv | Multi-class classification of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015. |
iot_23.csv | Binary detection of IoT malware | Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details at https://www.stratosphereips.org/datasets-iot23 |
ton_iot_binary.csv | Binary detection of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021 |
ton_iot_multiclass.csv | Multi-class classification of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021 |
tor_binary.csv | Binary detection of TOR | Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017. |
tor_multiclass.csv | Multi-class classification of TOR | Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017. |
vpn_iscx_binary.csv | Binary detection of VPN | Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related Features. In ICISSP, pages 407–414, 2016. |
vpn_iscx_multiclass.csv | Multi-class classification of VPN | Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related Features. In ICISSP, pages 407–414, 2016. |
vpn_vnat_binary.csv | Binary detection of VPN | Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022 |
vpn_vnat_multiclass.csv | Multi-class classification of VPN | Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022 |
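As a hedged sketch of how one of these per-flow feature files could be prepared for model training: the target column name used below is an assumption, so consult feature_description.pdf for the actual schema.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("botnet_binary.csv")

target_col = "label"  # assumed target column name; see feature_description.pdf
X = df.drop(columns=[target_col])
y = df[target_col]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)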
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a collection of datasets from different sources related to the automatic detection of cyber-bullying. The data come from different platforms, including Kaggle, Twitter, Wikipedia Talk pages and YouTube, and contain text labeled as bullying or not. The data cover different types of cyber-bullying, such as hate speech, aggression, insults and toxicity.
Elsafoury, Fatma (2020), “Cyberbullying datasets”, Mendeley Data, V1, doi: 10.17632/jf4pzyvnpj.1
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MIL-STD-1553 is a military standard that defines the protocol characteristics of a data bus medium for the exchange of information between various subsystems. Although the threat of cyber-attacks on the MIL-STD-1553 protocol has become a growing concern in recent years, little work has been published on detecting such attacks. One of the primary reasons for this is the confidentiality of data recorded from buses on operational systems and as a result, lack of data availability. Moreover, existing research doesn’t sufficiently emphasize the complexity of detecting attacks that can be camouflaged by normal non-periodic messages that the MIL-STD-1553 supports.
We present three datasets of synthesized MIL-STD-1553 traffic containing injected RT Spoofing Attack messages. The implemented attacks emulate normal non-periodic communication, so detecting them with a low false positive rate is non-trivial. Each dataset is separated into a training set of normal messages and a test set of both normal and attack messages. The test sets differ by the occurrence rate of attack messages (0.01%, 0.1%, and 1%). Each dataset is also preprocessed into a dataset of message sequences so that it can be used for sequential anomaly detection analysis. The sequential test sets differ by the occurrence rate of attack sequences (0.14%, 1.26%, and 11.01%). A Java program for generating the sequence datasets from the message stream datasets is also included, so users can generate new sequence datasets with a different sequence length, or with a labeling based on whether the message was injected rather than on whether it affected the aircraft's behavior.
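The preprocessed sequence files and the Java generator ship with the data; purely as an illustration of the sliding-window idea, a Python sketch over a labelled message stream could look like the following, where the column name "label", the window length, and the any-attack-in-window labeling rule are all assumptions rather than the dataset's actual convention.

import pandas as pd

def to_sequences(messages: pd.DataFrame, length: int = 8, label_col: str = "label"):
    # Slide a fixed-length window over the message stream; mark a sequence as an
    # attack sequence if any message inside the window is an attack message.
    values = messages.drop(columns=[label_col]).to_numpy()
    flags = messages[label_col].to_numpy()
    sequences, labels = [], []
    for start in range(len(messages) - length + 1):
        sequences.append(values[start:start + length])
        labels.append(int(flags[start:start + length].any()))
    return sequences, labels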
These datasets are intended to serve three primary purposes: (1) evaluate the ability of MIL-STD-1553 intrusion detection systems (IDS) to detect attacks that emulate normal non-periodic traffic; (2) evaluate IDSs on differing occurrence rates of attacks; (3) evaluate and compare IDSs that operate on non-sequential data as well as IDSs that operate on sequential data.
Please refer to the linked data description document for the full details of the data synthesis process, the motivation for our preprocessing into sequences, and the format of the CSV files. This document also provides relevant background on detecting Spoofing Attacks from MIL-STD-1553 Traffic.
Attribution 1.0 (CC BY 1.0): https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
This dataset contains all exploits registered on the exploit-db website from 02 January 2019 to 06 November 2020. 2,665 exploits were found in this time range and stored in a CSV file. The CSV fields are as follows:
Altosight | AI Custom Web Scraping Data
✦ Altosight provides global web scraping data services with AI-powered technology that bypasses CAPTCHAs, blocking mechanisms, and handles dynamic content.
We extract data from marketplaces like Amazon, aggregators, e-commerce, and real estate websites, ensuring comprehensive and accurate results.
✦ Our solution offers free unlimited data points across any project, with no additional setup costs.
We deliver data through flexible methods such as API, CSV, JSON, and FTP, all at no extra charge.
― Key Use Cases ―
➤ Price Monitoring & Repricing Solutions
🔹 Automatic repricing, AI-driven repricing, and custom repricing rules
🔹 Receive price suggestions via API or CSV to stay competitive
🔹 Track competitors in real-time or at scheduled intervals
➤ E-commerce Optimization
🔹 Extract product prices, reviews, ratings, images, and trends
🔹 Identify trending products and enhance your e-commerce strategy
🔹 Build dropshipping tools or marketplace optimization platforms with our data
➤ Product Assortment Analysis
🔹 Extract the entire product catalog from competitor websites
🔹 Analyze product assortment to refine your own offerings and identify gaps
🔹 Understand competitor strategies and optimize your product lineup
➤ Marketplaces & Aggregators
🔹 Crawl entire product categories and track best-sellers
🔹 Monitor position changes across categories
🔹 Identify which eRetailers sell specific brands and which SKUs for better market analysis
➤ Business Website Data
🔹 Extract detailed company profiles, including financial statements, key personnel, industry reports, and market trends, enabling in-depth competitor and market analysis
🔹 Collect customer reviews and ratings from business websites to analyze brand sentiment and product performance, helping businesses refine their strategies
➤ Domain Name Data
🔹 Access comprehensive data, including domain registration details, ownership information, expiration dates, and contact information. Ideal for market research, brand monitoring, lead generation, and cybersecurity efforts
➤ Real Estate Data
🔹 Access property listings, prices, and availability
🔹 Analyze trends and opportunities for investment or sales strategies
― Data Collection & Quality ―
► Publicly Sourced Data: Altosight collects web scraping data from publicly available websites, online platforms, and industry-specific aggregators
► AI-Powered Scraping: Our technology handles dynamic content, JavaScript-heavy sites, and pagination, ensuring complete data extraction
► High Data Quality: We clean and structure unstructured data, ensuring it is reliable, accurate, and delivered in formats such as API, CSV, JSON, and more
► Industry Coverage: We serve industries including e-commerce, real estate, travel, finance, and more. Our solution supports use cases like market research, competitive analysis, and business intelligence
► Bulk Data Extraction: We support large-scale data extraction from multiple websites, allowing you to gather millions of data points across industries in a single project
► Scalable Infrastructure: Our platform is built to scale with your needs, allowing seamless extraction for projects of any size, from small pilot projects to ongoing, large-scale data extraction
― Why Choose Altosight? ―
✔ Unlimited Data Points: Altosight offers unlimited free attributes, meaning you can extract as many data points from a page as you need without extra charges
✔ Proprietary Anti-Blocking Technology: Altosight utilizes proprietary techniques to bypass blocking mechanisms, including CAPTCHAs, Cloudflare, and other obstacles. This ensures uninterrupted access to data, no matter how complex the target websites are
✔ Flexible Across Industries: Our crawlers easily adapt across industries, including e-commerce, real estate, finance, and more. We offer customized data solutions tailored to specific needs
✔ GDPR & CCPA Compliance: Your data is handled securely and ethically, ensuring compliance with GDPR, CCPA and other regulations
✔ No Setup or Infrastructure Costs: Start scraping without worrying about additional costs. We provide a hassle-free experience with fast project deployment
✔ Free Data Delivery Methods: Receive your data via API, CSV, JSON, or FTP at no extra charge. We ensure seamless integration with your systems
✔ Fast Support: Our team is always available via phone and email, resolving over 90% of support tickets within the same day
― Custom Projects & Real-Time Data ―
✦ Tailored Solutions: Every business has unique needs, which is why Altosight offers custom data projects. Contact us for a feasibility analysis, and we’ll design a solution that fits your goals
✦ Real-Time Data: Whether you need real-time data delivery or scheduled updates, we provide the flexibility to receive data when you need it. Track price changes, monitor product trends, or gather...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a collection of manually curated timelines from the 2019 Collegiate Penetration Testing Competition (CPTC). Collection and annotation are described in detail in this publication:
Included Files
Other Resources
Contact
Please contact Benjamin S. Meyers (email) with questions about this data and its collection.
Acknowledgments
Collection of this data has been sponsored in part by the National Science Foundation grant 1922169, and by a Department of Defense DARPA SBIR program (grant 140D63-19-C-0018).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For the Internet environment we have: 01 Switch, 01 IP camera, 01 server for monitoring, 01 server for honeypot and no firewall. This environment is directly connected to the Internet. We installed a server functioning as the Monitoring Environment; the network traffic was obtained via Port Mirroring on the switch to the Monitoring Environment server.
The results were obtained from Suricata and from Telegraf collections of the TICK stack. All evidence was gathered through queries in EveBox, which received data from Suricata, and through Grafana graphics built from information extracted from the InfluxDB (Grafana) and PostgreSQL (EveBox) databases.
events.csv.gz - Suricata / Evebox collections
net.csv.gz - Telegraf collections from the TICK stack
netstat.csv.gz - Telegraf collections from the TICK stack
For correlation purposes, use the events.csv.gz file as a basis. The correlation key is the 'timestamp' column in events.csv.gz, matched against the 'time' column in the net.csv.gz and netstat.csv.gz files.
The collections were taken at non-consecutive intervals from 2018-08-28 to 2019-11-14.