https://creativecommons.org/publicdomain/zero/1.0/
Context
The data presented here was obtained on a Kali Linux machine at the University of Cincinnati (Cincinnati, Ohio) by carrying out packet captures with Wireshark for one hour during the evening of October 9th, 2023. The dataset consists of 394,137 instances stored in a CSV (Comma-Separated Values) file. This large dataset can be used for different machine learning applications, for instance network traffic classification, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.
The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.
Content:
This network traffic dataset consists of 7 features. Each instance contains the source and destination IP addresses. The majority of the attributes are numeric, but there are also nominal and date types due to the Timestamp.
The network traffic flow statistics (No., Time, Source, Destination, Protocol, Length, Info) were obtained using Wireshark (https://www.wireshark.org/).
Dataset Columns:
No: Number of the instance.
Timestamp: Timestamp of the network traffic instance.
Source IP: IP address of the source.
Destination IP: IP address of the destination.
Protocol: Protocol used by the instance.
Length: Length of the instance.
Info: Information about the traffic instance.
Acknowledgements:
We would like to thank the University of Cincinnati for providing the infrastructure used to generate this network traffic dataset.
Ravikumar Gattu , Susmitha Choppadandi
Inspiration: This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP, ARP, RARP) that an IP flow contains. Instead, it supports building machine learning models that can identify specific applications (such as TikTok, Wikipedia, Instagram, YouTube, websites, blogs, etc.) from IP flow statistics (there are currently 25 applications in total).
**Dataset License:** CC0: Public Domain
Dataset Usages: This dataset can be used for different machine learning applications in the field of cybersecurity, such as network traffic classification, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.
ML techniques that benefit from this dataset:
This dataset is highly useful because it consists of 394,137 instances of network traffic generated by using the 25 applications on public, private, and enterprise networks. It also contains features that are important for most applications of machine learning in cybersecurity. A few of the potential machine learning applications that could benefit from this dataset are listed below (a loading and classification sketch follows the list):
1. Network Performance Monitoring: This large network traffic dataset can be used to analyse traffic and identify patterns in the network, which helps in designing network security algorithms that minimise network problems.
2. Anomaly Detection: The dataset can be used to train machine learning models that find irregularities in the traffic, which could help identify cyber attacks.
3. Network Intrusion Detection: The dataset can be used to train machine learning algorithms and design models for detecting traffic issues, malicious traffic, network attacks, and DoS attacks.
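As a minimal sketch of how such an export could be loaded and used for a simple traffic classification experiment (the file name is hypothetical and the exact column names may differ slightly from the Wireshark export), assuming Python with pandas and scikit-learn:

```python
# Minimal sketch: load the Wireshark CSV export and train a simple protocol classifier.
# Assumptions: a hypothetical file name "network_traffic.csv" and the columns described
# above (No, Timestamp/Time, Source, Destination, Protocol, Length, Info).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("network_traffic.csv")

def is_private(ip_series):
    # Crude indicator feature: private (RFC 1918) address prefixes.
    s = ip_series.astype(str)
    return (s.str.startswith("10.") | s.str.startswith("192.168.")).astype(int)

# Very small numeric feature set; real work would engineer flow-level statistics.
X = pd.DataFrame({
    "length": df["Length"],
    "src_private": is_private(df["Source"]),
    "dst_private": is_private(df["Destination"]),
})
y = df["Protocol"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))
```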
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
'InputData' Directory
This network dataset is an example of a network to which paths from other networks (i.e. Networks A and B) can be compared. Contains two directories:
a) NetworkC
b) NetworkPaths
'NetworkC' Directory
- This network is based upon a subset of the Missouri Department of Transportation (MoDOT) July 2016 road dataset listed in the references.
- NetworkC contains an ESRI .gdb (NetworkCdata.gdb) in which the arcs and nodes for Network C can be found as well as an ArcGIS ND Network Analyst configuration file.
  - Feature dataset: NetworkCsub
  - Network arcs: NetworkCsub
  - Network file: NetworkCsub_ND
  - Network junctions: NetworkCsub_ND_Junctions
'NetworkPaths' contains ESRI .gdbs representing:
a) A collection of routes between OD pairs in each network (InputPaths.gdb)
  - The densified routes used in the application (densified at 10 m): Net_A_routelines; Net_B_routelines; Net_C_routelines
  - The original routes with the original set of vertices (non-densified): Net_A_routes; Net_B_routes; Net_C_routes
b) The origin and destination points for the paths (ODNodes.gdb)
  - These were used to generate the shortest paths for each network, serving as the paths to be compared
  - origins: originLocations
  - destinations: destinationLocations
'OutputData' Directory
Contains the comparisons of paths to networks:
- NetAToB: comparison of paths from network A to network B
- NetAToC: comparison of paths from network A to network C
- NetBToA: comparison of paths from network B to network A
- NetBToC: comparison of paths from network B to network C
- NetCToA: comparison of paths from network C to network A
- NetCToB: comparison of paths from network C to network B
Inside each directory is a collection of ESRI .gdbs which contain the individual paths used in the analysis as input:
a) NetworkAPaths.gdb
b) NetworkBPaths.gdb
c) NetworkCPaths.gdb
Inside each directory is a collection of ESRI .gdbs which contain the vertices of the individual paths used in the analysis as input:
a) NetworkAPathPoints.gdb
b) NetworkBPathPoints.gdb
c) NetworkCPathPoints.gdb
Also included is a collection of ESRI .gdbs that represent the original path nodes that could be assigned to the comparison network. In this case, only nodes that were within 20 m of the comparison network could be assigned. Each path node is attributed with the distance to its counterpart node in the comparison.
a) Nodes in Network A paths assigned to Network B (PathANodesAssignedtoNetB.gdb)
b) Nodes in Network A paths assigned to Network C (PathANodesAssignedtoNetC.gdb)
c) Nodes in Network B paths assigned to Network A (PathBNodesAssignedtoNetA.gdb)
d) Nodes in Network B paths assigned to Network C (PathBNodesAssignedtoNetC.gdb)
e) Nodes in Network C paths assigned to Network A (PathCNodesAssignedtoNetA.gdb)
f) Nodes in Network C paths assigned to Network B (PathCNodesAssignedtoNetB.gdb)
Inside each directory is a collection of ESRI .gdbs which contain solutions to the SCPPOD with the following naming convention:
a) Comparing paths in Network A to Network B (SCCPODarcsPathAtoNetB.gdb for arc elements and SCCPODnodesPathAtoNetB.gdb for node elements)
  - The naming convention for the node solutions for path id X is 'SN_routeX_X'
  - The naming convention for the arc solutions for path id X is 'routX_Rt' for the single polyline counterpart path, and 'routeX_Rtsplit' for a polyline representation of the counterpart path based upon the SCPPOD node output.
b) Comparing paths in Network A to Network C (SCCPODarcsPathAtoNetC.gdb for arc elements and SCCPODnodesPathAtoNetC.gdb for node elements)
c) Comparing paths in Network B to Network A (SCCPODarcsPathBtoNetA.gdb for arc elements and SCCPODnodesPathBtoNetA.gdb for node elements)
d) Comparing paths in Network B to Network C (SCCPODarcsPathBtoNetC.gdb for arc elements and SCCPODnodesPathBtoNetC.gdb for node elements)
e) Comparing paths in Network C to Network A (SCCPODarcsPathCtoNetA.gdb for arc elements and SCCPODnodesPathCtoNetA.gdb for node elements)
f) Comparing paths in Network C to Network B (SCCPODarcsPathCtoNetB.gdb for arc elements and SCCPODnodesPathCtoNetB.gdb for node elements)
The counterpart paths that were identified were then linked to the full network C to summarize the frequency with which arcs were associated with paths. These can be found in:
1. PathARepresentationinNetC.gdb
2. PathARepresentationinNetC.gdb
Important attributes:
a) vcntarc: number of paths utilizing the arc
b) ptCnt: number of path vertices associated with each arc
c) AvgDist: average distance of path vertices from network arcs
d) MinDist: minimum distance of path vertices from network arcs
e) MaxDist: maximum distance of path vertices from network arcs
Documentation for Network Traffic Dataset
Dataset Overview
This dataset consists of network traffic captured from a Kali Linux machine, aimed at helping the development and evaluation of machine learning models for distinguishing between normal and malicious (specifically flood attack) network activities. It includes a variety of features essential for identifying potential cybersecurity threats alongside labels indicating whether each packet is part of flood traffic.
Data Collection Methodology
The dataset was carefully compiled using network traffic captured from a dedicated Kali Linux setup. The capture environment consisted of a Kali Linux machine configured to generate and capture both normal and malicious network traffic and a target machine running a Windows OS to simulate a real-world network environment.
Traffic Generation:
Normal Traffic: Involved routine network activities such as web browsing and pinging between the Kali Linux machine and the Windows machine.
Malicious Traffic: Utilized hping3 to simulate flood attacks, specifically ICMP flood attacks, targeting the Windows machine from the Kali Linux machine [1].
Capture Process: Wireshark was used on the Kali Linux machine to capture all incoming and outgoing network traffic [2]. The capture was set up to record detailed packet information, including timestamps, source and destination IP addresses, ports, and protocols. The captures were conducted with careful monitoring to precisely mark the start and end times of the flood attack for accurate dataset labeling.
Dataset Description
The dataset is a CSV file containing a comprehensive collection of network traffic packets labeled to distinguish between normal and malicious traffic. It includes the following columns:
Timestamp: The capture time of each packet, providing insights into the traffic flow and enabling analysis of traffic patterns over time.
Source IP Address: Identifies the origin of the packet, crucial for pinpointing potential sources of attacks.
Destination IP Address: Indicates the packet's intended recipient, useful for identifying targeted resources.
Source Port and Destination Port: Offer insights into the services involved in the communication.
Protocol: Specifies the protocol used, such as TCP, UDP, or ICMP, essential for analyzing the nature of the traffic.
Length: The size of the packet in bytes, which can signal unusual traffic patterns often associated with malicious activities.
bad_packet: A binary label with 1 indicating traffic identified as part of a flood attack and 0 denoting normal traffic. Precise timestamps marking the start and end of flood attacks were used to accurately label this column. Packets captured within these defined intervals were marked as malicious (bad_packet = 1), whereas all others were considered normal traffic. Python and Pandas were used for the labeling process [3][4] (see the sketch after this list).
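A hedged sketch of this interval-based labeling with Pandas follows; the file name, column name, and attack window are illustrative assumptions, not the authors' exact script:

```python
# Illustrative sketch of interval-based labeling with Pandas.
# Assumptions: a hypothetical export "capture.csv" with a parseable "Timestamp" column,
# and an attack window noted while running the hping3 flood (example values only).
import pandas as pd

df = pd.read_csv("capture.csv", parse_dates=["Timestamp"])

attack_start = pd.Timestamp("2024-03-01 14:05:00")  # noted start of the ICMP flood
attack_end = pd.Timestamp("2024-03-01 14:20:00")    # noted end of the ICMP flood

# bad_packet = 1 for packets captured inside the attack window, 0 otherwise.
df["bad_packet"] = df["Timestamp"].between(attack_start, attack_end).astype(int)

df.to_csv("labeled_capture.csv", index=False)
```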
Potential Applications
a. Intrusion Detection Systems (IDS): The dataset can be used to train models that enhance IDS capabilities, enabling more effective detection of flood-based network attacks. b. Network Traffic Monitoring: Tools that make use of machine learning can leverage the dataset for more accurate network traffic monitoring, identifying and alerting on suspicious activities in real time. c. Cybersecurity Training: Educational institutions and training programs can use the dataset to provide practical experience in machine learning-based threat detection.
Proposed Machine Learning Technique: Supervised Machine Learning, specifically Deep Learning with Convolutional Neural Networks (CNNs).
CNNs, although they are usually used for image processing, have shown promise in analyzing sequential data. The spatial hierarchy in network packets (from individual bytes to overall packet structure) can be analogous to the patterns CNNs excel at identifying. Utilizing CNNs could allow the extraction of complex patterns in network traffic that indicate malicious activity, improving detection accuracy beyond traditional methods.
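One possible realization of this idea (not the authors' implementation) is a small 1-D convolutional network over short sequences of per-packet numeric features; the input shape, features, and hyperparameters below are assumptions:

```python
# Sketch of a 1-D CNN over sequences of per-packet features (e.g. length, protocol id,
# inter-arrival time). Shapes, features, and hyperparameters are illustrative only.
import numpy as np
import tensorflow as tf

SEQ_LEN = 20      # packets per example (assumption)
N_FEATURES = 3    # e.g. length, protocol id, inter-arrival time (assumption)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN, N_FEATURES)),
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of bad_packet = 1
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy tensors with the assumed shapes, just to show the expected input/output.
X = np.random.rand(128, SEQ_LEN, N_FEATURES).astype("float32")
y = np.random.randint(0, 2, size=(128, 1))
model.fit(X, y, epochs=1, batch_size=32, verbose=0)
```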
Conclusion
This dataset represents a significant step towards using machine learning for cybersecurity, specifically in the fields of intrusion detection and network monitoring. By providing a detailed and accurately labeled dataset of normal and malicious network traffic, it lays the groundwork for developing complex models capable of identifying and mitigating flood attacks in real time. In the future, a broader range of attack types and more traffic patterns could be included, further enhancing the dataset's utility and the effectiveness of models trained on it.
References
[1] https://linux.die.net/man/8/hping3
[2] https://www.wireshark.org/docs/
[3] https://pandas.pydata.org/docs/
[4] https://docs.python.org/3/tutorial/index.html
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Helsinki Region Travel Time Matrix contains travel time and distance information for routes between all 250 m x 250 m grid cell centroids (n = 13231) in the Helsinki Region, Finland by walking, cycling, public transportation and car. The grid cells are compatible with the statistical grid cells used by Statistics Finland and the YKR (yhdyskuntarakenteen seurantajärjestelmä) data set. The Helsinki Region Travel Time Matrix is available for three different years:
The data consists of travel time and distance information of the routes that have been calculated between all statistical grid cell centroids (n = 13231) by walking, cycling, public transportation and car.
The data have been calculated for two different times of the day: 1) midday and 2) rush hour.
The data may be used freely (under Creative Commons 4.0 licence). We do not take any responsibility for any mistakes, errors or other deficiencies in the data.
Organization of data
The data have been divided into 13231 text files according to destinations of the routes. The data files have been organized into sub-folders that contain multiple (approx. 4-150) Travel Time Matrix result files. Individual folders consist of all the Travel Time Matrices that have same first four digits in their filename (e.g. 5785xxx).
In order to visualize the data on a map, the result tables can be joined with the MetropAccess YKR-grid shapefile (attached here). The data can be joined by using the field ‘from_id’ in the text files and the field ‘YKR_ID’ in MetropAccess-YKR-grid shapefile as a common key.
Data structure
The data have been divided into 13231 text files according to destinations of the routes. One file includes the routes from all statistical grid cells to a particular destination grid cell. All files have been named according to the destination grid cell code and each file includes 13231 rows.
NODATA values have been stored as value -1.
Each file consists of 18 attribute fields: 1) from_id, 2) to_id, 3) walk_t, 4) walk_d, 5) bike_f_t, 6) bike_s_t, 7) bike_d, 8) pt_r_tt, 9) pt_r_t, 10) pt_r_d, 11) pt_m_tt, 12) pt_m_t, 13) pt_m_d, 14) car_r_t, 15) car_r_d, 16) car_m_t, 17) car_m_d, 18) car_sl_t
The fields are separated by semicolon in the text files.
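A minimal sketch of reading one destination file and joining it to the YKR grid for mapping; the file and shapefile paths are placeholders and the geopandas usage is an assumption:

```python
# Sketch: read one Travel Time Matrix text file (semicolon-separated, NODATA = -1)
# and join it to the MetropAccess YKR grid. Paths below are placeholder assumptions.
import pandas as pd
import geopandas as gpd

ttm = pd.read_csv("travel_times_to_5785640.txt", sep=";", na_values=-1)

grid = gpd.read_file("MetropAccess_YKR_grid.shp")
joined = grid.merge(ttm, left_on="YKR_ID", right_on="from_id", how="left")

# e.g. inspect rush-hour public transport travel times to this destination cell
print(joined[["YKR_ID", "pt_r_t"]].head())
```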
Attributes
METHODS
For detailed documentation and how to reproduce the data, see HelsinkiRegionTravelTimeMatrix2018 GitHub repository.
THE ROUTE BY CAR have been calculated with a dedicated open source tool called DORA (DOor-to-door Routing Analyst) developed for this project. DORA uses PostgreSQL database with PostGIS extension and is based on the pgRouting toolkit. MetropAccess-Digiroad (modified from the original Digiroad data provided by Finnish Transport Agency) has been used as a street network in which the travel times of the road segments are made more realistic by adding crossroad impedances for different road classes.
The calculations have been repeated for two times of the day using 1) the “midday impedance” (i.e. travel times outside rush hour) and 2) the “rush hour impedance” as impedance in the calculations. Moreover, there is 3) the “speed limit impedance” included in the matrix (i.e. using speed limits without any additional impedances).
The whole travel chain (“door-to-door approach”) is taken into account in the calculations:
1) walking time from the real origin to the nearest network location (based on Euclidean distance),
2) average walking time from the origin to the parking lot,
3) travel time from parking lot to destination,
4) average time for searching a parking lot,
5) walking time from parking lot to nearest network location of the destination and
6) walking time from network location to the real destination (based on Euclidean distance).
THE ROUTES BY PUBLIC TRANSPORTATION have been calculated by using the MetropAccess-Reititin tool which also takes into account the whole travel chains from the origin to the destination:
1) possible waiting at home before leaving,
2) walking from home to the transit stop,
3) waiting at the transit stop,
4) travel time to next transit stop,
5) transport mode change,
6) travel time to next transit stop and
7) walking to the destination.
Travel times by public transportation have been optimized using 10 different departure times within the calculation hour, spread using a so-called Golomb ruler. The fastest route from these calculations is selected for the final travel time matrix.
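To illustrate the idea (whether this exact ruler was used in the original computation is an assumption), ten departure times can be spread over the calculation hour with a Golomb ruler, i.e. a set of marks whose pairwise differences are all distinct:

```python
# Illustration: spread 10 departure times within a one-hour window using a Golomb ruler.
# The marks below form the optimal order-10 Golomb ruler (length 55 minutes); using this
# particular ruler here is an assumption for illustration.
from datetime import datetime, timedelta
from itertools import combinations

GOLOMB_10 = [0, 1, 6, 10, 23, 26, 34, 41, 53, 55]  # minutes within the hour

# Sanity check: every pairwise difference between marks is unique.
diffs = [b - a for a, b in combinations(GOLOMB_10, 2)]
assert len(diffs) == len(set(diffs))

hour_start = datetime(2018, 1, 29, 8, 0)  # example calculation hour
for minutes in GOLOMB_10:
    print((hour_start + timedelta(minutes=minutes)).strftime("%H:%M"))
```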
THE ROUTES BY CYCLING are also calculated using the DORA tool. The network dataset underneath is MetropAccess-CyclingNetwork, which is a modified version from the original Digiroad data provided by Finnish Transport Agency. In the dataset the travel times for the road segments have been modified to be more realistic based on Strava sports application data from the Helsinki region from 2016 and the bike sharing system data from Helsinki from 2017.
For each road segment a separate speed value was calculated for slow and fast cycling. The value for fast cycling is based on the percentage difference between the segment-specific Strava speed value and the average speed value for the whole Strava dataset. The same percentage difference has been applied to calculate the slower speed value for each road segment: the speed value is the average speed of bike sharing system users multiplied by the percentage difference value.
The reference value for faster cycling has been 19 km/h, which is based on the average speed of Strava sports application users in the Helsinki region. The reference value for slower cycling has been 12 km/h, which has been the average travel speed of bike sharing system users in Helsinki. An additional 1 minute has been added to the travel time to account for the time needed for taking (30 s) and returning (30 s) the bike at the origin/destination.
More information about the Strava dataset that was used can be found in the Cycling routes and fluency report, which was published by us and the City of Helsinki.
THE ROUTES BY WALKING were also calculated using MetropAccess-Reititin by disabling all motorized transport modes in the calculation. Thus, all routes are based on the OpenStreetMap geometry.
The walking speed has been adjusted to 70 meters per minute, which is the default speed in the HSL Journey Planner (also in the calculations by public transportation).
All calculations were done using the computing resources of CSC-IT Center for Science (https://www.csc.fi/home).
This dataset provides Origin and Destination reports derived from the Automatic Number Plate Recognition (ANPR) camera traffic survey undertaken across the Cambridge area from 10th to 17th June 2017. The aim of the survey work was to help provide a firm evidence base for future Greater Cambridge Partnership decisions, by improving our understanding of how the network is being used and the impacts of vehicle use. The Origin and Destination Reports provide information on the first and last cameras triggered on vehicle journeys across the city. Please note that the maximum trip chain duration within the reports is two hours, and that vehicles travelling ‘outbound’ past an external camera site will end that particular trip chain. The ‘Taxi’ classification includes only Hackney Carriages. Please also note that these reports are preliminary and are undergoing review. The reports may be subject to change and revisions released in the fullness of time. The Greater Cambridge Partnership team welcome your feedback. Please email us on contactus@greatercambridge.org.uk. The Trip Chain Reports (available at http://opendata.cambridgeshireinsight.org.uk/dataset/greater-cambridge-a...) provide additional detail, giving the camera survey sites triggered along vehicles’ routes across the Highway network. Due to the extensive amount of data recorded, the data collected for each day has been divided into three files. Each file contains a 'summary' worksheet for the relevant day, but the data for individual cameras have been divided for each day between camera locations 1-45, 46-70 and 72-96. You can view the camera locations on the 'location plan' worksheet of each file.
This data set was acquired by the USDOT Data Capture and Management program. The purpose of the data set is to provide multi-modal data and contextual information (weather and incidents) that can be used to research and develop applications. It contains one full year (January – December 2010) of raw 30-second data for over 3,000 traffic detectors deployed along 1,250 lane miles of monitored roadway in San Diego; cleaned and geographically referenced data for over 1,500 incidents and lane closures for the two sections of I-5 that experienced the greatest number of incidents during 2010; complete trip (origin-to-destination) GPS “breadcrumbs” collected by ALK Technologies, containing latitude/longitude, vehicle heading and speed data, and time for individual in-vehicle devices updated at 3-second intervals for over 10,000 trips taken during 2010; a digital map shapefile containing ALK’s street-level network data for the San Diego metropolitan area; and San Diego weather data for 2010. This legacy dataset was created before data.transportation.gov and is only currently available via the attached file(s). Please contact the dataset owner if there is a need for users to work with this data using the data.transportation.gov analysis features (online viewing, API, graphing, etc.) and the USDOT will consider modifying the dataset to fully integrate it into data.transportation.gov.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The NetworkDataFolder includes a network dataset created to investigate wetland connectivity using a multi-criteria optimization approach. Three arc attributes represent cost associated with movement on the landscape. The digital elevation model is used to compute the topographic wetness index and cost related to elevation change. The landuse/landcover layer is the basis for calculating the likelihood of successful traversal. The built network consists of 12 wetlands serving as origin and destination, and 1277 network arcs. For each arc there are several arc attributes as described below:
ID1: reference to arc endpoint at the lower altitude
ID2: reference to arc endpoint at the higher altitude
DEM1: elevation at endpoint 1
DEM2: elevation at endpoint 2
DEMc: elevation at midpoint of arc
TWI1: topographic wetness index at endpoint 1
TWI2: topographic wetness index at endpoint 2
TWIc: topographic wetness index at midpoint of arc
LUL1: landuse/landcover type at endpoint 1
LUL2: landuse/landcover type at endpoint 2
LULc: landuse/landcover type at midpoint of arc
LULb: base successful traversal likelihood associated with each landuse/landcover type
LULf: final value of successful traversal likelihood (considering arc length)
DIST: arc length in meters
The ImplementationScript folder includes two solution approaches to identify Pareto-optimal solutions on the frontier of objective space in our multiobjective optimization model. The exact approach constructs the full efficient set and the approximate approach estimates the supported efficient solutions. The output for the two solution methods is available in the PathFolder.
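As a rough, hedged sketch of how such an arc table could be explored (a simple weighted-sum scalarization for illustration only, not the exact or approximate solution approach provided in the ImplementationScript folder; the file name, node IDs, and weights are assumptions):

```python
# Rough sketch: build a graph from the arc attributes described above and find a
# least-cost path under a weighted-sum scalarization of distance and traversal risk.
# The CSV export name, node IDs, and weights are illustrative assumptions.
import pandas as pd
import networkx as nx

arcs = pd.read_csv("network_arcs.csv")  # hypothetical export of the 1277 arcs

G = nx.Graph()
for _, a in arcs.iterrows():
    # Combine arc length (DIST) with failure likelihood (1 - LULf) into one cost.
    cost = 0.5 * a["DIST"] + 0.5 * (1.0 - a["LULf"]) * a["DIST"]
    G.add_edge(a["ID1"], a["ID2"], weight=cost)

# Least-cost route between two wetland nodes (placeholder IDs).
print(nx.shortest_path(G, source=1, target=7, weight="weight"))
```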
This dataset contains data on original and post-calibration mileposts, Traffic Message Channel location codes (TMC), Truck Travel Time Reliability Index, Travel Time Index (TTI), TMC mileage, and corridor identification segment for the destination to origin direction of the National Performance Monitoring Research Data Set (NPMRDS) network.
To generate a representative dataset of real-world traffic in ISCX we defined a set of tasks, assuring that our dataset is rich enough in diversity and quantity. We created accounts for users Alice and Bob in order to use services like Skype, Facebook, etc. Below we provide the complete list of different types of traffic and applications considered in our dataset for each traffic type (VoIP, P2P, etc.)
We captured a regular session and a session over VPN, therefore we have a total of 14 traffic categories: VOIP, VPN-VOIP, P2P, VPN-P2P, etc. We also give a detailed description of the different types of traffic generated:
Browsing: Under this label we have HTTPS traffic generated by users while browsing or performing any task that includes the use of a browser. For instance, when we captured voice-calls using hangouts, even though browsing is not the main activity, we captured several browsing flows.
Email: The traffic samples were generated using a Thunderbird client and Alice's and Bob's Gmail accounts. The clients were configured to deliver mail through SMTP/S, and receive it using POP3/SSL in one client and IMAP/SSL in the other.
Chat: The chat label identifies instant-messaging applications. Under this label we have Facebook and Hangouts via web browsers, Skype, and AIM and ICQ using an application called Pidgin [14].
Streaming: The streaming label identifies multimedia applications that require a continuous and steady stream of data. We captured traffic from Youtube (HTML5 and flash versions) and Vimeo services using Chrome and Firefox.
File Transfer: This label identifies traffic applications whose main purpose is to send or receive files and documents. For our dataset we captured Skype file transfers, FTP over SSH (SFTP) and FTP over SSL (FTPS) traffic sessions.
VoIP: The Voice over IP label groups all traffic generated by voice applications. Within this label we captured voice calls using Facebook, Hangouts and Skype.
P2P: This label is used to identify file-sharing protocols like BitTorrent. To generate this traffic we downloaded different .torrent files from a public repository and captured traffic sessions using the uTorrent and Transmission applications.
The traffic was captured using Wireshark and tcpdump, generating a total amount of 28GB of data. For the VPN, we used an external VPN service provider and connected to it using OpenVPN (UDP mode). To generate SFTP and FTPS traffic we also used an external service provider and Filezilla as a client.
To facilitate the labeling process, when capturing the traffic all unnecessary services and applications were closed. (The only application executed was the objective of the capture, e.g., Skype voice-call, SFTP file transfer, etc.) We used a filter to capture only the packets with source or destination IP, the address of the local client (Alice or Bob).
The full research paper outlining the details of the dataset and its underlying principles:
Gerard Draper-Gil, Arash Habibi Lashkari, Mohammad Mamun, Ali A. Ghorbani, "Characterization of Encrypted and VPN Traffic Using Time-Related Features", In Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), pages 407-414, Rome, Italy.
ISCXFlowMeter has been written in Java for reading the pcap files and creating the csv file based on selected features. The UNB ISCX Network Traffic (VPN-nonVPN) dataset consists of labeled network traffic, including full packets in pcap format and csv files (flows generated by ISCXFlowMeter), both of which are publicly available for researchers.
For more information contact cic@unb.ca.
The UNB ISCX Network Traffic Dataset content
Traffic: Content
Web Browsing: Firefox and Chrome
Email: SMTPS, POP3S and IMAPS
Chat: ICQ, AIM, Skype, Facebook and Hangouts
Streaming: Vimeo and Youtube
File Transfer: Skype, FTPS and SFTP using Filezilla and an external service
VoIP: Facebook, Skype and Hangouts voice calls (1h duration)
P2P: uTorrent and Transmission (Bittorrent)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record provides a dataset created as part of the study presented in the following publication and is made publicly available for research purposes. The associated article provides a comprehensive description of the dataset, its structure, and the methodology used in its creation. If you use this dataset, please cite the following article published in the journal IEEE Communications Magazine:
A. Karamchandani, J. Nunez, L. de-la-Cal, Y. Moreno, A. Mozo, and A. Pastor, “On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination,” IEEE Communications Magazine, pp. 2–8, 2025, DOI: 10.1109/MCOM.003.2400648.
More specifically, the record contains several synthetic datasets generated to differentiate between benign and malicious heavy hitter flows within a realistic virtualized network environment. Heavy Hitter flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate heavy hitter activity from malicious Distributed Denial-of-Service traffic is critical for network management and security, yet existing datasets lack the granularity needed for training machine learning models to effectively make this distinction.
To address this, a Network Digital Twin (NDT) approach was utilized to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.
The feature set includes the following flow statistics commonly used in the literature on network traffic classification:
To accommodate diverse research needs and scenarios, the dataset is provided in the following variations:
All at Once:
Balanced Traffic Generation:
DDoS at Intervals:
Only Benign HH Traffic:
Only DDoS Traffic:
Only Normal Traffic:
Unbalanced Traffic Generation:
For each variation, the output of the different packet aggregators is provided separately in its respective folder.
Each variation was generated using the NDT approach to demonstrate its flexibility and ensure the reproducibility of our study's experiments, while also contributing to future research on network traffic patterns and the detection and classification of heavy hitter traffic flows. The dataset is designed to support research in network security, machine learning model development, and applications of digital twin technology.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Network traffic datasets with novel extended IP flow called NetTiSA flow
Datasets were created for the paper: NetTiSA: Extended IP Flow with Time-series Features for Universal Bandwidth-constrained High-speed Network Traffic Classification -- Josef Koumar, Karel Hynek, Jaroslav Pešek, Tomáš Čejka -- which is published in The International Journal of Computer and Telecommunications Networking https://doi.org/10.1016/j.comnet.2023.110147
Please cite the usage of our datasets as:
Josef Koumar, Karel Hynek, Jaroslav Pešek, Tomáš Čejka, "NetTiSA: Extended IP flow with time-series features for universal bandwidth-constrained high-speed network traffic classification", Computer Networks, Volume 240, 2024, 110147, ISSN 1389-1286
@article{KOUMAR2024110147, title = {NetTiSA: Extended IP flow with time-series features for universal bandwidth-constrained high-speed network traffic classification}, journal = {Computer Networks}, volume = {240}, pages = {110147}, year = {2024}, issn = {1389-1286}, doi = {https://doi.org/10.1016/j.comnet.2023.110147}, url = {https://www.sciencedirect.com/science/article/pii/S1389128623005923}, author = {Josef Koumar and Karel Hynek and Jaroslav Pešek and Tomáš Čejka} }
This Zenodo repository contains 23 datasets created from 15 well-known published datasets, which are cited in the table below. Each dataset contains the NetTiSA flow feature vector.
NetTiSA flow feature vector
The novel extended IP flow called NetTiSA (Network Time Series Analysed) flow contains a universal bandwidth-constrained feature vector consisting of 20 features. We divide the NetTiSA flow classification features into three groups by computation. The first group of features is based on classical bidirectional flow information: the number of transferred bytes and packets. The second group contains statistical and time-based features calculated using time-series analysis of the packet sequences. The third group of features can be computed from the previous two groups (i.e., on the flow collector) and improves the classification performance without any impact on the telemetry bandwidth.
Flow features
The flow features are:
Statistical and Time-based features
These are the features exported in the extended part of the flow. All of them can be computed (exactly or approximately) by stream-wise computation, which is necessary for keeping memory requirements low. This second feature set contains the following features:
where \(s_n\) is the number of switches.
Features computed at the collector
The third set contains features that are computed from the previous two groups prior to classification. Therefore, they do not influence the network telemetry size and their computation does not put additional load to resource-constrained flow monitoring probes. The NetTiSA flow combined with this feature set is called the Enhanced NetTiSA flow and contains the following features:
The NetTiSA flow is implemented into IP flow exporter ipfixprobe.
Description of dataset files
In the following table is a description of each dataset file:
File name | Detection problem | Citation of the original raw dataset |
botnet_binary.csv | Binary detection of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
botnet_multiclass.csv | Multi-class classification of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
cryptomining_design.csv | Binary detection of cryptomining; the design part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
cryptomining_evaluation.csv | Binary detection of cryptomining; the evaluation part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
dns_malware.csv | Binary detection of malware DNS | Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021. |
doh_cic.csv | Binary detection of DoH | Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020 |
doh_real_world.csv | Binary detection of DoH | Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022 |
dos.csv | Binary detection of DoS | Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019. |
edge_iiot_binary.csv | Binary detection of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
edge_iiot_multiclass.csv | Multi-class classification of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
This dataset comprises NetFlow records, which capture the outbound network traffic of 8 commercial IoT devices and 5 non-IoT devices, collected during a period of 37 days in a lab at Ben-Gurion University of the Negev. The dataset was collected in order to develop a method for telecommunication providers to detect vulnerable IoT models behind home NATs. Each NetFlow record is labeled with the device model which produced it; for research reproducibility, each NetFlow is also allocated to either the "training" or "test" set, in accordance with the partitioning described in:
Y. Meidan, V. Sachidananda, H. Peng, R. Sagron, Y. Elovici, and A. Shabtai, A novel approach for detecting vulnerable IoT devices connected behind a home NAT, Computers & Security, Volume 97, 2020, 101968, ISSN 0167-4048, https://doi.org/10.1016/j.cose.2020.101968. (http://www.sciencedirect.com/science/article/pii/S0167404820302418)
Please note:
# NetFlow features, used in the related paper for analysis
'FIRST_SWITCHED': System uptime at which the first packet of this flow was switched
'IN_BYTES': Incoming counter for the number of bytes associated with an IP Flow
'IN_PKTS': Incoming counter for the number of packets associated with an IP Flow
'IPV4_DST_ADDR': IPv4 destination address
'L4_DST_PORT': TCP/UDP destination port number
'L4_SRC_PORT': TCP/UDP source port number
'LAST_SWITCHED': System uptime at which the last packet of this flow was switched
'PROTOCOL': IP protocol byte (6: TCP, 17: UDP)
'SRC_TOS': Type of Service byte setting when there is an incoming interface
'TCP_FLAGS': Cumulative of all the TCP flags seen for this flow
# Features added by the authors
'IP': Prefix of the destination IP address, representing the network (without the host)
'DURATION': Time (seconds) between first/last packet switching
# Label
'device_model':
# Partition
'partition': Training or test
# Additional NetFlow features (mostly zero-variance)
'SRC_AS': Source BGP autonomous system number
'DST_AS': Destination BGP autonomous system number
'INPUT_SNMP': Input interface index
'OUTPUT_SNMP': Output interface index
'IPV4_SRC_ADDR': IPv4 source address
'MAC': MAC address of the source
# Additional data
'category': IoT or non-IoT
'type': IoT, access_point, smartphone, laptop
'date': Datepart of FIRST_SWITCHED
'inter_arrival_time': Time (seconds) between successive flows of the same device (identified by its MAC address)
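A brief sketch of recomputing the two derived fields (DURATION and inter_arrival_time) from the raw NetFlow columns with Pandas; the file name is a placeholder and it is assumed the exported FIRST_SWITCHED/LAST_SWITCHED values are parseable timestamps:

```python
# Sketch: recompute DURATION and inter_arrival_time from the NetFlow columns above.
# Assumptions: a placeholder file name and timestamp-parseable FIRST_SWITCHED/LAST_SWITCHED.
import pandas as pd

flows = pd.read_csv("netflows.csv", parse_dates=["FIRST_SWITCHED", "LAST_SWITCHED"])

# Time (seconds) between first and last packet switching of each flow.
flows["DURATION"] = (flows["LAST_SWITCHED"] - flows["FIRST_SWITCHED"]).dt.total_seconds()

# Time (seconds) between successive flows of the same device, identified by MAC address.
flows = flows.sort_values(["MAC", "FIRST_SWITCHED"])
flows["inter_arrival_time"] = (
    flows.groupby("MAC")["FIRST_SWITCHED"].diff().dt.total_seconds()
)
print(flows[["MAC", "DURATION", "inter_arrival_time"]].head())
```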
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are publishing a dataset we created for designing a brute-force detector of attacks on HTTPS. The dataset consists of extended network flows that we captured with the flow exporter ipfixprobe. Apart from traditional fields like source and destination IP addresses and ports, each flow contains information (size, direction, inter-packet time, TCP flags) about up to the first 100 packets. The sizes of packets are taken from the transport layer (TCP, UDP); packets with zero payload (e.g., TCP ACKs) are ignored.
We publish three files:
flows.csv, which contains raw flow data.
aggregated_flows.csv, which contains aggregated flows.
samples.csv, which contains samples with extracted features. This data can be used for training a machine-learning classification model.
All IP addresses, source ports, and TLS SNIs are SHA-256 hashed. The column CLASS is 0 for benign samples and 1 for brute-force samples.
Brute-force data
The brute-force data were generated with three popular attack tools: Ncrack, THC-Hydra, and Patator. Attacks were performed against these applications:
WordPress
Joomla
MediaWiki
Ghost
Grafana
Discourse
PhpBB
OpenCart
Redmine
Nginx
Apache
The SCENARIO columns indicate which tool and application were used to generate the sample.
Benign data
The benign data consist of eight captures from a backbone network. The SCENARIO column indicates individual captures.
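As a minimal sketch, samples.csv could be used to train a binary detector; only the CLASS label (0 = benign, 1 = brute-force) is taken from the description above, and the remaining numeric columns are simply treated as features:

```python
# Minimal sketch: train a brute-force detector on samples.csv.
# Assumption: besides CLASS, the extracted feature columns are numeric; non-numeric
# columns (e.g. hashed identifiers, SCENARIO) are dropped.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

samples = pd.read_csv("samples.csv")

y = samples["CLASS"]
X = samples.drop(columns=["CLASS"]).select_dtypes("number")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
clf = GradientBoostingClassifier().fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))
```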
Measure and Map Access to Grocery Stores
From the perspective of the people living in each neighborhood
How do people in your city get to the grocery store? The answer to that question depends on the person and where they live. This collection of layers, maps and apps help answer the question.
Some live in cities and stop by a grocery store within a short walk or bike ride of home or work. Others live in areas where car ownership is more prevalent, and so they drive to a store. Some do not own a vehicle, and rely on a friend or public transit. Others rely on grocery delivery for their needs. And, many live in rural areas far from town, so a trip to a grocery store is an infrequent event involving a long drive.
This map from Esri shows which areas are within a ten minute walk (in green) or ten minute drive (in blue) of a grocery store in the United States and Puerto Rico. Darker color indicates access to more stores. Summarizing this data shows that 20% of U.S. population live within a 10 minute walk of a grocery store, and 90% of the population live within a 10 minute drive of a grocery store. Click on the map to see a summary for each state.
Every census block is scored with a count of walkable and drivable stores nearby, making this a map suitable for a dashboard for any city, or any of the 50 states, DC and Puerto Rico. Two colorful layers visualize this definition of access, one for walkable access (suitable for looking at a city neighborhood by neighborhood) and one for drivable access (suitable for looking across a city, county, region or state).
On the walkable layer, shades of green define areas within a ten minute walk of one or more grocery stores. The colors become more intense and trend to a blue-green color for the busiest neighborhoods, such as downtown San Francisco. As you zoom in, a layer of Census block points visualizes the local population with or without walkable access. As you zoom out to see the entire city, the map adds a light blue to dark blue layer, showing which parts of the region fall within ten minutes' drive of one or more grocery stores. As a result, the map is useful at all scales, from national to regional, state and local levels. It becomes easier to spot grocery stores that sit within a highly populated area, and grocery stores that sit in a shopping center far away from populated areas. This view of a city begins to hint at the question: how many people have each type of access to grocery stores? And, what if they are unable to walk a mile regularly, or don't own a car?
How to Use This Map
Use this map to introduce the concepts of access to grocery stores in your city or town. This is the kind of map where people will want to look up their home or work address to validate what the map is saying against their own experiences. The map was built with that use in mind. Many maps of access use straight-line, as-the-crow-flies distance, which ignores real-world barriers to walkability like rivers, lakes, interstates and other characteristics of the built environment. Block analysis using a network data set and Origin-Destination analysis factors these barriers in, resulting in a more realistic depiction of access. There is data behind the map, which can be summarized to show how many people have walkable access to local grocery stores. The map includes a feature layer of population in Census block points, which are visible when you zoom in far enough.
This feature layer of Census block centroids can be plugged into an app like this one that summarizes the population with/without walkable or drivable access. Lastly, this map can serve as backdrop to other community resources, like food banks, farmers markets (example), and transit (example). Add a transit layer to immediately gauge its impact on the population's grocery access. You can also use this map to see how it relates to communities of concern. Add a layer of any block group or tract demographics, such as Percent Senior Population (examples), or Percent of Households with Access to 0 Vehicles (examples). The map is a useful visual and analytic resource for helping community leaders, business and government leaders see their town from the perspective of its residents, and begin asking questions about how their community could be improved.
Data sources
Population data is from the 2020 U.S. Census blocks. Each census block has a count of stores within a 10 minute walk, and a count of stores within a ten minute drive. Census blocks known to be unpopulated are given a score of 0. The layer is available as a hosted feature layer. Grocery store locations are from SafeGraph, reflecting what was in the data as of September 2024. For this project, ArcGIS StreetMap Premium was used for the street network in the origin-destination analysis work, because it has the necessary attributes on each street segment to identify which streets are considered walkable, and supports a wide variety of driving parameters. The walkable access layer and drivable access layers are rasters, whose colors were chosen to allow the drivable access layer to serve as backdrop to the walkable access layer.
Data Preparation
ArcGIS Network Analyst was used to set up a network street layer for analysis. ArcGIS StreetMap Premium was installed to a local hard drive and selected in the Origin-Destination workflow as the network data source. This allows the origins (Census block centroids) and destinations (SafeGraph grocery stores) to be connected to that network, to allow origin-destination analysis. The Census blocks layer contains the centroid of each Census block. The data allows a simple popup to be created. This layer's block figures can be summarized further, to tract, county and state levels. The SafeGraph grocery store locations were provided by SafeGraph. The source data included NAICS code 445110 and 452311 as an initial screening. The CSV file was imported using the Data Interoperability geoprocessing tools in ArcGIS Pro, where a definition query was applied to the layer to exclude any records that were not grocery stores. The final layer used in the analysis had approximately 63,000 records. In this map, this layer is included as a vector tile layer.
Methodology
Every census block in the U.S. was assigned two access scores, whose numbers are simply how many grocery stores are within a 10 minute walk and a 10 minute drive of that census block. Every census block has a score of 0 (no stores), 1, 2 or more stores. The count of accessible stores was determined using Origin-Destination Analysis in ArcGIS Network Analyst, in ArcGIS Pro. A set of Tools in this ArcGIS Pro package allow a similar analysis to be conducted for any city or other area. The Tools step through the data prep and analysis steps. Download the Pro package, open it and substitute your own layers for Origins and Destinations. Parcel centroids are a suggested option for Origins, for example.
Origin-Destination analysis was configured, using ArcGIS StreetMap Premium as the network data source. Census block centroids with population greater than zero were used as the Origins, and grocery store locations were used as the Destinations. A cutoff of 10 minutes was used with the Walk Time option. Only one restriction was applied to the street network: Walkable, which means Interstates and other non-walkable street segments were treated appropriately. You see the results in the map: wherever freeway overpasses and underpasses are present near a grocery store, the walkable area extends across/through that pass, but not along the freeway. A cutoff of 10 minutes was used with the Drive Time option. The default restrictions were applied to the street network, which means a typical vehicle's access to all types of roads was factored in. The results for each analysis were captured in a Lines layer, which shows which origins are within the 10 minute cutoff of each destination over the street network, given the assumptions about that network (walking, or driving a vehicle). The Lines layer is not published but is used to count how many stores each origin has access to over the road network. The Lines layer was then summarized by census block ID to capture the Maximum value of the Destination_Rank field. A census block within 10 minutes of 3 stores would have 3 records in the Lines layer, but only one value in the summarized table, with a MAX_Destination_Rank field value of 3. This is the number of stores accessible to that census block in the 10 minutes measured, for walking and driving. These data were joined to the block centroids layer and given unique names. At this point, all blocks with zero population or null values in the MAX_Destination_Rank fields were given a store count of 0, to help the next step. Walkable and Drivable areas are calculated into a raster layer, using the Nearest Neighbor geoprocessing tool on the count of stores within a 10 minute walk, and a count of stores within a ten minute drive, respectively. This tool used a 100 meter grid and interpolates the values between each census block. A census tracts layer containing all water polygons "erased" from the census tract boundaries was used as an environment setting, to help constrain interpolation into/across bodies of water. The same layer was used to "shoreline" the Nearest Neighbor results, to eliminate any interpolation into the ocean or Great Lakes. This helped but was not perfect.
Notes and Limitations
The map provides a baseline for discussing access to grocery stores in a city. It does not presume local population has the desire or means to walk or drive to obtain groceries. It does not take elevation gain or loss into account. It does not factor time of day nor weather, seasons, or other variables that affect a person's commute choices. Walking and driving are just two ways people get to a grocery store. Some people ride a bike, others take public transit, have groceries delivered, or rely on a friend with a vehicle.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please refer to the original data article for further data description: Jan Luxemburk et al. CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines, Data in Brief, 2023, 108888, ISSN 2352-3409, https://doi.org/10.1016/j.dib.2023.108888.
We recommend using the CESNET DataZoo python library, which facilitates the work with large network traffic datasets. More information about the DataZoo project can be found in the GitHub repository https://github.com/CESNET/cesnet-datazoo.
The QUIC (Quick UDP Internet Connection) protocol has the potential to replace TLS over TCP, which is the standard choice for reliable and secure Internet communication. Due to its design that makes the inspection of QUIC handshakes challenging and its usage in HTTP/3, there is an increasing demand for research in QUIC traffic analysis. This dataset contains one month of QUIC traffic collected in an ISP backbone network, which connects 500 large institutions and serves around half a million people. The data are delivered as enriched flows that can be useful for various network monitoring tasks. The provided server names and packet-level information allow research in the encrypted traffic classification area. Moreover, included QUIC versions and user agents (smartphone, web browser, and operating system identifiers) provide information for large-scale QUIC deployment studies.
Data capture
The data was captured in the flow monitoring infrastructure of the CESNET2 network. The capturing was done for four weeks between 31.10.2022 and 27.11.2022. The following list provides per-week flow count, capture period, and uncompressed size:
W-2022-44: Uncompressed Size: 19 GB; Capture Period: 31.10.2022 - 6.11.2022; Number of flows: 32.6M
W-2022-45: Uncompressed Size: 25 GB; Capture Period: 7.11.2022 - 13.11.2022; Number of flows: 42.6M
W-2022-46: Uncompressed Size: 20 GB; Capture Period: 14.11.2022 - 20.11.2022; Number of flows: 33.7M
W-2022-47: Uncompressed Size: 25 GB; Capture Period: 21.11.2022 - 27.11.2022; Number of flows: 44.1M
CESNET-QUIC22 (total): Uncompressed Size: 89 GB; Capture Period: 31.10.2022 - 27.11.2022; Number of flows: 153M
Data description
The dataset consists of network flows describing encrypted QUIC communications. Flows were created using the ipfixprobe flow exporter and are extended with packet metadata sequences, packet histograms, and with fields extracted from the QUIC Initial Packet, which is the first packet of the QUIC connection handshake. The extracted handshake fields are the Server Name Indication (SNI) domain, the used version of the QUIC protocol, and the user agent string that is available in a subset of QUIC communications.
Packet Sequences
Flows in the dataset are extended with sequences of packet sizes, directions, and inter-packet times. For the packet sizes, we consider payload size after transport headers (UDP headers for the QUIC case). Packet directions are encoded as ±1, +1 meaning a packet sent from client to server, and -1 a packet from server to client. Inter-packet times depend on the location of communicating hosts, their distance, and on the network conditions on the path. However, it is still possible to extract relevant information that correlates with user interactions and, for example, with the time required for an API/server/database to process the received data and generate the response to be sent in the next packet. Packet metadata sequences have a length of 30, which is the default setting of the used flow exporter. We also derive three fields from each packet sequence: its length, time duration, and the number of roundtrips. The roundtrips are counted as the number of changes in the communication direction (from packet directions data); in other words, each client request and server response pair counts as one roundtrip.
Flow statistics
Flows also include standard flow statistics, which represent aggregated information about the entire bidirectional flow. The fields are: the number of transmitted bytes and packets in both directions, the duration of flow, and packet histograms. Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow in both directions (more information in the PHISTS plugin documentation). There are eight bins with a logarithmic scale; the intervals are 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. Moreover, each flow has its end reason - either it was idle, reached the active timeout, or ended due to other reasons. This corresponds with the official IANA IPFIX-specified values. The FLOW_ENDREASON_OTHER field represents the forced end and lack of resources reasons. The end of flow detected reason is not considered because it is not relevant for UDP connections.
Dataset structure
The dataset flows are delivered in compressed CSV files. CSV files contain one flow per row; data columns are summarized in the provided list below. For each flow data file, there is a JSON file with the number of saved and seen (before sampling) flows per service and total counts of all received (observed on the CESNET2 network), service (belonging to one of the dataset's services), and saved (provided in the dataset) flows. There is also the stats-week.json file aggregating flow counts of a whole week and the stats-dataset.json file aggregating flow counts for the entire dataset. Flow counts before sampling can be used to compute sampling ratios of individual services and to resample the dataset back to the original service distribution.
Moreover, various dataset statistics, such as feature distributions and value counts of QUIC versions and user agents, are provided in the dataset-statistics folder. The mapping between services and service providers is provided in the servicemap.csv file, which also includes SNI domains used for ground truth labeling. The following list describes flow data fields in CSV files:
ID: Unique identifier
SRC_IP: Source IP address
DST_IP: Destination IP address
DST_ASN: Destination Autonomous System number
SRC_PORT: Source port
DST_PORT: Destination port
PROTOCOL: Transport protocol
QUIC_VERSION: QUIC protocol version
QUIC_SNI: Server Name Indication domain
QUIC_USER_AGENT: User agent string, if available in the QUIC Initial Packet
TIME_FIRST: Timestamp of the first packet in format YYYY-MM-DDTHH-MM-SS.ffffff
TIME_LAST: Timestamp of the last packet in format YYYY-MM-DDTHH-MM-SS.ffffff
DURATION: Duration of the flow in seconds
BYTES: Number of transmitted bytes from client to server
BYTES_REV: Number of transmitted bytes from server to client
PACKETS: Number of packets transmitted from client to server
PACKETS_REV: Number of packets transmitted from server to client
PPI: Packet metadata sequence in the format: [[inter-packet times], [packet directions], [packet sizes]] (see the parsing sketch after this list)
PPI_LEN: Number of packets in the PPI sequence
PPI_DURATION: Duration of the PPI sequence in seconds
PPI_ROUNDTRIPS: Number of roundtrips in the PPI sequence
PHIST_SRC_SIZES: Histogram of packet sizes from client to server
PHIST_DST_SIZES: Histogram of packet sizes from server to client
PHIST_SRC_IPT: Histogram of inter-packet times from client to server
PHIST_DST_IPT: Histogram of inter-packet times from server to client
APP: Web service label
CATEGORY: Service category
FLOW_ENDREASON_IDLE: Flow was terminated because it was idle
FLOW_ENDREASON_ACTIVE: Flow was terminated because it reached the active timeout
FLOW_ENDREASON_OTHER: Flow was terminated for other reasons
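A small sketch of parsing the PPI column and recounting direction changes (the CSV file name is a placeholder; the CESNET DataZoo library mentioned above handles this parsing in practice, so this is only an illustration of the format):

```python
# Sketch: parse the PPI column ([[inter-packet times], [directions], [sizes]]) and
# count direction changes, one plausible reading of the roundtrip definition above.
# The file name is a placeholder; CESNET DataZoo automates this in practice.
import ast
import pandas as pd

flows = pd.read_csv("quic_flows_sample.csv")  # hypothetical CSV chunk from one week

def direction_changes(ppi_str: str) -> int:
    inter_packet_times, directions, sizes = ast.literal_eval(ppi_str)
    return sum(1 for a, b in zip(directions, directions[1:]) if a != b)

flows["direction_changes"] = flows["PPI"].apply(direction_changes)
print(flows[["APP", "PPI_LEN", "PPI_ROUNDTRIPS", "direction_changes"]].head())
```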
Link to other CESNET datasets
https://www.liberouter.org/technology-v2/tools-services-datasets/datasets/
https://github.com/CESNET/cesnet-datazoo
Please cite the original data article:
@article{CESNETQUIC22,
  author  = {Jan Luxemburk and Karel Hynek and Tomáš Čejka and Andrej Lukačovič and Pavel Šiška},
  title   = {CESNET-QUIC22: a large one-month QUIC network traffic dataset from backbone lines},
  journal = {Data in Brief},
  pages   = {108888},
  year    = {2023},
  issn    = {2352-3409},
  doi     = {https://doi.org/10.1016/j.dib.2023.108888},
  url     = {https://www.sciencedirect.com/science/article/pii/S2352340923000069}
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To cite the dataset, please reference it as: “Stratosphere Laboratory. A labeled dataset with malicious and benign IoT network traffic. January 22nd. Agustin Parmisano, Sebastian Garcia, Maria Jose Erquiaga. https://www.stratosphereips.org/datasets-iot23”.
This dataset includes labels that describe the relationships between flows associated with malicious or potentially malicious activity, giving network malware researchers and analysts more thorough information. These labels were carefully created at the Stratosphere labs through analysis of the malware captures.
We present a concise explanation of the labels used for the identification of malicious flows, based on manual network analysis, below:
Attack: This label signifies the occurrence of an attack originating from an infected device directed towards another host. Any flow that endeavors to exploit a vulnerable service, discerned through payload and behavioral analysis, falls under this classification. Examples include brute force attempts on telnet logins or header-based command injections in GET requests.
Benign: The "Benign" label denotes connections where no suspicious or malicious activities have been detected.
C&C (Command and Control): This label indicates that the infected device has established a connection with a Command and Control server. This observation is rooted in the periodic nature of connections or activities such as binary downloads or the exchange of IRC-like or decoded commands.
DDoS (Distributed Denial of Service): "DDoS" is assigned when the infected device is actively involved in a Distributed Denial of Service attack, identifiable by the volume of flows directed towards a single IP address.
FileDownload: This label signifies that a file is being downloaded to the infected device. It is determined by examining connections with response bytes exceeding a specified threshold (typically 3KB or 5KB), often in conjunction with known suspicious destination ports or IPs associated with Command and Control servers.
HeartBeat: "HeartBeat" designates connections where packets serve the purpose of tracking the infected host by the Command and Control server. Such connections are identified through response bytes below a certain threshold (typically 1B) and exhibit periodic similarities. This is often associated with known suspicious destination ports or IPs linked to Command and Control servers.
Mirai: This label is applied when connections exhibit characteristics resembling those of the Mirai botnet, based on patterns consistent with common Mirai attack profiles.
Okiru: Similar to "Mirai," the "Okiru" label is assigned to connections displaying characteristics of the Okiru botnet. The parameters for this label are the same as for Mirai, but Okiru is a less prevalent botnet family.
PartOfAHorizontalPortScan: This label is employed when connections are involved in a horizontal port scan aimed at gathering information for potential subsequent attacks. The labeling decision hinges on patterns such as shared ports, similar transmitted byte counts, and multiple distinct destination IPs among the connections (a heuristic sketch follows this label list).
Torii: The "Torii" label is used when connections exhibit traits indicative of the Torii botnet, with labeling criteria similar to those used for Mirai, albeit in the context of a less common botnet family.
Field Name | Description | Type |
---|---|---|
ts | The timestamp of the connection event. | time |
uid | A unique identifier for the connection. | string |
id.orig_h | The source IP address. | addr |
id.orig_p | The source port. | port |
id.resp_h | The destination IP address. | addr |
id.resp_p | The destination port. | port |
proto | The network protocol used (e.g., 'tcp'). | enum |
service | The service associated with the connection. | string |
duration | The duration of the connection. | interval |
orig_bytes | The number of bytes sent from the source to the destination. | count |
resp_bytes | The number of bytes sent from the destination to the source. | count |
conn_state | The state of the connection. | string |
local_orig | Indicates whether the connection is considered local or not. | bool |
local_resp | Indicates whether the connection is considered... |
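To make the field table above concrete, here is a minimal loading sketch. It assumes standard tab-separated Zeek conn.log output with '#'-prefixed metadata lines and takes the column names from the file's own '#fields' header; the filename is hypothetical, and if the labeled files deviate from plain tab separation the parsing will need adjusting.

```python
# Minimal sketch: load a Zeek conn.log-style file into pandas.
import pandas as pd

def load_conn_log(path):
    """Load a tab-separated Zeek conn.log, taking column names from its '#fields' line."""
    with open(path) as fh:
        fields = None
        for line in fh:
            if line.startswith("#fields"):
                fields = line.rstrip("\n").split("\t")[1:]
                break
    if fields is None:
        raise ValueError("no '#fields' header found; is this a Zeek log?")
    return pd.read_csv(path, sep="\t", comment="#", header=None,
                       names=fields, na_values="-", low_memory=False)

df = load_conn_log("conn.log.labeled")   # hypothetical filename
print(df["proto"].value_counts())
```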
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here, we provide the research community with a data set of the buffering delays that data packets experience at the TCP sending side in the realm of Cyber-Physical Systems (CPSs). We focus on the buffering that occurs at the sender side due to the adverse interaction between the Nagle algorithm and the delayed acknowledgement algorithm, both of which were originally introduced into TCP to prevent sending many small packets over the network.
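For readers unfamiliar with the mechanism, the sketch below (not part of the dataset) shows the usual application-level mitigation: disabling Nagle's algorithm with the TCP_NODELAY socket option so that small writes are sent immediately instead of being held back while waiting for an acknowledgement. The endpoint is a placeholder.

```python
# Illustrative only: disable Nagle's algorithm on a TCP socket via TCP_NODELAY.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # send small writes immediately
sock.connect(("127.0.0.1", 9000))        # placeholder endpoint
sock.sendall(b"small CPS update\n")      # no longer buffered by Nagle's algorithm
sock.close()
```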
The data set is collected using four real-life operating systems: Windows, Linux, FreeBSD, and QNX (a real-time operating system). In each scenario, there are three separate (virtual) machines running different operating systems. One machine, an end host, acts as the data source, another acts as the data sink, and a third acts as a network emulator that introduces artificial propagation delays between the source and the destination.
To measure the buffering delay at the sender side, we record two time instants for each sent packet: when the packet is first generated at the application layer, and when it is actually sent on the physical network. In each case, 10 independent experiment replications/runs are executed.
Here, we provide the full distribution of all delay samples represented by the cumulative distribution function (CDF).
The data presented here gives an impression of the amount and scale of the delay occurring at the sender side in TCP. More importantly, the data can be used to investigate to what degree these delays affect the performance of cyber-physical systems or other real-time applications that employ TCP.
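As a hedged sketch of the measurement described above, the snippet below computes per-packet buffering delay as the wire-send timestamp minus the application-generation timestamp and then builds an empirical CDF. The variable names and the tiny example timestamps are hypothetical; the dataset ships the real CDFs.

```python
# Sketch: per-packet sender-side buffering delay and its empirical CDF.
import numpy as np

def empirical_cdf(samples):
    """Return (sorted_values, cumulative_probabilities) for the sample set."""
    x = np.sort(np.asarray(samples, dtype=float))
    p = np.arange(1, len(x) + 1) / len(x)
    return x, p

t_app = np.array([0.000, 0.010, 0.020, 0.030])   # application-layer timestamps (s), made up
t_wire = np.array([0.001, 0.052, 0.021, 0.075])  # physical-send timestamps (s), made up
delays = t_wire - t_app                          # buffering delay per packet

x, p = empirical_cdf(delays)
for value, prob in zip(x, p):
    print(f"P(delay <= {value * 1e3:.1f} ms) = {prob:.2f}")
```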
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In a city-scale network, trips are made in thousands of origin-destination (OD) pairs connected by multiple routes, resulting in a large number of alternatives with diverse characteristics that influence the route choice behaviour of the travellers. As a consequence, to accurately predict user choices at full network scale, a route choice model should be scalable to suit all possible configurations that may be encountered. In this article, a new methodology to obtain such a model is proposed. The main idea is to use clustering analysis to obtain a small set of representative OD pairs and routes that can be investigated in detail through computer route choice experiments to collect observations on travellers' behaviour. The results are then scaled up to all other OD pairs in the network. It was found that 9 OD pair configurations are sufficient to represent the network of Lyon, France, composed of 96,096 OD pairs and 559,423 routes. The observations, collected over these nine representative OD pair configurations, were used to estimate three mixed logit models. The predictive accuracy of the three models was tested against the predictive accuracy of the same models (with the same specification) estimated over randomly selected OD pair configurations. The results show that the models estimated with the representative OD pairs achieve superior predictive accuracy, which supports scaling up the participants' choices over the representative OD pair configurations to the entire network and validates the methodology proposed in this study.
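The sketch below illustrates only the general clustering idea described above: describe each OD pair by a few numeric features, cluster into 9 groups, and keep the member closest to each cluster centre as a representative configuration. The features and data are placeholders, not the study's actual variables or procedure.

```python
# Hypothetical sketch of selecting representative OD pairs via clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# columns: number of alternative routes, mean route length, length spread (all made up)
od_features = rng.random((1000, 3))

X = StandardScaler().fit_transform(od_features)
kmeans = KMeans(n_clusters=9, n_init=10, random_state=0).fit(X)

representatives = []
for k in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == k)[0]
    dists = np.linalg.norm(X[members] - kmeans.cluster_centers_[k], axis=1)
    representatives.append(members[np.argmin(dists)])  # OD pair nearest the centroid
print(representatives)  # indices of the 9 representative OD pairs
```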
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General Overview:
This dataset was synthetically generated to support research on crowd management, predictive risk analysis, and smart path optimization, specifically for the scenario of the Arbaeen pilgrimage in Iraq. The data simulates various environmental, behavioral, and geographical factors to create a realistic environment for testing and validating crowd analysis models. The primary goal of this dataset is to provide a basis for developing and evaluating intelligent systems that can enhance the safety and efficiency of crowd movements during large-scale pilgrimages. This dataset was specifically generated to support the findings of our research paper, titled "Predictive Risk Analysis for the Arbaeen Pilgrimage Crowds" which has been submitted for publication.
Methodology:
The dataset was created using a custom Python script that employs a hierarchical generation strategy:
Base Network: Predefined, real-world routes (such as the traditional Baghdad-Karbala path) were established as a guaranteed core.
Local Network: A dense, realistic local network was built by systematically connecting each geographical node (both real cities and synthetic points) to its nearest neighbors (a minimal sketch of this step follows this list).
Highway Network: A higher-level network was constructed by connecting only the major, real-world cities to each other, simulating main travel arteries.
Data Attributes: For each area, attributes such as visitor count, pressure, weather, and the presence of barriers or events were generated based on a set of rules to simulate realistic conditions. The final Risk_Degree for each area is a calculated metric based on these attributes.
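Below is a minimal sketch of the "Local Network" step referenced above: link every node to its k nearest neighbours by straight-line distance in coordinate space. The coordinates, the value of k, and the distance approximation are placeholders; the published generator may use different parameters.

```python
# Sketch: build a k-nearest-neighbour local network over geographical nodes.
import numpy as np
from scipy.spatial import cKDTree

def build_local_network(coords, k=3):
    """Return undirected edges (i, j) linking each node to its k nearest neighbours."""
    tree = cKDTree(coords)
    # query k+1 neighbours because the nearest hit of each point is itself
    _, idx = tree.query(coords, k=k + 1)
    edges = set()
    for i, neighbours in enumerate(idx):
        for j in neighbours[1:]:
            edges.add((min(i, j), max(i, j)))
    return edges

coords = np.array([[32.0, 44.4], [32.6, 44.0], [33.3, 44.4], [32.3, 43.9]])  # lat/lon placeholders
print(build_local_network(coords, k=2))
```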
Dataset Contents:
The dataset is provided as a single, comprehensive CSV file: artificial generated dataset for crowd.csv.
This file is ready for direct use and contains all the necessary data used in our study, fully merged into one table. It consists of 5,000 unique records, where each record represents a connection between two areas ("from" and "to"). Each row includes the following detailed information for both the origin and destination points:
Route Information: The specific from_area and to_area for each path segment.
Area Attributes: Key metrics such as Visitors, Pressure, Speed, and environmental factors like Weather and Event.
Calculated Risk Metrics: The final Risk_Degree and Actual_Behavior classification for each area.
Geographical Coordinates: The Latitude and Longitude for each area.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset has been meticulously prepared and utilized as a validation set during the evaluation phase of "Meta IDS" to assess the performance of various machine learning models. It is now made available for interested users and researchers who seek a reliable and diverse dataset for training and testing their own custom models. The validation dataset comprises a comprehensive collection of labeled entries, each of which indicates whether a packet is "malicious" or "benign." It covers complex design patterns that are commonly encountered in real-world applications. The dataset is designed to be representative, encompassing edge and fog layers that are in contact with the cloud layer, thereby enabling thorough testing and evaluation of different models. Each sample in the dataset is labeled with the corresponding ground truth, providing a reliable reference for model performance evaluation.
To ensure convenient distribution and storage, the dataset has been broken down into three separate batches, each containing a portion of the dataset. The three batches are provided as individual compressed files. To extract the data, follow these instructions:
1. Download and install bzip2 (if not already installed) from the official website or your package manager.
2. Place the compressed dataset file in a directory of your choice.
3. Open a terminal or command prompt and navigate to the directory where the compressed dataset file is located.
4. Execute the following command to uncompress the dataset: bzip2 -d filename.bz2 (replace "filename.bz2" with the actual name of the compressed dataset file).
Once uncompressed, you will have access to the dataset in its original format for further exploration, analysis, and model training. The total storage required for extraction is approximately 800 GB: the first batch requires approximately 302 GB, the second batch approximately 203 GB, and the third batch approximately 297 GB. The first batch contains 1,049,527,992 entries, the second batch contains 711,043,331 entries, and the third batch contains 1,029,303,062 entries. The following table provides the feature names along with their explanations and an example value once the dataset is extracted.
Feature | Description | Example Value |
---|---|---|
ip.src | Source IP address in the packet | a05d4ecc38da01406c9635ec694917e969622160e728495e3169f62822444e17 |
ip.dst | Destination IP address in the packet | a52db0d87623d8a25d0db324d74f0900deb5ca4ec8ad9f346114db134e040ec5 |
frame.time_epoch | Epoch time of the frame | 1676165569.930869 |
arp.hw.type | Hardware type | 1 |
arp.hw.size | Hardware size | 6 |
arp.proto.size | Protocol size | 4 |
arp.opcode | Opcode | 2 |
data.len | Length | 2713 |
eth.dst.lg | Destination LG bit | 1 |
eth.dst.ig | Destination IG bit | 1 |
eth.src.lg | Source LG bit | 1 |
eth.src.ig | Source IG bit | 1 |
frame.offset_shift | Time shift for this packet | 0 |
frame.len | Frame length on the wire | 1208 |
frame.cap_len | Frame length stored into the capture file | 215 |
frame.marked | Frame is marked | 0 |
frame.ignored | Frame is ignored | 0 |
frame.encap_type | Encapsulation type | 1 |
gre | Generic Routing Encapsulation | 'Generic Routing Encapsulation (IP)' |
ip.version | Version | 6 |
ip.hdr_len | Header length | 24 |
ip.dsfield.dscp | Differentiated Services Codepoint | 56 |
ip.dsfield.ecn | Explicit Congestion Notification | 2 |
ip.len | Total length | 614 |
ip.flags.rb | Reserved bit | 0 |
ip.flags.df | Don't fragment | 1 |
ip.flags.mf | More fragments | 0 |
ip.frag_offset | Fragment offset | 0 |
ip.ttl | Time to live | 31 |
ip.proto | Protocol | 47 |
ip.checksum.status | Header checksum status | 2 |
tcp.srcport | TCP source port | 53425 |
tcp.flags | Flags | 0x00000098 |
tcp.flags.ns | Nonce | 0 |
tcp.flags.cwr | Congestion Window Reduced (CWR) | 1 |
udp.srcport | UDP source port | 64413 |
udp.dstport | UDP destination port | 54087 |
udp.stream | Stream index | 1345 |
udp.length | Length | 225 |
udp.checksum.status | Checksum status | 3 |
packet_type | Type of the packet, either "benign" or "malicious" | 0 |
Furthermore, in compliance with the GDPR and to ensure the privacy of individuals, all IP addresses present in the dataset have been anonymized through hashing. This anonymization process helps protect the identity of individuals while preserving the integrity and utility of the dataset for research and model development purposes.
Please note that while the dataset provides valuable insights and a solid foundation for machine learning tasks, it is not a substitute for extensive real-world data collection. However, it serves as a valuable resource for researchers, practitioners, and enthusiasts in the machine learning community, offering a compliant and anonymized dataset for developing and validating custom models in a specific problem domain. By leveraging the validation dataset for machine learning model evaluation and custom model training, users can accelerate their research and development efforts, building upon the knowledge gained from my thesis while contributing to the advancement of the field.
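Given the batch sizes listed above, loading a whole batch into memory is usually impractical. The sketch below streams an extracted batch through pandas in chunks and tallies labels. The filename, the chunk size, the selected columns, the assumption that the extracted files are CSV-like, and the 0/1 label encoding are all hypothetical; adjust them to the actual format after running bzip2 -d.

```python
# Sketch: stream a large extracted batch in chunks instead of loading it whole.
import pandas as pd

USECOLS = ["frame.time_epoch", "ip.src", "ip.dst", "ip.len", "ip.proto", "packet_type"]

benign = malicious = 0
for chunk in pd.read_csv("batch1.csv", usecols=USECOLS, chunksize=1_000_000):
    counts = chunk["packet_type"].value_counts()
    benign += counts.get(0, 0)      # assumed encoding: 0 = benign (example value above)
    malicious += counts.get(1, 0)   # assumed encoding: 1 = malicious
print(benign, malicious)
```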