https://creativecommons.org/publicdomain/zero/1.0/
Context
The data presented here was obtained on a Kali Linux machine at the University of Cincinnati (Cincinnati, Ohio) by carrying out packet captures with Wireshark for one hour during the evening of October 9th, 2023. The dataset consists of 394,137 instances stored in a CSV (Comma-Separated Values) file. This large dataset can be used for different machine learning applications, for instance network traffic classification, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.
The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.
Content:
This network traffic dataset consists of 7 features. Each instance contains the source and destination IP addresses. The majority of the attributes are numeric, but there are also nominal and date types due to the Timestamp.
The network traffic flow statistics (No., Time, Source, Destination, Protocol, Length, Info) were obtained using Wireshark (https://www.wireshark.org/).
Dataset Columns:
No: Number of the instance.
Timestamp: Timestamp of the network traffic instance.
Source IP: IP address of the source.
Destination IP: IP address of the destination.
Protocol: Protocol used by the instance.
Length: Length of the instance.
Info: Information about the traffic instance.
Acknowledgements:
We would like to thank the University of Cincinnati for providing the infrastructure used to generate this network traffic dataset.
Ravikumar Gattu , Susmitha Choppadandi
Inspiration: This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP, ARP, RARP) that an IP flow contains. Instead, it supports building machine learning models that can identify specific applications (such as TikTok, Wikipedia, Instagram, YouTube, websites, blogs, etc.) from IP flow statistics (there are currently 25 applications in total).
**Dataset License:** CC0: Public Domain
Dataset Usages: This dataset can be used for different machine learning applications in the field of cybersecurity, such as network traffic classification, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.
ML techniques that benefit from this dataset:
This dataset is highly useful because it consists of 394,137 instances of network traffic generated by using the 25 applications on public, private, and enterprise networks. It also contains features that are important for most applications of machine learning in cybersecurity. A few of the potential machine learning applications that could benefit from this dataset are listed below (a loading and classification sketch follows the list):
1. Network Performance Monitoring: This large network traffic dataset can be used to analyse traffic and identify patterns in the network, which helps in designing network security algorithms that minimise network problems.
2. Anomaly Detection: The dataset can be used to train machine learning models that find irregularities in the traffic, which could help identify cyber attacks.
3. Network Intrusion Detection: The dataset can be used to train machine learning algorithms and design models for detecting traffic issues, malicious traffic, network attacks, and DoS attacks.
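As a minimal sketch of how such an export could be loaded and used for a simple traffic classification experiment (the file name is hypothetical and the exact column names may differ slightly from the Wireshark export), assuming Python with pandas and scikit-learn:

```python
# Minimal sketch: load the Wireshark CSV export and train a simple protocol classifier.
# Assumptions: a hypothetical file name "network_traffic.csv" and the columns described
# above (No, Timestamp/Time, Source, Destination, Protocol, Length, Info).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("network_traffic.csv")

def is_private(ip_series):
    # Crude indicator feature: private (RFC 1918) address prefixes.
    s = ip_series.astype(str)
    return (s.str.startswith("10.") | s.str.startswith("192.168.")).astype(int)

# Very small numeric feature set; real work would engineer flow-level statistics.
X = pd.DataFrame({
    "length": df["Length"],
    "src_private": is_private(df["Source"]),
    "dst_private": is_private(df["Destination"]),
})
y = df["Protocol"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))
```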
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
'InputData' Directory
This network dataset is an example of a network to which paths from other networks (i.e. Networks A and B) can be compared. Contains two directories:
a) NetworkC
b) NetworkPaths
'NetworkC' Directory
- This network is based upon a subset of the Missouri Department of Transportation (MoDOT) July 2016 road dataset listed in the references.
- NetworkC contains an ESRI .gdb (NetworkCdata.gdb) in which the arcs and nodes for Network C can be found as well as an ArcGIS ND Network Analyst configuration file.
  - Feature dataset: NetworkCsub
  - Network arcs: NetworkCsub
  - Network file: NetworkCsub_ND
  - Network junctions: NetworkCsub_ND_Junctions
'NetworkPaths' contains ESRI .gdbs representing:
a) A collection of routes between OD pairs in each network (InputPaths.gdb)
  - The densified routes used in the application (densified at 10 m): Net_A_routelines; Net_B_routelines; Net_C_routelines
  - The original routes with the original set of vertices (non-densified): Net_A_routes; Net_B_routes; Net_C_routes
b) The origin and destination points for the paths (ODNodes.gdb)
  - These were used to generate the shortest paths for each network, serving as the paths to be compared
  - origins: originLocations
  - destinations: destinationLocations
'OutputData' Directory
Contains the comparisons of paths to networks:
- NetAToB: comparison of paths from network A to network B
- NetAToC: comparison of paths from network A to network C
- NetBToA: comparison of paths from network B to network A
- NetBToC: comparison of paths from network B to network C
- NetCToA: comparison of paths from network C to network A
- NetCToB: comparison of paths from network C to network B
Inside each directory is a collection of ESRI .gdbs which contain the individual paths used in the analysis as input:
a) NetworkAPaths.gdb
b) NetworkBPaths.gdb
c) NetworkCPaths.gdb
Inside each directory is a collection of ESRI .gdbs which contain the vertices of the individual paths used in the analysis as input:
a) NetworkAPathPoints.gdb
b) NetworkBPathPoints.gdb
c) NetworkCPathPoints.gdb
Also included is a collection of ESRI .gdbs that represent the original path nodes that could be assigned to the comparison network. In this case, only nodes that were within 20 m of the comparison network could be assigned. Each path node is attributed with the distance to its counterpart node in the comparison.
a) Nodes in Network A paths assigned to Network B (PathANodesAssignedtoNetB.gdb)
b) Nodes in Network A paths assigned to Network C (PathANodesAssignedtoNetC.gdb)
c) Nodes in Network B paths assigned to Network A (PathBNodesAssignedtoNetA.gdb)
d) Nodes in Network B paths assigned to Network C (PathBNodesAssignedtoNetC.gdb)
e) Nodes in Network C paths assigned to Network A (PathCNodesAssignedtoNetA.gdb)
f) Nodes in Network C paths assigned to Network B (PathCNodesAssignedtoNetB.gdb)
Inside each directory is a collection of ESRI .gdbs which contain solutions to the SCPPOD with the following naming convention:
a) Comparing paths in Network A to Network B (SCCPODarcsPathAtoNetB.gdb for arc elements and SCCPODnodesPathAtoNetB.gdb for node elements)
  - The naming convention for the node solutions for path id X is 'SN_routeX_X'
  - The naming convention for the arc solutions for path id X is 'routX_Rt' for the single polyline counterpart path, and 'routeX_Rtsplit' for a polyline representation of the counterpart path based upon the SCPPOD node output.
b) Comparing paths in Network A to Network C (SCCPODarcsPathAtoNetC.gdb for arc elements and SCCPODnodesPathAtoNetC.gdb for node elements)
c) Comparing paths in Network B to Network A (SCCPODarcsPathBtoNetA.gdb for arc elements and SCCPODnodesPathBtoNetA.gdb for node elements)
d) Comparing paths in Network B to Network C (SCCPODarcsPathBtoNetC.gdb for arc elements and SCCPODnodesPathBtoNetC.gdb for node elements)
e) Comparing paths in Network C to Network A (SCCPODarcsPathCtoNetA.gdb for arc elements and SCCPODnodesPathCtoNetA.gdb for node elements)
f) Comparing paths in Network C to Network B (SCCPODarcsPathCtoNetB.gdb for arc elements and SCCPODnodesPathCtoNetB.gdb for node elements)
The counterpart paths that were identified were then linked to the full network C to summarize the frequency with which arcs were associated with paths. These can be found in:
1. PathARepresentationinNetC.gdb
2. PathARepresentationinNetC.gdb
Important attributes:
a) vcntarc: number of paths utilizing the arc
b) ptCnt: number of path vertices associated with each arc
c) AvgDist: average distance of path vertices from network arcs
d) MinDist: minimum distance of path vertices from network arcs
e) MaxDist: maximum distance of path vertices from network arcs
Documentation for Network Traffic Dataset
Dataset Overview
This dataset consists of network traffic captured from a Kali Linux machine, aimed at helping the development and evaluation of machine learning models for distinguishing between normal and malicious (specifically flood attack) network activities. It includes a variety of features essential for identifying potential cybersecurity threats alongside labels indicating whether each packet is part of flood traffic.
Data Collection Methodology
The dataset was carefully compiled using network traffic captured from a dedicated Kali Linux setup. The capture environment consisted of a Kali Linux machine configured to generate and capture both normal and malicious network traffic and a target machine running a Windows OS to simulate a real-world network environment.
Traffic Generation:
Normal Traffic: Involved routine network activities such as web browsing and pinging between the Kali Linux machine and the Windows machine.
Malicious Traffic: Utilized hping3 to simulate flood attacks, specifically ICMP flood attacks, targeting the Windows machine from the Kali Linux machine [1].
Capture Process: Wireshark was used on the Kali Linux machine to capture all incoming and outgoing network traffic [2]. The capture was set up to record detailed packet information, including timestamps, source and destination IP addresses, ports, and protocols. The captures were conducted with careful monitoring to precisely mark the start and end times of the flood attack for accurate dataset labeling.
Dataset Description
The dataset is a CSV file containing a comprehensive collection of network traffic packets labeled to distinguish between normal and malicious traffic. It includes the following columns:
Timestamp: The capture time of each packet, providing insights into the traffic flow and enabling analysis of traffic patterns over time.
Source IP Address: Identifies the origin of the packet, crucial for pinpointing potential sources of attacks.
Destination IP Address: Indicates the packet's intended recipient, useful for identifying targeted resources.
Source Port and Destination Port: Offer insights into the services involved in the communication.
Protocol: Specifies the protocol used, such as TCP, UDP, or ICMP, essential for analyzing the nature of the traffic.
Length: The size of the packet in bytes, which can signal unusual traffic patterns often associated with malicious activities.
bad_packet: A binary label with 1 indicating traffic identified as part of a flood attack and 0 denoting normal traffic. Precise timestamps marking the start and end of flood attacks were used to accurately label this column. Packets captured within these defined intervals were marked as malicious (bad_packet = 1), whereas all others were considered normal traffic. Python and Pandas were used for the labeling process [3][4] (see the sketch after this list).
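A hedged sketch of this interval-based labeling with Pandas follows; the file name, column name, and attack window are illustrative assumptions, not the authors' exact script:

```python
# Illustrative sketch of interval-based labeling with Pandas.
# Assumptions: a hypothetical export "capture.csv" with a parseable "Timestamp" column,
# and an attack window noted while running the hping3 flood (example values only).
import pandas as pd

df = pd.read_csv("capture.csv", parse_dates=["Timestamp"])

attack_start = pd.Timestamp("2024-03-01 14:05:00")  # noted start of the ICMP flood
attack_end = pd.Timestamp("2024-03-01 14:20:00")    # noted end of the ICMP flood

# bad_packet = 1 for packets captured inside the attack window, 0 otherwise.
df["bad_packet"] = df["Timestamp"].between(attack_start, attack_end).astype(int)

df.to_csv("labeled_capture.csv", index=False)
```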
Potential Applications
a. Intrusion Detection Systems (IDS): The dataset can be used to train models that enhance IDS capabilities, enabling more effective detection of flood-based network attacks. b. Network Traffic Monitoring: Tools that make use of machine learning can leverage the dataset for more accurate network traffic monitoring, identifying and alerting on suspicious activities in real time. c. Cybersecurity Training: Educational institutions and training programs can use the dataset to provide practical experience in machine learning-based threat detection.
Proposed Machine Learning Technique: Supervised Machine Learning, specifically Deep Learning with Convolutional Neural Networks (CNNs).
CNNs, although they are usually used for image processing, have shown promise in analyzing sequential data. The spatial hierarchy in network packets (from individual bytes to overall packet structure) can be analogous to the patterns CNNs excel at identifying. Utilizing CNNs could allow the extraction of complex patterns in network traffic that indicate malicious activity, improving detection accuracy beyond traditional methods.
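One possible realization of this idea (not the authors' implementation) is a small 1-D convolutional network over short sequences of per-packet numeric features; the input shape, features, and hyperparameters below are assumptions:

```python
# Sketch of a 1-D CNN over sequences of per-packet features (e.g. length, protocol id,
# inter-arrival time). Shapes, features, and hyperparameters are illustrative only.
import numpy as np
import tensorflow as tf

SEQ_LEN = 20      # packets per example (assumption)
N_FEATURES = 3    # e.g. length, protocol id, inter-arrival time (assumption)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN, N_FEATURES)),
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of bad_packet = 1
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy tensors with the assumed shapes, just to show the expected input/output.
X = np.random.rand(128, SEQ_LEN, N_FEATURES).astype("float32")
y = np.random.randint(0, 2, size=(128, 1))
model.fit(X, y, epochs=1, batch_size=32, verbose=0)
```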
Conclusion
This dataset represents a significant step towards using machine learning for cybersecurity, specifically in the fields of intrusion detection and network monitoring. By providing a detailed and accurately labeled dataset of normal and malicious network traffic, it lays the groundwork for developing complex models capable of identifying and mitigating flood attacks in real time. In the future, a broader range of attack types and more traffic patterns could be included, further enhancing the dataset's utility and the effectiveness of models trained on it.
References
[1] https://linux.die.net/man/8/hping3
[2] https://www.wireshark.org/docs/
[3] https://pandas.pydata.org/docs/
[4] https://docs.python.org/3/tutorial/index.html
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Helsinki Region Travel Time Matrix contains travel time and distance information for routes between all 250 m x 250 m grid cell centroids (n = 13231) in the Helsinki Region, Finland by walking, cycling, public transportation and car. The grid cells are compatible with the statistical grid cells used by Statistics Finland and the YKR (yhdyskuntarakenteen seurantajärjestelmä) data set. The Helsinki Region Travel Time Matrix is available for three different years:
The data consists of travel time and distance information of the routes that have been calculated between all statistical grid cell centroids (n = 13231) by walking, cycling, public transportation and car.
The data have been calculated for two different times of the day: 1) midday and 2) rush hour.
The data may be used freely (under Creative Commons 4.0 licence). We do not take any responsibility for any mistakes, errors or other deficiencies in the data.
Organization of data
The data have been divided into 13231 text files according to destinations of the routes. The data files have been organized into sub-folders that contain multiple (approx. 4-150) Travel Time Matrix result files. Individual folders consist of all the Travel Time Matrices that have same first four digits in their filename (e.g. 5785xxx).
In order to visualize the data on a map, the result tables can be joined with the MetropAccess YKR-grid shapefile (attached here). The data can be joined by using the field ‘from_id’ in the text files and the field ‘YKR_ID’ in MetropAccess-YKR-grid shapefile as a common key.
Data structure
The data have been divided into 13231 text files according to destinations of the routes. One file includes the routes from all statistical grid cells to a particular destination grid cell. All files have been named according to the destination grid cell code and each file includes 13231 rows.
NODATA values have been stored as value -1.
Each file consists of 18 attribute fields: 1) from_id, 2) to_id, 3) walk_t, 4) walk_d, 5) bike_f_t, 6) bike_s_t, 7) bike_d, 8) pt_r_tt, 9) pt_r_t, 10) pt_r_d, 11) pt_m_tt, 12) pt_m_t, 13) pt_m_d, 14) car_r_t, 15) car_r_d, 16) car_m_t, 17) car_m_d, 18) car_sl_t
The fields are separated by semicolon in the text files.
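A minimal sketch of reading one destination file and joining it to the YKR grid for mapping; the file and shapefile paths are placeholders and the geopandas usage is an assumption:

```python
# Sketch: read one Travel Time Matrix text file (semicolon-separated, NODATA = -1)
# and join it to the MetropAccess YKR grid. Paths below are placeholder assumptions.
import pandas as pd
import geopandas as gpd

ttm = pd.read_csv("travel_times_to_5785640.txt", sep=";", na_values=-1)

grid = gpd.read_file("MetropAccess_YKR_grid.shp")
joined = grid.merge(ttm, left_on="YKR_ID", right_on="from_id", how="left")

# e.g. inspect rush-hour public transport travel times to this destination cell
print(joined[["YKR_ID", "pt_r_t"]].head())
```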
Attributes
METHODS
For detailed documentation and how to reproduce the data, see HelsinkiRegionTravelTimeMatrix2018 GitHub repository.
THE ROUTE BY CAR have been calculated with a dedicated open source tool called DORA (DOor-to-door Routing Analyst) developed for this project. DORA uses PostgreSQL database with PostGIS extension and is based on the pgRouting toolkit. MetropAccess-Digiroad (modified from the original Digiroad data provided by Finnish Transport Agency) has been used as a street network in which the travel times of the road segments are made more realistic by adding crossroad impedances for different road classes.
The calculations have been repeated for two times of the day using 1) the “midday impedance” (i.e. travel times outside rush hour) and 2) the “rush hour impedance” as impedance in the calculations. Moreover, there is 3) the “speed limit impedance” included in the matrix (i.e. using speed limits without any additional impedances).
The whole travel chain (“door-to-door approach”) is taken into account in the calculations:
1) walking time from the real origin to the nearest network location (based on Euclidean distance),
2) average walking time from the origin to the parking lot,
3) travel time from parking lot to destination,
4) average time for searching a parking lot,
5) walking time from parking lot to nearest network location of the destination and
6) walking time from network location to the real destination (based on Euclidean distance).
THE ROUTES BY PUBLIC TRANSPORTATION have been calculated by using the MetropAccess-Reititin tool which also takes into account the whole travel chains from the origin to the destination:
1) possible waiting at home before leaving,
2) walking from home to the transit stop,
3) waiting at the transit stop,
4) travel time to next transit stop,
5) transport mode change,
6) travel time to next transit stop and
7) walking to the destination.
Travel times by public transportation have been optimized using 10 different departure times within the calculation hour, spread using a so-called Golomb ruler. The fastest route from these calculations is selected for the final travel time matrix.
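To illustrate the idea (whether this exact ruler was used in the original computation is an assumption), ten departure times can be spread over the calculation hour with a Golomb ruler, i.e. a set of marks whose pairwise differences are all distinct:

```python
# Illustration: spread 10 departure times within a one-hour window using a Golomb ruler.
# The marks below form the optimal order-10 Golomb ruler (length 55 minutes); using this
# particular ruler here is an assumption for illustration.
from datetime import datetime, timedelta
from itertools import combinations

GOLOMB_10 = [0, 1, 6, 10, 23, 26, 34, 41, 53, 55]  # minutes within the hour

# Sanity check: every pairwise difference between marks is unique.
diffs = [b - a for a, b in combinations(GOLOMB_10, 2)]
assert len(diffs) == len(set(diffs))

hour_start = datetime(2018, 1, 29, 8, 0)  # example calculation hour
for minutes in GOLOMB_10:
    print((hour_start + timedelta(minutes=minutes)).strftime("%H:%M"))
```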
THE ROUTES BY CYCLING are also calculated using the DORA tool. The network dataset underneath is MetropAccess-CyclingNetwork, which is a modified version from the original Digiroad data provided by Finnish Transport Agency. In the dataset the travel times for the road segments have been modified to be more realistic based on Strava sports application data from the Helsinki region from 2016 and the bike sharing system data from Helsinki from 2017.
For each road segment a separate speed value was calculated for slow and fast cycling. The value for fast cycling is based on the percentage difference between the segment-specific Strava speed value and the average speed value for the whole Strava dataset. The same percentage difference has been applied to calculate the slower speed value for each road segment: the speed value is the average speed of bike sharing system users multiplied by the percentage difference value.
The reference value for faster cycling has been 19 km/h, which is based on the average speed of Strava sports application users in the Helsinki region. The reference value for slower cycling has been 12 km/h, which has been the average travel speed of bike sharing system users in Helsinki. An additional 1 minute has been added to the travel time to account for the time needed for taking (30 s) and returning (30 s) the bike at the origin/destination.
More information about the Strava dataset that was used can be found in the Cycling routes and fluency report, which was published by us and the City of Helsinki.
THE ROUTES BY WALKING were also calculated using MetropAccess-Reititin by disabling all motorized transport modes in the calculation. Thus, all routes are based on the OpenStreetMap geometry.
The walking speed has been adjusted to 70 meters per minute, which is the default speed in the HSL Journey Planner (also in the calculations by public transportation).
All calculations were done using the computing resources of CSC-IT Center for Science (https://www.csc.fi/home).
This dataset provides Origin and Destination reports derived from the Automatic Number Plate Recognition (ANPR) camera traffic survey undertaken across the Cambridge area from 10th to 17th June 2017. The aim of the survey work was to help provide a firm evidence base for future Greater Cambridge Partnership decisions, by improving our understanding of how the network is being used and the impacts of vehicle use. The Origin and Destination Reports provide information on the first and last cameras triggered on vehicle journeys across the city. Please note that the maximum trip chain duration within the reports is two hours, and that vehicles travelling ‘outbound’ past an external camera site will end that particular trip chain. The ‘Taxi’ classification includes only Hackney Carriages. Please also note that these reports are preliminary and are undergoing review. The reports may be subject to change and revisions released in the fullness of time. The Greater Cambridge Partnership team welcome your feedback. Please email us on contactus@greatercambridge.org.uk. The Trip Chain Reports (available at http://opendata.cambridgeshireinsight.org.uk/dataset/greater-cambridge-a...) provide additional detail, giving the camera survey sites triggered along vehicles’ routes across the Highway network. Due to the extensive amount of data recorded, the data collected for each day has been divided into three files. Each file contains a 'summary' worksheet for the relevant day, but the data for individual cameras have been divided for each day between camera locations 1-45, 46-70 and 72-96. You can view the camera locations on the 'location plan' worksheet of each file.
This data set was acquired by the USDOT Data Capture and Management program. The purpose of the data set is to provide multi-modal data and contextual information (weather and incidents) that can be used to research and develop applications. It contains one full year (January – December 2010) of raw 30-second data for over 3,000 traffic detectors deployed along 1,250 lane miles of monitored roadway in San Diego; cleaned and geographically referenced data for over 1,500 incidents and lane closures for the two sections of I-5 that experienced the greatest number of incidents during 2010; complete trip (origin-to-destination) GPS “breadcrumbs” collected by ALK Technologies, containing latitude/longitude, vehicle heading and speed data, and time for individual in-vehicle devices updated at 3-second intervals for over 10,000 trips taken during 2010; a digital map shapefile containing ALK’s street-level network data for the San Diego metropolitan area; and San Diego weather data for 2010. This legacy dataset was created before data.transportation.gov and is only currently available via the attached file(s). Please contact the dataset owner if there is a need for users to work with this data using the data.transportation.gov analysis features (online viewing, API, graphing, etc.) and the USDOT will consider modifying the dataset to fully integrate it into data.transportation.gov.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The NetworkDataFolder includes a network dataset created to investigate wetland connectivity using a multi-criteria optimization approach. Three arc attributes represent cost associated with movement on the landscape. The digital elevation model is used to compute the topographic wetness index and cost related to elevation change. The landuse/landcover layer is the basis for calculating the likelihood of successful traversal. The built network consists of 12 wetlands serving as origin and destination, and 1277 network arcs. For each arc there are several arc attributes as described below:
ID1: reference to arc endpoint at the lower altitude
ID2: reference to arc endpoint at the higher altitude
DEM1: elevation at endpoint 1
DEM2: elevation at endpoint 2
DEMc: elevation at midpoint of arc
TWI1: topographic wetness index at endpoint 1
TWI2: topographic wetness index at endpoint 2
TWIc: topographic wetness index at midpoint of arc
LUL1: landuse/landcover type at endpoint 1
LUL2: landuse/landcover type at endpoint 2
LULc: landuse/landcover type at midpoint of arc
LULb: base successful traversal likelihood associated with each landuse/landcover type
LULf: final value of successful traversal likelihood (considering arc length)
DIST: arc length in meters
The ImplementationScript folder includes two solution approaches to identify Pareto-optimal solutions on the frontier of objective space in our multiobjective optimization model. The exact approach constructs the full efficient set and the approximate approach estimates the supported efficient solutions. The output for the two solution methods is available in the PathFolder.
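As a rough, hedged sketch of how such an arc table could be explored (a simple weighted-sum scalarization for illustration only, not the exact or approximate solution approach provided in the ImplementationScript folder; the file name, node IDs, and weights are assumptions):

```python
# Rough sketch: build a graph from the arc attributes described above and find a
# least-cost path under a weighted-sum scalarization of distance and traversal risk.
# The CSV export name, node IDs, and weights are illustrative assumptions.
import pandas as pd
import networkx as nx

arcs = pd.read_csv("network_arcs.csv")  # hypothetical export of the 1277 arcs

G = nx.Graph()
for _, a in arcs.iterrows():
    # Combine arc length (DIST) with failure likelihood (1 - LULf) into one cost.
    cost = 0.5 * a["DIST"] + 0.5 * (1.0 - a["LULf"]) * a["DIST"]
    G.add_edge(a["ID1"], a["ID2"], weight=cost)

# Least-cost route between two wetland nodes (placeholder IDs).
print(nx.shortest_path(G, source=1, target=7, weight="weight"))
```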
This dataset contains data on original and post-calibration mileposts, Traffic Message Channel location codes (TMC), Truck Travel Time Reliability Index, Travel Time Index (TTI), TMC mileage, and corridor identification segment for the destination to origin direction of the National Performance Monitoring Research Data Set (NPMRDS) network.
To generate a representative dataset of real-world traffic in ISCX we defined a set of tasks, assuring that our dataset is rich enough in diversity and quantity. We created accounts for users Alice and Bob in order to use services like Skype, Facebook, etc. Below we provide the complete list of different types of traffic and applications considered in our dataset for each traffic type (VoIP, P2P, etc.)
We captured a regular session and a session over VPN, therefore we have a total of 14 traffic categories: VOIP, VPN-VOIP, P2P, VPN-P2P, etc. We also give a detailed description of the different types of traffic generated:
Browsing: Under this label we have HTTPS traffic generated by users while browsing or performing any task that includes the use of a browser. For instance, when we captured voice-calls using hangouts, even though browsing is not the main activity, we captured several browsing flows.
Email: The traffic samples were generated using a Thunderbird client and Alice's and Bob's Gmail accounts. The clients were configured to deliver mail through SMTP/S, and receive it using POP3/SSL in one client and IMAP/SSL in the other.
Chat: The chat label identifies instant-messaging applications. Under this label we have Facebook and Hangouts via web browsers, Skype, and AIM and ICQ using an application called Pidgin [14].
Streaming: The streaming label identifies multimedia applications that require a continuous and steady stream of data. We captured traffic from Youtube (HTML5 and flash versions) and Vimeo services using Chrome and Firefox.
File Transfer: This label identifies traffic applications whose main purpose is to send or receive files and documents. For our dataset we captured Skype file transfers, FTP over SSH (SFTP) and FTP over SSL (FTPS) traffic sessions.
VoIP: The Voice over IP label groups all traffic generated by voice applications. Within this label we captured voice calls using Facebook, Hangouts and Skype.
P2P: This label is used to identify file-sharing protocols like BitTorrent. To generate this traffic we downloaded different .torrent files from a public repository and captured traffic sessions using the uTorrent and Transmission applications.
The traffic was captured using Wireshark and tcpdump, generating a total amount of 28GB of data. For the VPN, we used an external VPN service provider and connected to it using OpenVPN (UDP mode). To generate SFTP and FTPS traffic we also used an external service provider and Filezilla as a client.
To facilitate the labeling process, when capturing the traffic all unnecessary services and applications were closed. (The only application executed was the objective of the capture, e.g., Skype voice-call, SFTP file transfer, etc.) We used a filter to capture only the packets with source or destination IP, the address of the local client (Alice or Bob).
The full research paper outlining the details of the dataset and its underlying principles:
Gerard Draper-Gil, Arash Habibi Lashkari, Mohammad Mamun, Ali A. Ghorbani, "Characterization of Encrypted and VPN Traffic Using Time-Related Features", In Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), pages 407-414, Rome, Italy.
ISCXFlowMeter has been written in Java for reading the pcap files and creating the csv file based on selected features. The UNB ISCX Network Traffic (VPN-nonVPN) dataset consists of labeled network traffic, including full packets in pcap format and csv files (flows generated by ISCXFlowMeter), both of which are publicly available for researchers.
For more information contact cic@unb.ca.
The UNB ISCX Network Traffic Dataset content
Traffic: Content
Web Browsing: Firefox and Chrome
Email: SMTPS, POP3S and IMAPS
Chat: ICQ, AIM, Skype, Facebook and Hangouts
Streaming: Vimeo and Youtube
File Transfer: Skype, FTPS and SFTP using Filezilla and an external service
VoIP: Facebook, Skype and Hangouts voice calls (1h duration)
P2P: uTorrent and Transmission (Bittorrent)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record provides a dataset created as part of the study presented in the following publication and is made publicly available for research purposes. The associated article provides a comprehensive description of the dataset, its structure, and the methodology used in its creation. If you use this dataset, please cite the following article published in the journal IEEE Communications Magazine:
A. Karamchandani, J. Nunez, L. de-la-Cal, Y. Moreno, A. Mozo, and A. Pastor, “On the Applicability of Network Digital Twins in Generating Synthetic Data for Heavy Hitter Discrimination,” IEEE Communications Magazine, pp. 2–8, 2025, DOI: 10.1109/MCOM.003.2400648.
More specifically, the record contains several synthetic datasets generated to differentiate between benign and malicious heavy hitter flows within a realistic virtualized network environment. Heavy Hitter flows, which include high-volume data transfers, can significantly impact network performance, leading to congestion and degraded quality of service. Distinguishing legitimate heavy hitter activity from malicious Distributed Denial-of-Service traffic is critical for network management and security, yet existing datasets lack the granularity needed for training machine learning models to effectively make this distinction.
To address this, a Network Digital Twin (NDT) approach was utilized to emulate realistic network conditions and traffic patterns, enabling automated generation of labeled data for both benign and malicious HH flows alongside regular traffic.
The feature set includes the following flow statistics commonly used in the literature on network traffic classification:
To accommodate diverse research needs and scenarios, the dataset is provided in the following variations:
All at Once:
Balanced Traffic Generation:
DDoS at Intervals:
Only Benign HH Traffic:
Only DDoS Traffic:
Only Normal Traffic:
Unbalanced Traffic Generation:
For each variation, the output of the different packet aggregators is provided separately in its respective folder.
Each variation was generated using the NDT approach to demonstrate its flexibility and ensure the reproducibility of our study's experiments, while also contributing to future research on network traffic patterns and the detection and classification of heavy hitter traffic flows. The dataset is designed to support research in network security, machine learning model development, and applications of digital twin technology.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Network traffic datasets with novel extended IP flow called NetTiSA flow
Datasets were created for the paper: NetTiSA: Extended IP Flow with Time-series Features for Universal Bandwidth-constrained High-speed Network Traffic Classification -- Josef Koumar, Karel Hynek, Jaroslav Pešek, Tomáš Čejka -- which is published in The International Journal of Computer and Telecommunications Networking https://doi.org/10.1016/j.comnet.2023.110147
Please cite the usage of our datasets as:
Josef Koumar, Karel Hynek, Jaroslav Pešek, Tomáš Čejka, "NetTiSA: Extended IP flow with time-series features for universal bandwidth-constrained high-speed network traffic classification", Computer Networks, Volume 240, 2024, 110147, ISSN 1389-1286
@article{KOUMAR2024110147, title = {NetTiSA: Extended IP flow with time-series features for universal bandwidth-constrained high-speed network traffic classification}, journal = {Computer Networks}, volume = {240}, pages = {110147}, year = {2024}, issn = {1389-1286}, doi = {https://doi.org/10.1016/j.comnet.2023.110147}, url = {https://www.sciencedirect.com/science/article/pii/S1389128623005923}, author = {Josef Koumar and Karel Hynek and Jaroslav Pešek and Tomáš Čejka} }
This Zenodo repository contains 23 datasets created from 15 well-known published datasets, which are cited in the table below. Each dataset contains the NetTiSA flow feature vector.
NetTiSA flow feature vector
The novel extended IP flow called NetTiSA (Network Time Series Analysed) flow contains a universal bandwidth-constrained feature vector consisting of 20 features. We divide the NetTiSA flow classification features into three groups by computation. The first group of features is based on classical bidirectional flow information: the number of transferred bytes and packets. The second group contains statistical and time-based features calculated using time-series analysis of the packet sequences. The third group of features can be computed from the previous two groups (i.e., on the flow collector) and improves the classification performance without any impact on the telemetry bandwidth.
Flow features
The flow features are:
Statistical and Time-based features
These are the features exported in the extended part of the flow. All of them can be computed (exactly or approximately) by stream-wise computation, which is necessary for keeping memory requirements low. This second feature set contains the following features:
where \(s_n\) is the number of switches.
Features computed at the collector
The third set contains features that are computed from the previous two groups prior to classification. Therefore, they do not influence the network telemetry size and their computation does not put additional load to resource-constrained flow monitoring probes. The NetTiSA flow combined with this feature set is called the Enhanced NetTiSA flow and contains the following features:
The NetTiSA flow is implemented into IP flow exporter ipfixprobe.
Description of dataset files
In the following table is a description of each dataset file:
File name | Detection problem | Citation of the original raw dataset |
botnet_binary.csv | Binary detection of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
botnet_multiclass.csv | Multi-class classification of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
cryptomining_design.csv | Binary detection of cryptomining; the design part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
cryptomining_evaluation.csv | Binary detection of cryptomining; the evaluation part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
dns_malware.csv | Binary detection of malware DNS | Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021. |
doh_cic.csv | Binary detection of DoH | Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020 |
doh_real_world.csv | Binary detection of DoH | Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022 |
dos.csv | Binary detection of DoS | Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019. |
edge_iiot_binary.csv | Binary detection of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
edge_iiot_multiclass.csv | Multi-class classification of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
This dataset comprises NetFlow records, which capture the outbound network traffic of 8 commercial IoT devices and 5 non-IoT devices, collected during a period of 37 days in a lab at Ben-Gurion University of the Negev. The dataset was collected in order to develop a method for telecommunication providers to detect vulnerable IoT models behind home NATs. Each NetFlow record is labeled with the device model which produced it; for research reproducibility, each NetFlow is also allocated to either the "training" or "test" set, in accordance with the partitioning described in:
Y. Meidan, V. Sachidananda, H. Peng, R. Sagron, Y. Elovici, and A. Shabtai, A novel approach for detecting vulnerable IoT devices connected behind a home NAT, Computers & Security, Volume 97, 2020, 101968, ISSN 0167-4048, https://doi.org/10.1016/j.cose.2020.101968. (http://www.sciencedirect.com/science/article/pii/S0167404820302418)
Please note:
# NetFlow features, used in the related paper for analysis
'FIRST_SWITCHED': System uptime at which the first packet of this flow was switched
'IN_BYTES': Incoming counter for the number of bytes associated with an IP Flow
'IN_PKTS': Incoming counter for the number of packets associated with an IP Flow
'IPV4_DST_ADDR': IPv4 destination address
'L4_DST_PORT': TCP/UDP destination port number
'L4_SRC_PORT': TCP/UDP source port number
'LAST_SWITCHED': System uptime at which the last packet of this flow was switched
'PROTOCOL': IP protocol byte (6: TCP, 17: UDP)
'SRC_TOS': Type of Service byte setting when there is an incoming interface
'TCP_FLAGS': Cumulative of all the TCP flags seen for this flow
# Features added by the authors
'IP': Prefix of the destination IP address, representing the network (without the host)
'DURATION': Time (seconds) between first/last packet switching
# Label
'device_model':
# Partition
'partition': Training or test
# Additional NetFlow features (mostly zero-variance)
'SRC_AS': Source BGP autonomous system number
'DST_AS': Destination BGP autonomous system number
'INPUT_SNMP': Input interface index
'OUTPUT_SNMP': Output interface index
'IPV4_SRC_ADDR': IPv4 source address
'MAC': MAC address of the source
# Additional data
'category': IoT or non-IoT
'type': IoT, access_point, smartphone, laptop
'date': Datepart of FIRST_SWITCHED
'inter_arrival_time': Time (seconds) between successive flows of the same device (identified by its MAC address)
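A brief sketch of recomputing the two derived fields (DURATION and inter_arrival_time) from the raw NetFlow columns with Pandas; the file name is a placeholder and it is assumed the exported FIRST_SWITCHED/LAST_SWITCHED values are parseable timestamps:

```python
# Sketch: recompute DURATION and inter_arrival_time from the NetFlow columns above.
# Assumptions: a placeholder file name and timestamp-parseable FIRST_SWITCHED/LAST_SWITCHED.
import pandas as pd

flows = pd.read_csv("netflows.csv", parse_dates=["FIRST_SWITCHED", "LAST_SWITCHED"])

# Time (seconds) between first and last packet switching of each flow.
flows["DURATION"] = (flows["LAST_SWITCHED"] - flows["FIRST_SWITCHED"]).dt.total_seconds()

# Time (seconds) between successive flows of the same device, identified by MAC address.
flows = flows.sort_values(["MAC", "FIRST_SWITCHED"])
flows["inter_arrival_time"] = (
    flows.groupby("MAC")["FIRST_SWITCHED"].diff().dt.total_seconds()
)
print(flows[["MAC", "DURATION", "inter_arrival_time"]].head())
```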
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are publishing a dataset we created for designing a brute-force detector of attacks on HTTPS. The dataset consists of extended network flows that we captured with the flow exporter ipfixprobe. Apart from traditional fields like source and destination IP addresses and ports, each flow contains information (size, direction, inter-packet time, TCP flags) about up to the first 100 packets. The sizes of packets are taken from the transport layer (TCP, UDP); packets with zero payload (e.g., TCP ACKs) are ignored.
We publish three files:
flows.csv, which contains raw flow data.
aggregated_flows.csv, which contains aggregated flows.
samples.csv, which contains samples with extracted features. This data can be used for training a machine-learning classification model.
All IP addresses, source ports, and TLS SNIs are SHA-256 hashed. The column CLASS is 0 for benign samples and 1 for brute-force samples.
Brute-force data
The brute-force data were generated with three popular attack tools: Ncrack, THC-Hydra, and Patator. Attacks were performed against these applications:
WordPress
Joomla
MediaWiki
Ghost
Grafana
Discourse
PhpBB
OpenCart
Redmine
Nginx
Apache
The SCENARIO columns indicate which tool and application were used to generate the sample.
Benign data
The benign data consist of eight captures from a backbone network. The SCENARIO column indicates individual captures.
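As a minimal sketch, samples.csv could be used to train a binary detector; only the CLASS label (0 = benign, 1 = brute-force) is taken from the description above, and the remaining numeric columns are simply treated as features:

```python
# Minimal sketch: train a brute-force detector on samples.csv.
# Assumption: besides CLASS, the extracted feature columns are numeric; non-numeric
# columns (e.g. hashed identifiers, SCENARIO) are dropped.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

samples = pd.read_csv("samples.csv")

y = samples["CLASS"]
X = samples.drop(columns=["CLASS"]).select_dtypes("number")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
clf = GradientBoostingClassifier().fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))
```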
Measure and Map Access to Grocery Stores
From the perspective of the people living in each neighborhood
How do people in your city get to the grocery store? The answer to that question depends on the person and where they live. This collection of layers, maps and apps help answer the question.
Some live in cities and stop by a grocery store within a short walk or bike ride of home or work. Others live in areas where car ownership is more prevalent, and so they drive to a store. Some do not own a vehicle, and rely on a friend or public transit. Others rely on grocery delivery for their needs. And, many live in rural areas far from town, so a trip to a grocery store is an infrequent event involving a long drive.
This map from Esri shows which areas are within a ten minute walk (in green) or ten minute drive (in blue) of a grocery store in the United States and Puerto Rico. Darker color indicates access to more stores. Summarizing this data shows that 20% of U.S. population live within a 10 minute walk of a grocery store, and 90% of the population live within a 10 minute drive of a grocery store. Click on the map to see a summary for each state.
Every census block is scored with a count of walkable and drivable stores nearby, making this a map suitable for a dashboard for any city, or any of the 50 states, DC and Puerto Rico. Two colorful layers visualize this definition of access, one for walkable access (suitable for looking at a city neighborhood by neighborhood) and one for drivable access (suitable for looking across a city, county, region or state).
On the walkable layer, shades of green define areas within a ten minute walk of one or more grocery stores. The colors become more intense and trend to a blue-green color for the busiest neighborhoods, such as downtown San Francisco. As you zoom in, a layer of Census block points visualizes the local population with or without walkable access. As you zoom out to see the entire city, the map adds a light blue to dark blue layer, showing which parts of the region fall within ten minutes' drive of one or more grocery stores. As a result, the map is useful at all scales, from national to regional, state and local levels. It becomes easier to spot grocery stores that sit within a highly populated area, and grocery stores that sit in a shopping center far away from populated areas. This view of a city begins to hint at the question: how many people have each type of access to grocery stores? And, what if they are unable to walk a mile regularly, or don't own a car?
How to Use This Map
Use this map to introduce the concepts of access to grocery stores in your city or town. This is the kind of map where people will want to look up their home or work address to validate what the map is saying against their own experiences. The map was built with that use in mind. Many maps of access use straight-line, as-the-crow-flies distance, which ignores real-world barriers to walkability like rivers, lakes, interstates and other characteristics of the built environment. Block analysis using a network data set and Origin-Destination analysis factors these barriers in, resulting in a more realistic depiction of access. There is data behind the map, which can be summarized to show how many people have walkable access to local grocery stores. The map includes a feature layer of population in Census block points, which are visible when you zoom in far enough.
This feature layer of Census block centroids can be plugged into an app like this one that summarizes the population with/without walkable or drivable access. Lastly, this map can serve as backdrop to other community resources, like food banks, farmers markets (example), and transit (example). Add a transit layer to immediately gauge its impact on the population's grocery access. You can also use this map to see how it relates to communities of concern. Add a layer of any block group or tract demographics, such as Percent Senior Population (examples), or Percent of Households with Access to 0 Vehicles (examples). The map is a useful visual and analytic resource for helping community leaders, business and government leaders see their town from the perspective of its residents, and begin asking questions about how their community could be improved.
Data sources
Population data is from the 2020 U.S. Census blocks. Each census block has a count of stores within a 10 minute walk, and a count of stores within a ten minute drive. Census blocks known to be unpopulated are given a score of 0. The layer is available as a hosted feature layer. Grocery store locations are from SafeGraph, reflecting what was in the data as of September 2024. For this project, ArcGIS StreetMap Premium was used for the street network in the origin-destination analysis work, because it has the necessary attributes on each street segment to identify which streets are considered walkable, and supports a wide variety of driving parameters. The walkable access layer and drivable access layers are rasters, whose colors were chosen to allow the drivable access layer to serve as backdrop to the walkable access layer.
Data Preparation
ArcGIS Network Analyst was used to set up a network street layer for analysis. ArcGIS StreetMap Premium was installed to a local hard drive and selected in the Origin-Destination workflow as the network data source. This allows the origins (Census block centroids) and destinations (SafeGraph grocery stores) to be connected to that network, to allow origin-destination analysis. The Census blocks layer contains the centroid of each Census block. The data allows a simple popup to be created. This layer's block figures can be summarized further, to tract, county and state levels. The SafeGraph grocery store locations were provided by SafeGraph. The source data included NAICS code 445110 and 452311 as an initial screening. The CSV file was imported using the Data Interoperability geoprocessing tools in ArcGIS Pro, where a definition query was applied to the layer to exclude any records that were not grocery stores. The final layer used in the analysis had approximately 63,000 records. In this map, this layer is included as a vector tile layer.
Methodology
Every census block in the U.S. was assigned two access scores, whose numbers are simply how many grocery stores are within a 10 minute walk and a 10 minute drive of that census block. Every census block has a score of 0 (no stores), 1, 2 or more stores. The count of accessible stores was determined using Origin-Destination Analysis in ArcGIS Network Analyst, in ArcGIS Pro. A set of Tools in this ArcGIS Pro package allow a similar analysis to be conducted for any city or other area. The Tools step through the data prep and analysis steps. Download the Pro package, open it and substitute your own layers for Origins and Destinations. Parcel centroids are a suggested option for Origins, for example.
Origin-Destination analysis was configured, using ArcGIS StreetMap Premium as the network data source. Census block centroids with population greater than zero were used as the Origins, and grocery store locations were used as the Destinations. A cutoff of 10 minutes was used with the Walk Time option. Only one restriction was applied to the street network: Walkable, which means Interstates and other non-walkable street segments were treated appropriately. You see the results in the map: wherever freeway overpasses and underpasses are present near a grocery store, the walkable area extends across/through that pass, but not along the freeway. A cutoff of 10 minutes was used with the Drive Time option. The default restrictions were applied to the street network, which means a typical vehicle's access to all types of roads was factored in. The results for each analysis were captured in a Lines layer, which shows which origins are within the 10 minute cutoff of each destination over the street network, given the assumptions about that network (walking, or driving a vehicle). The Lines layer is not published but is used to count how many stores each origin has access to over the road network. The Lines layer was then summarized by census block ID to capture the Maximum value of the Destination_Rank field. A census block within 10 minutes of 3 stores would have 3 records in the Lines layer, but only one value in the summarized table, with a MAX_Destination_Rank field value of 3. This is the number of stores accessible to that census block in the 10 minutes measured, for walking and driving. These data were joined to the block centroids layer and given unique names. At this point, all blocks with zero population or null values in the MAX_Destination_Rank fields were given a store count of 0, to help the next step. Walkable and Drivable areas are calculated into a raster layer, using the Nearest Neighbor geoprocessing tool on the count of stores within a 10 minute walk, and a count of stores within a ten minute drive, respectively. This tool used a 100 meter grid and interpolates the values between each census block. A census tracts layer containing all water polygons "erased" from the census tract boundaries was used as an environment setting, to help constrain interpolation into/across bodies of water. The same layer was used to "shoreline" the Nearest Neighbor results, to eliminate any interpolation into the ocean or Great Lakes. This helped but was not perfect.
Notes and Limitations
The map provides a baseline for discussing access to grocery stores in a city. It does not presume local population has the desire or means to walk or drive to obtain groceries. It does not take elevation gain or loss into account. It does not factor time of day nor weather, seasons, or other variables that affect a person's commute choices. Walking and driving are just two ways people get to a grocery store. Some people ride a bike, others take public transit, have groceries delivered, or rely on a friend with a vehicle.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please refer to the original data article for further data description: Jan Luxemburk et al. CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines, Data in Brief, 2023, 108888, ISSN 2352-3409, https://doi.org/10.1016/j.dib.2023.108888.
We recommend using the CESNET DataZoo python library, which facilitates the work with large network traffic datasets. More information about the DataZoo project can be found in the GitHub repository https://github.com/CESNET/cesnet-datazoo.
The QUIC (Quick UDP Internet Connection) protocol has the potential to replace TLS over TCP, which is the standard choice for reliable and secure Internet communication. Due to its design that makes the inspection of QUIC handshakes challenging and its usage in HTTP/3, there is an increasing demand for research in QUIC traffic analysis. This dataset contains one month of QUIC traffic collected in an ISP backbone network, which connects 500 large institutions and serves around half a million people. The data are delivered as enriched flows that can be useful for various network monitoring tasks. The provided server names and packet-level information allow research in the encrypted traffic classification area. Moreover, included QUIC versions and user agents (smartphone, web browser, and operating system identifiers) provide information for large-scale QUIC deployment studies.
Data capture
The data was captured in the flow monitoring infrastructure of the CESNET2 network. The capturing was done for four weeks between 31.10.2022 and 27.11.2022. The following list provides per-week flow count, capture period, and uncompressed size:
W-2022-44: Uncompressed Size: 19 GB; Capture Period: 31.10.2022 - 6.11.2022; Number of flows: 32.6M
W-2022-45: Uncompressed Size: 25 GB; Capture Period: 7.11.2022 - 13.11.2022; Number of flows: 42.6M
W-2022-46: Uncompressed Size: 20 GB; Capture Period: 14.11.2022 - 20.11.2022; Number of flows: 33.7M
W-2022-47: Uncompressed Size: 25 GB; Capture Period: 21.11.2022 - 27.11.2022; Number of flows: 44.1M
CESNET-QUIC22 (total): Uncompressed Size: 89 GB; Capture Period: 31.10.2022 - 27.11.2022; Number of flows: 153M
Data description
The dataset consists of network flows describing encrypted QUIC communications. Flows were created using the ipfixprobe flow exporter and are extended with packet metadata sequences, packet histograms, and with fields extracted from the QUIC Initial Packet, which is the first packet of the QUIC connection handshake. The extracted handshake fields are the Server Name Indication (SNI) domain, the used version of the QUIC protocol, and the user agent string that is available in a subset of QUIC communications.
Packet Sequences
Flows in the dataset are extended with sequences of packet sizes, directions, and inter-packet times. For the packet sizes, we consider payload size after transport headers (UDP headers for the QUIC case). Packet directions are encoded as ±1, +1 meaning a packet sent from client to server, and -1 a packet from server to client. Inter-packet times depend on the location of communicating hosts, their distance, and on the network conditions on the path. However, it is still possible to extract relevant information that correlates with user interactions and, for example, with the time required for an API/server/database to process the received data and generate the response to be sent in the next packet. Packet metadata sequences have a length of 30, which is the default setting of the used flow exporter. We also derive three fields from each packet sequence: its length, time duration, and the number of roundtrips. The roundtrips are counted as the number of changes in the communication direction (from packet directions data); in other words, each client request and server response pair counts as one roundtrip.
Flow statistics
Flows also include standard flow statistics, which represent aggregated information about the entire bidirectional flow. The fields are: the number of transmitted bytes and packets in both directions, the duration of flow, and packet histograms. Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow in both directions (more information in the PHISTS plugin documentation). There are eight bins with a logarithmic scale; the intervals are 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. Moreover, each flow has its end reason - either it was idle, reached the active timeout, or ended due to other reasons. This corresponds with the official IANA IPFIX-specified values. The FLOW_ENDREASON_OTHER field represents the forced end and lack of resources reasons. The end of flow detected reason is not considered because it is not relevant for UDP connections.
Dataset structure
The dataset flows are delivered in compressed CSV files. CSV files contain one flow per row; data columns are summarized in the provided list below. For each flow data file, there is a JSON file with the number of saved and seen (before sampling) flows per service and total counts of all received (observed on the CESNET2 network), service (belonging to one of the dataset's services), and saved (provided in the dataset) flows. There is also the stats-week.json file aggregating flow counts of a whole week and the stats-dataset.json file aggregating flow counts for the entire dataset. Flow counts before sampling can be used to compute sampling ratios of individual services and to resample the dataset back to the original service distribution.
Moreover, various dataset statistics, such as feature distributions and value counts of QUIC versions and user agents, are provided in the dataset-statistics folder. The mapping between services and service providers is provided in the servicemap.csv file, which also includes SNI domains used for ground truth labeling. The following list describes flow data fields in CSV files:
ID: Unique identifier
SRC_IP: Source IP address
DST_IP: Destination IP address
DST_ASN: Destination Autonomous System number
SRC_PORT: Source port
DST_PORT: Destination port
PROTOCOL: Transport protocol
QUIC_VERSION: QUIC protocol version
QUIC_SNI: Server Name Indication domain
QUIC_USER_AGENT: User agent string, if available in the QUIC Initial Packet
TIME_FIRST: Timestamp of the first packet in format YYYY-MM-DDTHH-MM-SS.ffffff
TIME_LAST: Timestamp of the last packet in format YYYY-MM-DDTHH-MM-SS.ffffff
DURATION: Duration of the flow in seconds
BYTES: Number of transmitted bytes from client to server
BYTES_REV: Number of transmitted bytes from server to client
PACKETS: Number of packets transmitted from client to server
PACKETS_REV: Number of packets transmitted from server to client
PPI: Packet metadata sequence in the format: [[inter-packet times], [packet directions], [packet sizes]] (see the parsing sketch after this list)
PPI_LEN: Number of packets in the PPI sequence
PPI_DURATION: Duration of the PPI sequence in seconds
PPI_ROUNDTRIPS: Number of roundtrips in the PPI sequence
PHIST_SRC_SIZES: Histogram of packet sizes from client to server
PHIST_DST_SIZES: Histogram of packet sizes from server to client
PHIST_SRC_IPT: Histogram of inter-packet times from client to server
PHIST_DST_IPT: Histogram of inter-packet times from server to client
APP: Web service label
CATEGORY: Service category
FLOW_ENDREASON_IDLE: Flow was terminated because it was idle
FLOW_ENDREASON_ACTIVE: Flow was terminated because it reached the active timeout
FLOW_ENDREASON_OTHER: Flow was terminated for other reasons
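A small sketch of parsing the PPI column and recounting direction changes (the CSV file name is a placeholder; the CESNET DataZoo library mentioned above handles this parsing in practice, so this is only an illustration of the format):

```python
# Sketch: parse the PPI column ([[inter-packet times], [directions], [sizes]]) and
# count direction changes, one plausible reading of the roundtrip definition above.
# The file name is a placeholder; CESNET DataZoo automates this in practice.
import ast
import pandas as pd

flows = pd.read_csv("quic_flows_sample.csv")  # hypothetical CSV chunk from one week

def direction_changes(ppi_str: str) -> int:
    inter_packet_times, directions, sizes = ast.literal_eval(ppi_str)
    return sum(1 for a, b in zip(directions, directions[1:]) if a != b)

flows["direction_changes"] = flows["PPI"].apply(direction_changes)
print(flows[["APP", "PPI_LEN", "PPI_ROUNDTRIPS", "direction_changes"]].head())
```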
Link to other CESNET datasets
https://www.liberouter.org/technology-v2/tools-services-datasets/datasets/
https://github.com/CESNET/cesnet-datazoo
Please cite the original data article:
@article{CESNETQUIC22,
  author  = {Jan Luxemburk and Karel Hynek and Tomáš Čejka and Andrej Lukačovič and Pavel Šiška},
  title   = {CESNET-QUIC22: a large one-month QUIC network traffic dataset from backbone lines},
  journal = {Data in Brief},
  pages   = {108888},
  year    = {2023},
  issn    = {2352-3409},
  doi     = {https://doi.org/10.1016/j.dib.2023.108888},
  url     = {https://www.sciencedirect.com/science/article/pii/S2352340923000069}
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To cite the dataset, please reference it as: “Stratosphere Laboratory. A labeled dataset with malicious and benign IoT network traffic. January 22nd. Agustin Parmisano, Sebastian Garcia, Maria Jose Erquiaga. https://www.stratosphereips.org/datasets-iot23”.
This dataset includes labels that describe the relationships between flows associated with malicious or potentially malicious activity, giving network malware researchers and analysts more thorough information. These labels were carefully created at the Stratosphere labs through analysis of the malware captures.
We present a concise explanation of the labels used for the identification of malicious flows, based on manual network analysis, below:
Attack: This label signifies the occurrence of an attack originating from an infected device directed towards another host. Any flow that endeavors to exploit a vulnerable service, discerned through payload and behavioral analysis, falls under this classification. Examples include brute force attempts on telnet logins or header-based command injections in GET requests.
Benign: The "Benign" label denotes connections where no suspicious or malicious activities have been detected.
C&C (Command and Control): This label indicates that the infected device has established a connection with a Command and Control server. This observation is rooted in the periodic nature of connections or activities such as binary downloads or the exchange of IRC-like or decoded commands.
DDoS (Distributed Denial of Service): "DDoS" is assigned when the infected device is actively involved in a Distributed Denial of Service attack, identifiable by the volume of flows directed towards a single IP address.
FileDownload: This label signifies that a file is being downloaded to the infected device. It is determined by examining connections with response bytes exceeding a specified threshold (typically 3KB or 5KB), often in conjunction with known suspicious destination ports or IPs associated with Command and Control servers.
HeartBeat: "HeartBeat" designates connections where packets serve the purpose of tracking the infected host by the Command and Control server. Such connections are identified through response bytes below a certain threshold (typically 1B) and exhibit periodic similarities. This is often associated with known suspicious destination ports or IPs linked to Command and Control servers.
Mirai: This label is applied when connections exhibit characteristics resembling those of the Mirai botnet, based on patterns consistent with common Mirai attack profiles.
Okiru: Similar to "Mirai," the "Okiru" label is assigned to connections displaying characteristics of the Okiru botnet. The parameters for this label are the same as for Mirai, but Okiru is a less prevalent botnet family.
PartOfAHorizontalPortScan: This label is employed when connections are involved in a horizontal port scan aimed at gathering information for potential subsequent attacks. The labeling decision hinges on patterns such as shared ports, similar transmitted byte counts, and multiple distinct destination IPs among the connections (a heuristic sketch follows this label list).
Torii: The "Torii" label is used when connections exhibit traits indicative of the Torii botnet, with labeling criteria similar to those used for Mirai, albeit in the context of a less common botnet family.
Field Name | Description | Type |
---|---|---|
ts | The timestamp of the connection event. | time |
uid | A unique identifier for the connection. | string |
id.orig_h | The source IP address. | addr |
id.orig_p | The source port. | port |
id.resp_h | The destination IP address. | addr |
id.resp_p | The destination port. | port |
proto | The network protocol used (e.g., 'tcp'). | enum |
service | The service associated with the connection. | string |
duration | The duration of the connection. | interval |
orig_bytes | The number of bytes sent from the source to the destination. | count |
resp_bytes | The number of bytes sent from the destination to the source. | count |
conn_state | The state of the connection. | string |
local_orig | Indicates whether the connection is considered local or not. | bool |
local_resp | Indicates whether the connection is considered... |
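To make the field table above concrete, here is a minimal loading sketch. It assumes standard tab-separated Zeek conn.log output with '#'-prefixed metadata lines and takes the column names from the file's own '#fields' header; the filename is hypothetical, and if the labeled files deviate from plain tab separation the parsing will need adjusting.

```python
# Minimal sketch: load a Zeek conn.log-style file into pandas.
import pandas as pd

def load_conn_log(path):
    """Load a tab-separated Zeek conn.log, taking column names from its '#fields' line."""
    with open(path) as fh:
        fields = None
        for line in fh:
            if line.startswith("#fields"):
                fields = line.rstrip("\n").split("\t")[1:]
                break
    if fields is None:
        raise ValueError("no '#fields' header found; is this a Zeek log?")
    return pd.read_csv(path, sep="\t", comment="#", header=None,
                       names=fields, na_values="-", low_memory=False)

df = load_conn_log("conn.log.labeled")   # hypothetical filename
print(df["proto"].value_counts())
```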
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here, we provide the research community with a data set of the buffering delays that data packets experience at the TCP sending side in the realm of Cyber-Physical Systems (CPSs). We focus on the buffering that occurs at the sender side due to the adverse interaction between the Nagle algorithm and the delayed acknowledgement algorithm, both of which were originally introduced into TCP to prevent sending many small packets over the network.
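For readers unfamiliar with the mechanism, the sketch below (not part of the dataset) shows the usual application-level mitigation: disabling Nagle's algorithm with the TCP_NODELAY socket option so that small writes are sent immediately instead of being held back while waiting for an acknowledgement. The endpoint is a placeholder.

```python
# Illustrative only: disable Nagle's algorithm on a TCP socket via TCP_NODELAY.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # send small writes immediately
sock.connect(("127.0.0.1", 9000))        # placeholder endpoint
sock.sendall(b"small CPS update\n")      # no longer buffered by Nagle's algorithm
sock.close()
```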
The data set is collected using four real-life operating systems: Windows, Linux, FreeBSD, and QNX (a real-time operating system). In each scenario, there are three separate (virtual) machines running different operating systems. One machine, an end host, acts as the data source, another acts as the data sink, and a third acts as a network emulator that introduces artificial propagation delays between the source and the destination.
To measure the buffering delay at the sender side, we record two time instants for each sent packet: when the packet is first generated at the application layer, and when it is actually sent on the physical network. In each case, 10 independent experiment replications/runs are executed.
Here, we provide the full distribution of all delay samples represented by the cumulative distribution function (CDF).
The data presented here gives an impression of the amount and scale of the delay occurring at the sender side in TCP. More importantly, the data can be used to investigate to what degree these delays affect the performance of cyber-physical systems or other real-time applications that employ TCP.
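As a hedged sketch of the measurement described above, the snippet below computes per-packet buffering delay as the wire-send timestamp minus the application-generation timestamp and then builds an empirical CDF. The variable names and the tiny example timestamps are hypothetical; the dataset ships the real CDFs.

```python
# Sketch: per-packet sender-side buffering delay and its empirical CDF.
import numpy as np

def empirical_cdf(samples):
    """Return (sorted_values, cumulative_probabilities) for the sample set."""
    x = np.sort(np.asarray(samples, dtype=float))
    p = np.arange(1, len(x) + 1) / len(x)
    return x, p

t_app = np.array([0.000, 0.010, 0.020, 0.030])   # application-layer timestamps (s), made up
t_wire = np.array([0.001, 0.052, 0.021, 0.075])  # physical-send timestamps (s), made up
delays = t_wire - t_app                          # buffering delay per packet

x, p = empirical_cdf(delays)
for value, prob in zip(x, p):
    print(f"P(delay <= {value * 1e3:.1f} ms) = {prob:.2f}")
```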
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In a city-scale network, trips are made in thousands of origin-destination (OD) pairs connected by multiple routes, resulting in a large number of alternatives with diverse characteristics that influence the route choice behaviour of the travellers. As a consequence, to accurately predict user choices at full network scale, a route choice model should be scalable to suit all possible configurations that may be encountered. In this article, a new methodology to obtain such a model is proposed. The main idea is to use clustering analysis to obtain a small set of representative OD pairs and routes that can be investigated in detail through computer route choice experiments to collect observations on travellers' behaviour. The results are then scaled up to all other OD pairs in the network. It was found that 9 OD pair configurations are sufficient to represent the network of Lyon, France, composed of 96,096 OD pairs and 559,423 routes. The observations, collected over these nine representative OD pair configurations, were used to estimate three mixed logit models. The predictive accuracy of the three models was tested against the predictive accuracy of the same models (with the same specification) estimated over randomly selected OD pair configurations. The results show that the models estimated with the representative OD pairs achieve superior predictive accuracy, which supports scaling up the participants' choices over the representative OD pair configurations to the entire network and validates the methodology proposed in this study.
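The sketch below illustrates only the general clustering idea described above: describe each OD pair by a few numeric features, cluster into 9 groups, and keep the member closest to each cluster centre as a representative configuration. The features and data are placeholders, not the study's actual variables or procedure.

```python
# Hypothetical sketch of selecting representative OD pairs via clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# columns: number of alternative routes, mean route length, length spread (all made up)
od_features = rng.random((1000, 3))

X = StandardScaler().fit_transform(od_features)
kmeans = KMeans(n_clusters=9, n_init=10, random_state=0).fit(X)

representatives = []
for k in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == k)[0]
    dists = np.linalg.norm(X[members] - kmeans.cluster_centers_[k], axis=1)
    representatives.append(members[np.argmin(dists)])  # OD pair nearest the centroid
print(representatives)  # indices of the 9 representative OD pairs
```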
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General Overview:
This dataset was synthetically generated to support research on crowd management, predictive risk analysis, and smart path optimization, specifically for the scenario of the Arbaeen pilgrimage in Iraq. The data simulates various environmental, behavioral, and geographical factors to create a realistic environment for testing and validating crowd analysis models. The primary goal of this dataset is to provide a basis for developing and evaluating intelligent systems that can enhance the safety and efficiency of crowd movements during large-scale pilgrimages. This dataset was specifically generated to support the findings of our research paper, titled "Predictive Risk Analysis for the Arbaeen Pilgrimage Crowds" which has been submitted for publication.
Methodology:
The dataset was created using a custom Python script that employs a hierarchical generation strategy:
Base Network: Predefined, real-world routes (such as the traditional Baghdad-Karbala path) were established as a guaranteed core.
Local Network: A dense, realistic local network was built by systematically connecting each geographical node (both real cities and synthetic points) to its nearest neighbors (a minimal sketch of this step follows this list).
Highway Network: A higher-level network was constructed by connecting only the major, real-world cities to each other, simulating main travel arteries.
Data Attributes: For each area, attributes such as visitor count, pressure, weather, and the presence of barriers or events were generated based on a set of rules to simulate realistic conditions. The final Risk_Degree for each area is a calculated metric based on these attributes.
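Below is a minimal sketch of the "Local Network" step referenced above: link every node to its k nearest neighbours by straight-line distance in coordinate space. The coordinates, the value of k, and the distance approximation are placeholders; the published generator may use different parameters.

```python
# Sketch: build a k-nearest-neighbour local network over geographical nodes.
import numpy as np
from scipy.spatial import cKDTree

def build_local_network(coords, k=3):
    """Return undirected edges (i, j) linking each node to its k nearest neighbours."""
    tree = cKDTree(coords)
    # query k+1 neighbours because the nearest hit of each point is itself
    _, idx = tree.query(coords, k=k + 1)
    edges = set()
    for i, neighbours in enumerate(idx):
        for j in neighbours[1:]:
            edges.add((min(i, j), max(i, j)))
    return edges

coords = np.array([[32.0, 44.4], [32.6, 44.0], [33.3, 44.4], [32.3, 43.9]])  # lat/lon placeholders
print(build_local_network(coords, k=2))
```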
Dataset Contents:
The dataset is provided as a single, comprehensive CSV file: artificial generated dataset for crowd.csv.
This file is ready for direct use and contains all the necessary data used in our study, fully merged into one table. It consists of 5,000 unique records, where each record represents a connection between two areas ("from" and "to"). Each row includes the following detailed information for both the origin and destination points:
Route Information: The specific from_area and to_area for each path segment.
Area Attributes: Key metrics such as Visitors, Pressure, Speed, and environmental factors like Weather and Event.
Calculated Risk Metrics: The final Risk_Degree and Actual_Behavior classification for each area.
Geographical Coordinates: The Latitude and Longitude for each area.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset has been meticulously prepared and utilized as a validation set during the evaluation phase of "Meta IDS" to assess the performance of various machine learning models. It is now made available for interested users and researchers who seek a reliable and diverse dataset for training and testing their own custom models. The validation dataset comprises a comprehensive collection of labeled entries, each of which indicates whether a packet is "malicious" or "benign." It covers complex design patterns that are commonly encountered in real-world applications. The dataset is designed to be representative, encompassing edge and fog layers that are in contact with the cloud layer, thereby enabling thorough testing and evaluation of different models. Each sample in the dataset is labeled with the corresponding ground truth, providing a reliable reference for model performance evaluation.
To ensure convenient distribution and storage, the dataset has been broken down into three separate batches, each containing a portion of the dataset. The three batches are provided as individual compressed files. To extract the data, follow these instructions:
1. Download and install bzip2 (if not already installed) from the official website or your package manager.
2. Place the compressed dataset file in a directory of your choice.
3. Open a terminal or command prompt and navigate to the directory where the compressed dataset file is located.
4. Execute the following command to uncompress the dataset: bzip2 -d filename.bz2 (replace "filename.bz2" with the actual name of the compressed dataset file).
Once uncompressed, you will have access to the dataset in its original format for further exploration, analysis, and model training. The total storage required for extraction is approximately 800 GB: the first batch requires approximately 302 GB, the second batch approximately 203 GB, and the third batch approximately 297 GB. The first batch contains 1,049,527,992 entries, the second batch contains 711,043,331 entries, and the third batch contains 1,029,303,062 entries. The following table provides the feature names along with their explanations and an example value once the dataset is extracted.
Feature | Description | Example Value |
---|---|---|
ip.src | Source IP address in the packet | a05d4ecc38da01406c9635ec694917e969622160e728495e3169f62822444e17 |
ip.dst | Destination IP address in the packet | a52db0d87623d8a25d0db324d74f0900deb5ca4ec8ad9f346114db134e040ec5 |
frame.time_epoch | Epoch time of the frame | 1676165569.930869 |
arp.hw.type | Hardware type | 1 |
arp.hw.size | Hardware size | 6 |
arp.proto.size | Protocol size | 4 |
arp.opcode | Opcode | 2 |
data.len | Length | 2713 |
eth.dst.lg | Destination LG bit | 1 |
eth.dst.ig | Destination IG bit | 1 |
eth.src.lg | Source LG bit | 1 |
eth.src.ig | Source IG bit | 1 |
frame.offset_shift | Time shift for this packet | 0 |
frame.len | Frame length on the wire | 1208 |
frame.cap_len | Frame length stored into the capture file | 215 |
frame.marked | Frame is marked | 0 |
frame.ignored | Frame is ignored | 0 |
frame.encap_type | Encapsulation type | 1 |
gre | Generic Routing Encapsulation | 'Generic Routing Encapsulation (IP)' |
ip.version | Version | 6 |
ip.hdr_len | Header length | 24 |
ip.dsfield.dscp | Differentiated Services Codepoint | 56 |
ip.dsfield.ecn | Explicit Congestion Notification | 2 |
ip.len | Total length | 614 |
ip.flags.rb | Reserved bit | 0 |
ip.flags.df | Don't fragment | 1 |
ip.flags.mf | More fragments | 0 |
ip.frag_offset | Fragment offset | 0 |
ip.ttl | Time to live | 31 |
ip.proto | Protocol | 47 |
ip.checksum.status | Header checksum status | 2 |
tcp.srcport | TCP source port | 53425 |
tcp.flags | Flags | 0x00000098 |
tcp.flags.ns | Nonce | 0 |
tcp.flags.cwr | Congestion Window Reduced (CWR) | 1 |
udp.srcport | UDP source port | 64413 |
udp.dstport | UDP destination port | 54087 |
udp.stream | Stream index | 1345 |
udp.length | Length | 225 |
udp.checksum.status | Checksum status | 3 |
packet_type | Type of the packet, either "benign" or "malicious" | 0 |
Furthermore, in compliance with the GDPR and to ensure the privacy of individuals, all IP addresses present in the dataset have been anonymized through hashing. This anonymization process helps protect the identity of individuals while preserving the integrity and utility of the dataset for research and model development purposes.
Please note that while the dataset provides valuable insights and a solid foundation for machine learning tasks, it is not a substitute for extensive real-world data collection. However, it serves as a valuable resource for researchers, practitioners, and enthusiasts in the machine learning community, offering a compliant and anonymized dataset for developing and validating custom models in a specific problem domain. By leveraging the validation dataset for machine learning model evaluation and custom model training, users can accelerate their research and development efforts, building upon the knowledge gained from my thesis while contributing to the advancement of the field.
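Given the batch sizes listed above, loading a whole batch into memory is usually impractical. The sketch below streams an extracted batch through pandas in chunks and tallies labels. The filename, the chunk size, the selected columns, the assumption that the extracted files are CSV-like, and the 0/1 label encoding are all hypothetical; adjust them to the actual format after running bzip2 -d.

```python
# Sketch: stream a large extracted batch in chunks instead of loading it whole.
import pandas as pd

USECOLS = ["frame.time_epoch", "ip.src", "ip.dst", "ip.len", "ip.proto", "packet_type"]

benign = malicious = 0
for chunk in pd.read_csv("batch1.csv", usecols=USECOLS, chunksize=1_000_000):
    counts = chunk["packet_type"].value_counts()
    benign += counts.get(0, 0)      # assumed encoding: 0 = benign (example value above)
    malicious += counts.get(1, 0)   # assumed encoding: 1 = malicious
print(benign, malicious)
```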