Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Network traffic datasets created by Single Flow Time Series Analysis
Datasets were created for the paper: Network Traffic Classification based on Single Flow Time Series Analysis -- Josef Koumar, Karel Hynek, Tomáš Čejka -- which was published at The 19th International Conference on Network and Service Management (CNSM) 2023. Please cite usage of our datasets as:
J. Koumar, K. Hynek and T. Čejka, "Network Traffic Classification Based on Single Flow Time Series Analysis," 2023 19th International Conference on Network and Service Management (CNSM), Niagara Falls, ON, Canada, 2023, pp. 1-7, doi: 10.23919/CNSM59352.2023.10327876.
This Zenodo repository contains 23 datasets created from 15 well-known published datasets which are cited in the table below. Each dataset contains 69 features created by Time Series Analysis of Single Flow Time Series. The detailed description of features from datasets is in the file: feature_description.pdf
In the following table is a description of each dataset file:
File name | Detection problem | Citation of original raw dataset |
botnet_binary.csv | Binary detection of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
botnet_multiclass.csv | Multi-class classification of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
cryptomining_design.csv | Binary detection of cryptomining; the design part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
cryptomining_evaluation.csv | Binary detection of cryptomining; the evaluation part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
dns_malware.csv | Binary detection of malware DNS | Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021. |
doh_cic.csv | Binary detection of DoH |
Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020 |
doh_real_world.csv | Binary detection of DoH | Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022 |
dos.csv | Binary detection of DoS | Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019. |
edge_iiot_binary.csv | Binary detection of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
edge_iiot_multiclass.csv | Multi-class classification of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
https_brute_force.csv | Binary detection of HTTPS Brute Force | Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020 |
ids_cic_binary.csv | Binary detection of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018. |
ids_cic_multiclass.csv | Multi-class classification of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018. |
ids_unsw_nb_15_binary.csv | Binary detection of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015. |
ids_unsw_nb_15_multiclass.csv | Multi-class classification of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015. |
iot_23.csv | Binary detection of IoT malware | Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details here https://www.stratosphereips.org /datasets-iot23 |
ton_iot_binary.csv | Binary detection of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021 |
ton_iot_multiclass.csv | Multi-class classification of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021 |
tor_binary.csv | Binary detection of TOR | Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017. |
tor_multiclass.csv | Multi-class classification of TOR | Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017. |
vpn_iscx_binary.csv | Binary detection of VPN | Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016. |
vpn_iscx_multiclass.csv | Multi-class classification of VPN | Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016. |
vpn_vnat_binary.csv | Binary detection of VPN | Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022 |
vpn_vnat_multiclass.csv | Multi-class classification of VPN | Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022 |
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created by a LoRaWAN sniffer and contains packets, which are thoroughly analyzed in the paper Exploring LoRaWAN Traffic: In-Depth Analysis of IoT Network Communications. Data from the LoRaWAN sniffer was collected in four cities: Liege (Belgium), Graz (Austria), Vienna (Austria), and Brno (Czechia).
Gateway ID: b827ebafac000001
Uplink reception (end-device => gateway)
Only packets containing CRC, inverted IQ
RX0: 867.1 MHz, 867.3 MHz, 867.5 MHz, 867.7 MHz, 867.9 MHz - BW 125 kHz and all SF
RX1: 868.1 MHz, 868.3 MHz, 868.5 MHz - BW 125 kHz and all SF
Gateway ID: b827ebafac000002
Downlink reception (gateway => end-device)
Includes packets without CRC, non-inverted IQ
RX0: 867.1 MHz, 867.3 MHz, 867.5 MHz, 867.7 MHz, 867.9 MHz - BW 125 kHz and all SF
RX1: 868.1 MHz, 868.3 MHz, 868.5 MHz - BW 125 kHz and all SF
Gateway ID: b827ebafac000003
Downlink reception (gateway => end-device) and Class-B beacon on 869.525 MHz
Includes packets without CRC, non-inverted IQ
RX0: 869.525 MHz - BW 125 kHz and all SF, BW 125 kHz and SF9 with implicit header, CR 4/5 and length 17 B
To open the pcap files, you need Wireshark with current support for LoRaTap and LoRaWAN protocols. This support will be available in the official 4.1.0 release. A working version for Windows is accessible in the automated build system.
The source data is available in the log.zip file, which contains the complete dataset obtained by the sniffer. A set of conversion tools for log processing is available on Github. The converted logs, available in Wireshark format, are stored in pcap.zip. For the LoRaWAN decoder, you can use the attached root and session keys. The processed outputs are stored in csv.zip, and graphical statistics are available in png.zip.
This data represents a unique, geographically identifiable selection from the full log, cleaned of any errors. The records from Brno include communication between the gateway and a node with known keys.
Test file :: 00_Test
short test file for parser verification
comparison of LoRaTap version 0 and version 1 formats
Brno, Czech Republic :: 01_Brno
49.22685N, 16.57536E, ASL 306m
lines 150873 to 529796
time 1.8.2022 15:04:28 to 17.8.2022 13:05:32
preliminary experiment
experimental device
Device EUI: 70b3d5cee0000042
Application key: d494d49a7b4053302bdcf96f1defa65a
Device address: 00d85395
Network session key: c417540b8b2afad8930c82fcf7ea54bb
Application session key: 421fea9bedd2cc497f63303edf5adf8e
Liege, Belgium :: 02_Liege :: evaluated in the paper
50.66445N, 5.59276E, ASL 151m
lines 636205 to 886868
time 25.8.2022 10:12:24 to 12.9.2022 06:20:48
Brno, Czech Republic :: 03_Brno_join
49.22685N, 16.57536E, ASL 306m
lines 947787 to 979382
time 30.9.2022 15:21:27 to 4.10.2022 10:46:31
record contains OTAA activation (Join Request / Join Accept)
experimental device:
Device EUI: 70b3d5cee0000042
Application key: d494d49a7b4053302bdcf96f1defa65a
Device address: 01e65ddc
Network session key: e2898779a03de59e2317b149abf00238
Application session key: 59ca1ac91922887093bc7b236bd1b07f
Graz, Austria :: 04_Graz :: evaluated in the paper
47.07049N, 15.44506E, ASL 364m
lines 1015139 to 1178855
time 26.10.2022 06:21:07 to 29.11.2022 10:03:00
Vienna, Austria :: 05_Wien :: evaluated in the paper
48.19666N, 16.37101E, ASL 204m
lines 1179308 to 3657105
time 1.12.2022 10:42:19 to 4.1.2023 14:00:05
contains a total of 14 short restarts (under 90 seconds)
Brno, Czech Republic :: 07_Brno :: evaluated in the paper
49.22685N, 16.57536E, ASL 306m
lines 4969648 to 6919392
time 16.2.2023 8:53:43 to 30.3.2023 9:00:11
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
VLC Data: A Multi-Class Network Traffic Dataset Covering Diverse Applications and Platforms Valencia Data (VLC Data) is a network traffic dataset collected from various applications and platforms. It includes both encrypted and, when applicable, unencrypted protocols, capturing realistic usage scenarios and application-specific behavior. The dataset covers 18.5 hours, 58 pcapng files, and 24.26 GB, with traffic from: Video streaming: Netflix and Prime Video (10–50 min) via Firefox. Gaming: Roblox sessions on Windows (20–35 min), recorded outside of virtual machines, despite VM support. Video conferencing: Microsoft Teams (20 min) via Firefox. Web browsing: Wikipedia, BBC, Google, LinkedIn, Amazon, and OWIN6G (2–5 min) via Firefox or Chrome. Audio streaming: Spotify (30–33 min) on multiple OS. Web streaming: YouTube in 4K and Full HD (20–30 min). This dataset is publicly available for traffic analysis across different apps, protocols, and systems. Table Description: Type Applications Platform Time [min] Comments Filename Size (MB) Video Streaming Netflix Linux 10 Running Netflix on Firefox Browser netflix_linux_10m_01 95.1 Video Streaming Netflix Linux 20 Running Netflix on Firefox Browser netflix_linux_20m_01 167.7 Video Streaming Netflix Linux 20 Running Netflix on Firefox Browser netflix_linux_20m_02 237.9 Video Streaming Netflix Linux 20 Running Netflix on Firefox Browser netflix_linux_20m_03 212.6 Video Streaming Netflix Linux 25 Running Netflix on Firefox, but 2 min in Menu netflix_linux_25m_01 610.7 Video Streaming Netflix Linux 35 Running Netflix on Firefox, but 1 min in Menu netflix_linux_35m_01 534.8 Video Streaming Netflix Linux 50 Running Netflix on Firefox Browser netflix_linux_50m_01 660.9 Video Streaming Netflix Windows 10 Running Netflix on Firefox Browser netflix_windows_10m_01 132.1 Video Streaming Netflix Windows 20 Running Netflix on Firefox Browser netflix_windows_20m_01 506.4 Video Streaming Prime Video Linux 20 Running Prime Video on Firefox Browser prime_linux_20m_01 767.3 Video Streaming Prime Video Linux 20 Running Prime Video on Firefox Browser prime_linux_20m_02 569.3 Video Streaming Prime Video Windows 20 Running Prime Video on Firefox Browser prime_windows_20m_01 512.3 Video Streaming Prime Video Windows 20 Running Prime Video on Firefox Browser prime_windows_20m_02 364.2 Gaming Roblox Windows 20 Doesn't run in VM roblox_windows_20m_01 127.5 Gaming Roblox Windows 20 Doesn't run in VM roblox_windows_20m_02 378.5 Gaming Roblox Windows 20 Doesn't run in VM roblox_windows_20m_03 458.9 Gaming Roblox Windows 30 Doesn't run in VM roblox_windows_30m_01 519.8 Gaming Roblox Windows 30 Doesn't run in VM roblox_windows_30m_02 357.3 Gaming Roblox Windows 35 Doesn't run in VM roblox_windows_35m_01 880.4 Audio Streaming Spotify Linux 30 Running Spotify app on Ubuntu-Linux spotify_linux_30m_01 98.2 Audio Streaming Spotify Linux 30 Running Spotify app on Ubuntu-Linux spotify_linux_30m_02 112.2 Audio Streaming Spotify Linux 30 Running Spotify app on Ubuntu-Linux spotify_linux_30m_03 175.5 Audio Streaming Spotify Windows 30 Running Spotify app on Windows spotify_windows_30m_01 50.7 Audio Streaming Spotify Windows 30 Doesn't run in VM spotify_windows_30m_02 63.2 Audio Streaming Spotify Windows 33 Running Spotify app on Windows spotify_windows_33m_01 70.9 Video Conferencing Teams Linux 20 Running Teams on Firefox Browser teams_linux_20m_01 134.6 Video Conferencing Teams Linux 20 Running Teams on Firefox Browser teams_linux_20m_02 343.3 Video Conferencing Teams Linux 20 Running Teams on Firefox Browser teams_linux_20m_03 376.6 Video Conferencing Teams Windows 20 Running Teams on Firefox Browser teams_windows_20m_01 634.1 Video Conferencing Teams Windows 20 Running Teams on Firefox Browser teams_windows_20m_02 517.8 Video Conferencing Teams Windows 20 Running Teams on Firefox Browser teams_windows_20m_03 629.9 Web Browsing Web Linux 2 OWIN6G website on Firefox Browser web_linux_2m_owin6g 1.2 Web Browsing Web Linux 2 Wikipedia website on Firefox Browser web_linux_2m_wikipedia 19.7 Web Browsing Web Linux 3 OWIN6G website on Firefox Browser web_linux_3m_owin6g 4.5 Web Browsing Web Linux 3 Wikipedia website on Firefox Browser web_linux_3m_wikipedia 23.5 Web Browsing Web Linux 5 Amazon website on Chrome Browser web_linux_5m_amazon 262.9 Web Browsing Web Linux 5 BBC website on Firefox Browser web_linux_5m_bbc 55.7 Web Browsing Web Linux 5 Google website on Firefox Browser web_linux_5m_google 22.6 Web Browsing Web Linux 5 Linkedin website on Firefox Browser web_linux_5m_linkedin 39.8 Web Browsing Web Windows 3 OWIN6G website on Firefox Browser web_windows_3m_owin6g 32.6 Web Browsing Web
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DESCRIPTION OF THE RESEARCH AND DATA: This work presents the Madrid Traffic Dataset (MTD), a comprehensive resource for the analysis and modeling of traffic patterns in Madrid. The dataset integrates data from traffic sensors, weather observations, calendar information, road infrastructure, and geolocation data to support advanced studies of urban mobility and predictive modeling.
In addition to the core data sources, the dataset includes temporal sequences and a traffic adjacency matrix, enabling the application of time-series analysis and graph-based modeling approaches.
-COMPLETE DATASET: The complete version of the MTD includes data from 554 traffic sensors distributed across the Madrid region, covering a total of 30 months (from June 2022 to November 2024).
-SUBSET DATASET: A more compact version derived from the complete dataset, focused on a subset of 300 traffic sensors with 17 months of data (from June 2022 to October 2023). This subset is designed for researchers requiring a lighter dataset.
DATA ORGANIZATION: The dataset is organized in a main directory containing a subfolder identified by the configuration data hash. This subfolder includes all key components: datasets, temporal sequences, adjacency matrices, and configuration files. The structure ensures that all resources are clearly arranged to facilitate easy access and reproducibility for researchers.
For more details, see [Submitted to IEEE Internet of the Things Journal].
Split:
application: 38 samples
Class Distribution:
car (ID: 3): 67 (46.2%) motorcycle (ID: 4): 14 (9.7%) airplane (ID: 5): 62 (42.8%) truck (ID: 8): 2 (1.4%)
Annotation Files:
Latest: application/application_labels.json Timestamped: application/application_labels_20250309_212205.json
Split:
application: 49 samples
Class Distribution:
car (ID: 3): 120 (60.0%) motorcycle (ID: 4): 14 (7.0%) airplane (ID: 5): 62 (31.0%) truck (ID: 8):… See the full description on the dataset page: https://huggingface.co/datasets/slliac/isom5240-td-application-traffic-analysis.
The 2006 Second Edition TIGER/Line files are an extract of selected geographic and cartographic information from the Census TIGER database. The geographic coverage for a single TIGER/Line file is a county or statistical equivalent entity, with the coverage area based on the latest available governmental unit boundaries. The Census TIGER database represents a seamless national file with no overlaps or gaps between parts. However, each county-based TIGER/Line file is designed to stand alone as an independent data set or the files can be combined to cover the whole Nation. The 2006 Second Edition TIGER/Line files consist of line segments representing physical features and governmental and statistical boundaries. This shapefile represents the current Traffic Analysis Zones for Sandoval County stored in the 2006 TIGER Second Edition dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data presented here was collected in a network section from Universidad Del Cauca, Popayán, Colombia by performing packet captures at different hours, during morning and afternoon, over six days (April 26, 27, 28 and May 9, 11 and 15) of 2017. A total of 3.577.296 instances were collected and are currently stored in a CSV (Comma Separated Values) file.
This dataset contains 87 features. Each instance holds the information of an IP flow generated by a network device i.e., source and destination IP addresses, ports, interarrival times, layer 7 protocol (application) used on that flow as the class, among others. Most of the attributes are numeric type but there are also nominal types and a date type due to the Timestamp.
The flow statistics (IP addresses, ports, inter-arrival times, etc) were obtained using CICFlowmeter (http://www.unb.ca/cic/research/applications.html - https://github.com/ISCX/CICFlowMeter). The application layer protocol was obtained by performing a DPI (Deep Packet Inspection) processing on the flows with ntopng (https://www.ntop.org/products/traffic-analysis/ntop/ - https://github.com/ntop/ntopng).
For further information and if you find this dataset useful, please read and cite the following papers:
Springer: https://link.springer.com/chapter/10.1007/978-3-319-95168-3_37
IEEExplore https://ieeexplore.ieee.org/document/8845576
Research Gate: https://www.researchgate.net/publication/345990587_Smart_User_Consumption_Profiling_Incremental_Learning-based_OTT_Service_Degradation
IEEExpore https://ieeexplore.ieee.org/document/9258898
I would like to thank Universidad Del Cauca for supporting the research that generated this dataset and Colciencias for my PhD scholarship.
Considering that most of the network traffic classification datasets are aimed only at identifying the type of application an IP flow holds (WWW, DNS, FTP, P2P, Telnet,etc), this dataset goes a step further by generating machine learning models capable of detecting specific applications such as Facebook, YouTube, Instagram, etc, from IP flow statistics (currently 75 applications).
Here are a few use cases for this project:
Traffic Monitoring Systems: The model could be leveraged by city planning departments or traffic control centers to automatically identify and monitor different types of traffic on roads in real-time. This could assist in efficient traffic management, congestion detection, and traffic light timing adjustment.
Autonomous Vehicles: Companies developing self-driving cars or drones could utilize the model to improve their vehicle's ability to recognize different types of vehicles on the road, ensuring safer navigation.
Security and Surveillance: The model could be used in CCTV camera systems to detect, classify, and track vehicles around sensitive areas like government buildings, airports, or high-security areas for security enhancement and crime prevention.
Traffic Analysis for Urban Planning: Urban planners and researchers can use the model to study traffic patterns based on vehicle type over time, informing future infrastructure and transportation planning.
Enhanced Vehicle-based Augmented Reality (AR): Game developers or AR app creators who focus on city or traffic scenarios can use the model to enhance their system's ability to accurately detect and interact with real-world vehicles, promoting a more immersive experience for users.
Dataset for the textbook Computational Methods and GIS Applications in Social Science (3rd Edition), 2023 Fahui Wang, Lingbo Liu Main Book Citation: Wang, F., & Liu, L. (2023). Computational Methods and GIS Applications in Social Science (3rd ed.). CRC Press. https://doi.org/10.1201/9781003292302 KNIME Lab Manual Citation: Liu, L., & Wang, F. (2023). Computational Methods and GIS Applications in Social Science - Lab Manual. CRC Press. https://doi.org/10.1201/9781003304357 KNIME Hub Dataset and Workflow for Computational Methods and GIS Applications in Social Science-Lab Manual Update Log If Python package not found in Package Management, use ArcGIS Pro's Python Command Prompt to install them, e.g., conda install -c conda-forge python-igraph leidenalg NetworkCommDetPro in CMGIS-V3-Tools was updated on July 10,2024 Add spatial adjacency table into Florida on June 29,2024 The dataset and tool for ABM Crime Simulation were updated on August 3, 2023, The toolkits in CMGIS-V3-Tools was updated on August 3rd,2023. Report Issues on GitHub https://github.com/UrbanGISer/Computational-Methods-and-GIS-Applications-in-Social-Science Following the website of Fahui Wang : http://faculty.lsu.edu/fahui Contents Chapter 1. Getting Started with ArcGIS: Data Management and Basic Spatial Analysis Tools Case Study 1: Mapping and Analyzing Population Density Pattern in Baton Rouge, Louisiana Chapter 2. Measuring Distance and Travel Time and Analyzing Distance Decay Behavior Case Study 2A: Estimating Drive Time and Transit Time in Baton Rouge, Louisiana Case Study 2B: Analyzing Distance Decay Behavior for Hospitalization in Florida Chapter 3. Spatial Smoothing and Spatial Interpolation Case Study 3A: Mapping Place Names in Guangxi, China Case Study 3B: Area-Based Interpolations of Population in Baton Rouge, Louisiana Case Study 3C: Detecting Spatiotemporal Crime Hotspots in Baton Rouge, Louisiana Chapter 4. Delineating Functional Regions and Applications in Health Geography Case Study 4A: Defining Service Areas of Acute Hospitals in Baton Rouge, Louisiana Case Study 4B: Automated Delineation of Hospital Service Areas in Florida Chapter 5. GIS-Based Measures of Spatial Accessibility and Application in Examining Healthcare Disparity Case Study 5: Measuring Accessibility of Primary Care Physicians in Baton Rouge Chapter 6. Function Fittings by Regressions and Application in Analyzing Urban Density Patterns Case Study 6: Analyzing Population Density Patterns in Chicago Urban Area >Chapter 7. Principal Components, Factor and Cluster Analyses and Application in Social Area Analysis Case Study 7: Social Area Analysis in Beijing Chapter 8. Spatial Statistics and Applications in Cultural and Crime Geography Case Study 8A: Spatial Distribution and Clusters of Place Names in Yunnan, China Case Study 8B: Detecting Colocation Between Crime Incidents and Facilities Case Study 8C: Spatial Cluster and Regression Analyses of Homicide Patterns in Chicago Chapter 9. Regionalization Methods and Application in Analysis of Cancer Data Case Study 9: Constructing Geographical Areas for Mapping Cancer Rates in Louisiana Chapter 10. System of Linear Equations and Application of Garin-Lowry in Simulating Urban Population and Employment Patterns Case Study 10: Simulating Population and Service Employment Distributions in a Hypothetical City Chapter 11. Linear and Quadratic Programming and Applications in Examining Wasteful Commuting and Allocating Healthcare Providers Case Study 11A: Measuring Wasteful Commuting in Columbus, Ohio Case Study 11B: Location-Allocation Analysis of Hospitals in Rural China Chapter 12. Monte Carlo Method and Applications in Urban Population and Traffic Simulations Case Study 12A. Examining Zonal Effect on Urban Population Density Functions in Chicago by Monte Carlo Simulation Case Study 12B: Monte Carlo-Based Traffic Simulation in Baton Rouge, Louisiana Chapter 13. Agent-Based Model and Application in Crime Simulation Case Study 13: Agent-Based Crime Simulation in Baton Rouge, Louisiana Chapter 14. Spatiotemporal Big Data Analytics and Application in Urban Studies Case Study 14A: Exploring Taxi Trajectory in ArcGIS Case Study 14B: Identifying High Traffic Corridors and Destinations in Shanghai Dataset File Structure 1 BatonRouge Census.gdb BR.gdb 2A BatonRouge BR_Road.gdb Hosp_Address.csv TransitNetworkTemplate.xml BR_GTFS Google API Pro.tbx 2B Florida FL_HSA.gdb R_ArcGIS_Tools.tbx (RegressionR) 3A China_GX GX.gdb 3B BatonRouge BR.gdb 3C BatonRouge BRcrime R_ArcGIS_Tools.tbx (STKDE) 4A BatonRouge BRRoad.gdb 4B Florida FL_HSA.gdb HSA Delineation Pro.tbx Huff Model Pro.tbx FLplgnAdjAppend.csv 5 BRMSA BRMSA.gdb Accessibility Pro.tbx 6 Chicago ChiUrArea.gdb R_ArcGIS_Tools.tbx (RegressionR) 7 Beijing BJSA.gdb bjattr.csv R_ArcGIS_Tools.tbx (PCAandFA, BasicClustering) 8A Yunnan YN.gdb R_ArcGIS_Tools.tbx (SaTScanR) 8B Jiangsu JS.gdb 8C Chicago ChiCity.gdb cityattr.csv ...
The 2006 Second Edition TIGER/Line files are an extract of selected geographic and cartographic information from the Census TIGER database. The geographic coverage for a single TIGER/Line file is a county or statistical equivalent entity, with the coverage area based on the latest available governmental unit boundaries. The Census TIGER database represents a seamless national file with no overlaps or gaps between parts. However, each county-based TIGER/Line file is designed to stand alone as an independent data set or the files can be combined to cover the whole Nation. The 2006 Second Edition TIGER/Line files consist of line segments representing physical features and governmental and statistical boundaries. This shapefile represents the current Traffic Analysis Zones for Bernalillo County stored in the 2006 TIGER Second Edition dataset.
Our dataset provides detailed and precise insights into the business, commercial, and industrial aspects of any given area in the USA (Including Point of Interest (POI) Data and Foot Traffic. The dataset is divided into 150x150 sqm areas (geohash 7) and has over 50 variables. - Use it for different applications: Our combined dataset, which includes POI and foot traffic data, can be employed for various purposes. Different data teams use it to guide retailers and FMCG brands in site selection, fuel marketing intelligence, analyze trade areas, and assess company risk. Our dataset has also proven to be useful for real estate investment.- Get reliable data: Our datasets have been processed, enriched, and tested so your data team can use them more quickly and accurately.- Ideal for trainning ML models. The high quality of our geographic information layers results from more than seven years of work dedicated to the deep understanding and modeling of geospatial Big Data. Among the features that distinguished this dataset is the use of anonymized and user-compliant mobile device GPS location, enriched with other alternative and public data.- Easy to use: Our dataset is user-friendly and can be easily integrated to your current models. Also, we can deliver your data in different formats, like .csv, according to your analysis requirements. - Get personalized guidance: In addition to providing reliable datasets, we advise your analysts on their correct implementation.Our data scientists can guide your internal team on the optimal algorithms and models to get the most out of the information we provide (without compromising the security of your internal data).Answer questions like: - What places does my target user visit in a particular area? Which are the best areas to place a new POS?- What is the average yearly income of users in a particular area?- What is the influx of visits that my competition receives?- What is the volume of traffic surrounding my current POS?This dataset is useful for getting insights from industries like:- Retail & FMCG- Banking, Finance, and Investment- Car Dealerships- Real Estate- Convenience Stores- Pharma and medical laboratories- Restaurant chains and franchises- Clothing chains and franchisesOur dataset includes more than 50 variables, such as:- Number of pedestrians seen in the area.- Number of vehicles seen in the area.- Average speed of movement of the vehicles seen in the area.- Point of Interest (POIs) (in number and type) seen in the area (supermarkets, pharmacies, recreational locations, restaurants, offices, hotels, parking lots, wholesalers, financial services, pet services, shopping malls, among others). - Average yearly income range (anonymized and aggregated) of the devices seen in the area.Notes to better understand this dataset:- POI confidence means the average confidence of POIs in the area. In this case, POIs are any kind of location, such as a restaurant, a hotel, or a library. - Category confidences, for example"food_drinks_tobacco_retail_confidence" indicates how confident we are in the existence of food/drink/tobacco retail locations in the area. - We added predictions for The Home Depot and Lowe's Home Improvement stores in the dataset sample. These predictions were the result of a machine-learning model that was trained with the data. Knowing where the current stores are, we can find the most similar areas for new stores to open.How efficient is a Geohash?Geohash is a faster, cost-effective geofencing option that reduces input data load and provides actionable information. Its benefits include faster querying, reduced cost, minimal configuration, and ease of use.Geohash ranges from 1 to 12 characters. The dataset can be split into variable-size geohashes, with the default being geohash7 (150m x 150m).
The 2006 Second Edition TIGER/Line files are an extract of selected geographic and cartographic information from the Census TIGER database. The geographic coverage for a single TIGER/Line file is a county or statistical equivalent entity, with the coverage area based on the latest available governmental unit boundaries. The Census TIGER database represents a seamless national file with no overlaps or gaps between parts. However, each county-based TIGER/Line file is designed to stand alone as an independent data set or the files can be combined to cover the whole Nation. The 2006 Second Edition TIGER/Line files consist of line segments representing physical features and governmental and statistical boundaries. This shapefile represents the current Traffic Analysis Zones for Torrance County stored in the 2006 TIGER Second Edition dataset.
The 2006 Second Edition TIGER/Line files are an extract of selected geographic and cartographic information from the Census TIGER database. The geographic coverage for a single TIGER/Line file is a county or statistical equivalent entity, with the coverage area based on the latest available governmental unit boundaries. The Census TIGER database represents a seamless national file with no overlaps or gaps between parts. However, each county-based TIGER/Line file is designed to stand alone as an independent data set or the files can be combined to cover the whole Nation. The 2006 Second Edition TIGER/Line files consist of line segments representing physical features and governmental and statistical boundaries. This shapefile represents the current Traffic Analysis Zones for Otero County stored in the 2006 TIGER Second Edition dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Self-Driving Vehicles System: This model can be implemented in autonomous vehicles technology to identify traffic signs and signals, thus enabling the vehicle to make intelligent and safety-compliant decisions as per road conditions.
Smart Traffic Management: The model can be used in urban planning and traffic management systems to analyze, comprehend, and report traffic indications in real-time, aiding in better road traffic control and congestion avoidance.
Driving Assistance Applications: There is potential to integrate this model into GPS navigation systems or dedicated driving assistance applications. These apps could provide real-time traffic rule alerts to drivers, enhancing safety and rule adherence.
Road Condition Analysis: Use the model to collect road condition data based on signs for construction, slippery road, uneven road, etc. This critical information could support road maintenance planning by relevant authorities.
Traffic Rule Training Software: This model can be used in developing training software for beginner drivers or trucking companies. The software could explain and demonstrate various traffic rules, greatly improving the quality of road safety education.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ransomware is considered as a significant threat for most enterprises since past few years. In scenarios wherein users can access all files on a shared server, one infected host is capable of locking the access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2500 h of ‘not infected’ traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). This paper provides a validation of the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.
This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each one in a separated folder.
The folders are named NxSy where x is the number of 1-second interval per sample and y the sliding step in seconds.
Each folder (for example N10S10/) contains: - tree.py -> Python script with the Tree model. - ensemble.json -> JSON file with the information about the Ensemble model. - NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (1, 2 or 3). - N10S10.csv -> All samples used for training each model in this folder. It is in csv format for using in bigML application. - zeroDays.csv -> All zero-day samples used for testing each model in this folder. It is in csv format for using in bigML application. - userSamples_test -> All samples used for validating each model in this folder. It is in csv format for using in bigML application. - userSamples_train -> User samples used for training the models. - ransomware_train -> Ransomware samples used for training the models - scaler.scaler -> Standard Scaler from python library used for scale the samples. - zeroDays_notFiltered -> Folder with the zeroDay samples.
In the case of N30S30 folder, there is an additional folder (SMBv2SMBv3NFS) with the samples extracted from the SMBv2, SMBv3 and NFS traffic traces. There are more binaries than the ones presented in the article, but it is because some of them are not "unseen" binaries (the families are present in the training set).
The files containing samples (NxSy.csv, zeroDays.csv and userSamples_test.csv) are structured as follows: - Each line is one sample. - Each sample has 3*T features and the label (1 if it is 'infected' sample and 0 if it is not). - The features are separated by ',' because it is a csv file. - The last column is the label of the sample.
Additionally we have placed two pcap files in root directory. There are the traces used for compare both versions of SMB.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Applications for traffic analysis study.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Traffic Surveillance: The computer vision model can be applied to monitor real-time traffic situations. It can identify different vehicle types such as bikes, cars, trucks (ltvs), and any other unusual items on the road, which can help in traffic analysis and management.
Autonomous Vehicles: The model can be integrated into the AI systems of self-driving cars to help them recognize and respond appropriately to the various entities in the outdoor environment, like different types of vehicles, bicycles, and people.
Outdoor Security Systems: This model can enhance the capabilities of security cameras installed outdoors. With its ability to identify various outdoor objects and people, it can improve the effectiveness and responsiveness of such systems.
Pedestrian Safety Application: The model can be integrated into apps designed for enhancing pedestrian safety. These apps can alert users when a vehicle or a bike is approaching.
Smart City Planning: City planners can use the data generated by this model to understand traffic flow, pedestrian activities, and vehicle type distributions, supporting more informed infrastructure planning and development.
Android is one of the most used mobile operating systems worldwide. Due to its technological impact, its open-source code and the possibility of installing applications from third parties without any central control, Android has recently become a malware target. Even if it includes security mechanisms, the last news about malicious activities and Android´s vulnerabilities point to the importance of continuing the development of methods and frameworks to improve its security.
To prevent malware attacks, researches and developers have proposed different security solutions, applying static analysis, dynamic analysis, and artificial intelligence. Indeed, data science has become a promising area in cybersecurity, since analytical models based on data allow for the discovery of insights that can help to predict malicious activities.
In this work, we propose to consider some network layer features as the basis for machine learning models that can successfully detect malware applications, using open datasets from the research community.
This dataset is based on another dataset (DroidCollector) where you can get all the network traffic in pcap files, in our research we preprocessed the files in order to get network features that are illustrated in the next article:
López, C. C. U., Villarreal, J. S. D., Belalcazar, A. F. P., Cadavid, A. N., & Cely, J. G. D. (2018, May). Features to Detect Android Malware. In 2018 IEEE Colombian Conference on Communications and Computing (COLCOM) (pp. 1-6). IEEE.
Cao, D., Wang, S., Li, Q., Cheny, Z., Yan, Q., Peng, L., & Yang, B. (2016, August). DroidCollector: A High Performance Framework for High Quality Android Traffic Collection. In Trustcom/BigDataSE/I SPA, 2016 IEEE (pp. 1753-1758). IEEE
The 2006 Second Edition TIGER/Line files are an extract of selected geographic and cartographic information from the Census TIGER database. The geographic coverage for a single TIGER/Line file is a county or statistical equivalent entity, with the coverage area based on the latest available governmental unit boundaries. The Census TIGER database represents a seamless national file with no overlaps or gaps between parts. However, each county-based TIGER/Line file is designed to stand alone as an independent data set or the files can be combined to cover the whole Nation. The 2006 Second Edition TIGER/Line files consist of line segments representing physical features and governmental and statistical boundaries.
This shapefile represents the current Traffic Analysis Zones for Valencia County stored in the 2006 TIGER Second Edition dataset.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dear Scientist!This database contains data collected due to conducting study: "Analysis of the route safety of abnormal vehicle from the perspective of traffic parameters and infrastructure characteristics with the use of web technologies and machine learning" funded by National Science Centre Poland (Grant reference 2021/05/X/ST8/01669). The structure of files is arising from the aims of the study and numerous of sources needed to tailor suitable data possible to use as an input layer for neural network. You can find a following folders and files:1. Road_Parameters_Data (.csv) - which is data colleced by author before the study (2021). Here you can find information about technical quality and types of main roads located in Mazovia province (Poland). The source of data was Polish General Directorate for National Roads and Motorways. 2. Google_Maps_Data (.json) - here you can find the data, which was collected using the authors’ webservice created using the Python language, which downloaded the said data in the Distance Matrix API service on Google Maps at two-hour intervals from 25 May 2022 to 22 June 2022. The application retrieved the TRAFFIC FACTOR parameter, which was a ratio of actual time of travel divided by historical time of travel for particular roads.3. Geocoding_Roads_Data (.json) - in this folder you can find data gained from reverse geocoding approach based on geographical coordinates and the request parameter latlng were employed. As a result, Google Maps returned a response containing the postal code for the field types defined as postal_code and the name of the lowest possible level of the territorial unit for the field administrative_area_level. 4. Population_Density_Data (.csv) - here you can find date for territorial units, which were assigned to individual records were used to search the database of the Polish Postal Service using the authors' original web service written in the Python programming language. The records which contained a postal code were assigned the name of the municipality which corresponded to it. Finally, postal codes and names of territorial units were compared with the database of the Statistics Poland (GUS) containing information on population density for individual municipalities and assigned to existing records from the database.5. Roads_Incidents_Data (.json) - in this folder you can find a data collected by a webservice, which was programmed in the Python language and used for analysing the reported obstructions available on the website of the General Directorate for National Roads and Motorways. In the event of traffic obstruction emergence in the Mazovia Province, the application, on the basis of the number and kilometre of the road on which it occurred, could associate it later with appropriate records based on the links parameters. The data was colleced from 26 May to 22 June 2022.6. Weather_For_Roads_Data (.json) - here you can find the data concerning the weather conditions on the roads occurring at days of the study. To make this feasible, a webservice was programmed in the Python language, by means of which the selected items from the response returned by the www.timeanddate.com server for the corresponding input parameters were retrieved – geographical coordinates of the midpoint between the nodes of the particular roads. The data was colleced for day between 27 May and 22 June 2022.7. data_v_1 (.csv) - collected only data for road parameters8. data_v_2 (.csv) - collected data for road parameters + population density9. data_v_3 (.json) - collected data for road parameters + population density + traffic10. data_v_4 (.json) - collected data for road parameters + population density + traffic + weather + road incidents11. data_v_5 (.csv) - collected VALIDATED and cleaned data for road parameters + population density + traffic + weather + road incidents. At this stage, the road sections for which the parameter traffic factor was assessed to have been estimated incorrectly were eliminated. These were combinations for which the value of the traffic factor remained the same regardless the time of day or which took several of the same values during the course of the whole study. Moreover, it was also assumed that the final database should consist of road sections for traffic factor less than 1.2 constitute at least 10% of all results. Thus, the sections with no tendency to become congested and characterized by a small number of road traffic users were eliminated.Good luck with your research!Igor Betkier, PhD
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Network traffic datasets created by Single Flow Time Series Analysis
Datasets were created for the paper: Network Traffic Classification based on Single Flow Time Series Analysis -- Josef Koumar, Karel Hynek, Tomáš Čejka -- which was published at The 19th International Conference on Network and Service Management (CNSM) 2023. Please cite usage of our datasets as:
J. Koumar, K. Hynek and T. Čejka, "Network Traffic Classification Based on Single Flow Time Series Analysis," 2023 19th International Conference on Network and Service Management (CNSM), Niagara Falls, ON, Canada, 2023, pp. 1-7, doi: 10.23919/CNSM59352.2023.10327876.
This Zenodo repository contains 23 datasets created from 15 well-known published datasets which are cited in the table below. Each dataset contains 69 features created by Time Series Analysis of Single Flow Time Series. The detailed description of features from datasets is in the file: feature_description.pdf
In the following table is a description of each dataset file:
File name | Detection problem | Citation of original raw dataset |
botnet_binary.csv | Binary detection of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
botnet_multiclass.csv | Multi-class classification of botnet | S. García et al. An Empirical Comparison of Botnet Detection Methods. Computers & Security, 45:100–123, 2014. |
cryptomining_design.csv | Binary detection of cryptomining; the design part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
cryptomining_evaluation.csv | Binary detection of cryptomining; the evaluation part | Richard Plný et al. Datasets of Cryptomining Communication. Zenodo, October 2022 |
dns_malware.csv | Binary detection of malware DNS | Samaneh Mahdavifar et al. Classifying Malicious Domains using DNS Traffic Analysis. In DASC/PiCom/CBDCom/CyberSciTech 2021, pages 60–67. IEEE, 2021. |
doh_cic.csv | Binary detection of DoH |
Mohammadreza MontazeriShatoori et al. Detection of doh tunnels using time-series classification of encrypted traffic. In DASC/PiCom/CBDCom/CyberSciTech 2020, pages 63–70. IEEE, 2020 |
doh_real_world.csv | Binary detection of DoH | Kamil Jeřábek et al. Collection of datasets with DNS over HTTPS traffic. Data in Brief, 42:108310, 2022 |
dos.csv | Binary detection of DoS | Nickolaos Koroniotis et al. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst., 100:779–796, 2019. |
edge_iiot_binary.csv | Binary detection of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
edge_iiot_multiclass.csv | Multi-class classification of IoT malware | Mohamed Amine Ferrag et al. Edge-iiotset: A new comprehensive realistic cyber security dataset of iot and iiot applications: Centralized and federated learning, 2022. |
https_brute_force.csv | Binary detection of HTTPS Brute Force | Jan Luxemburk et al. HTTPS Brute-force dataset with extended network flows, November 2020 |
ids_cic_binary.csv | Binary detection of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018. |
ids_cic_multiclass.csv | Multi-class classification of intrusion in IDS | Iman Sharafaldin et al. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp, 1:108–116, 2018. |
ids_unsw_nb_15_binary.csv | Binary detection of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015. |
ids_unsw_nb_15_multiclass.csv | Multi-class classification of intrusion in IDS | Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 military communications and information systems conference (MilCIS), pages 1–6. IEEE, 2015. |
iot_23.csv | Binary detection of IoT malware | Sebastian Garcia et al. IoT-23: A labeled dataset with malicious and benign IoT network traffic, January 2020. More details here https://www.stratosphereips.org /datasets-iot23 |
ton_iot_binary.csv | Binary detection of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021 |
ton_iot_multiclass.csv | Multi-class classification of IoT malware | Nour Moustafa. A new distributed architecture for evaluating ai-based security systems at the edge: Network ton iot datasets. Sustainable Cities and Society, 72:102994, 2021 |
tor_binary.csv | Binary detection of TOR | Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017. |
tor_multiclass.csv | Multi-class classification of TOR | Arash Habibi Lashkari et al. Characterization of Tor Traffic using Time based Features. In ICISSP 2017, pages 253–262. SciTePress, 2017. |
vpn_iscx_binary.csv | Binary detection of VPN | Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016. |
vpn_iscx_multiclass.csv | Multi-class classification of VPN | Gerard Draper-Gil et al. Characterization of Encrypted and VPN Traffic Using Time-related. In ICISSP, pages 407–414, 2016. |
vpn_vnat_binary.csv | Binary detection of VPN | Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022 |
vpn_vnat_multiclass.csv | Multi-class classification of VPN | Steven Jorgensen et al. Extensible Machine Learning for Encrypted Network Traffic Application Labeling via Uncertainty Quantification. CoRR, abs/2205.05628, 2022 |