CESNET-TLS22: A large dataset for fine-grained classification of TLS services
Cite
Luxemburk Jan; Čejka Tomáš (2024). CESNET-TLS22: A large dataset for fine-grained classification of TLS services [Dataset]. http://doi.org/10.5281/zenodo.7965515
Available download formats: zip, csv
Unique identifier: https://doi.org/10.5281/zenodo.7965515
Dataset updated: Feb 6, 2024
Dataset provided by: Zenodo (http://zenodo.org/)
Authors: Luxemburk Jan; Čejka Tomáš
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please refer to the original article for further data description: Jan Luxemburk et al. Fine-grained TLS services classification with reject option, Computer Networks, 2023, 109467, ISSN 1389-1286, https://doi.org/10.1016/j.comnet.2022.109467
We recommend using the CESNET DataZoo Python library, which simplifies working with large network traffic datasets. More information about the DataZoo project can be found in the GitHub repository https://github.com/CESNET/cesnet-datazoo.
The recent success and proliferation of machine learning and deep learning have provided powerful tools, which are also used for encrypted traffic analysis, classification, and threat detection. These methods, neural networks in particular, are often complex and require a huge corpus of training data. Moreover, because most network traffic is now encrypted, traditional deep packet inspection (DPI) solutions are becoming obsolete, and there is an urgent need for modern classification methods capable of analyzing encrypted traffic. These methods have to forgo the packet's opaque payload and focus on flow statistics and packet metadata sequences such as packet sizes, directions, and inter-arrival times. The classification can be further extended with the task of "rejecting" unknown traffic, i.e., traffic not seen during the training phase. This makes the problem more challenging, and neural networks offer superior performance in tackling it. Combining (1) the difficulty of classifying encrypted traffic while detecting unknown traffic with (2) neural networks' inherent need for large datasets makes the requirement for a rich, large, and up-to-date dataset even stronger.
Therefore, we created a large dataset spanning two weeks, consisting of 141 million network flows with 191 fine-grained service labels. The dataset is intended as a benchmark for identifying services in encrypted traffic, including the detection of unknown services.
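The "reject" task mentioned above can be illustrated with a minimal sketch. It uses plain confidence thresholding on softmax scores, a generic baseline chosen here for illustration and not necessarily the method of the original article. The label names, scores, and the 0.9 threshold are all hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_with_reject(logits, labels, threshold=0.9):
    """Return the top label, or None when the top probability is below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else None

# Hypothetical service labels and classifier scores, for illustration only.
labels = ["service-a", "service-b", "service-c"]
confident = classify_with_reject([8.0, 1.0, 0.5], labels)   # one clearly dominant score
uncertain = classify_with_reject([1.2, 1.0, 0.9], labels)   # scores close together
```

A flow whose best score is not clearly dominant is returned as `None`, which stands for "unknown service" in this sketch.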
Data capture
The data was captured in the flow monitoring infrastructure of the CESNET2 network. The capture ran for two weeks, from 4.10.2021 to 17.10.2021. The following table provides the per-week flow count, capture period, and uncompressed size:
Name          Uncompressed Size  Capture Period           Flows
W-2021-40     22 GB              4.10.2021 - 10.10.2021   73.2M
W-2021-41     20 GB              11.10.2021 - 17.10.2021  68.5M
CESNET-TLS22  42 GB              4.10.2021 - 17.10.2021   141.7M
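As a quick sanity check, the CESNET-TLS22 totals should equal the sum of the two weekly figures listed above. The short sketch below recomputes them; the dictionary layout and key names are only an illustration, not part of the dataset.

```python
# Per-week figures copied from the listing above (size in GB, flows in millions).
weeks = {
    "W-2021-40": {"size_gb": 22, "flows_millions": 73.2},
    "W-2021-41": {"size_gb": 20, "flows_millions": 68.5},
}

total_size_gb = sum(week["size_gb"] for week in weeks.values())
# Round to one decimal place to avoid floating-point noise in the sum.
total_flows_millions = round(sum(week["flows_millions"] for week in weeks.values()), 1)
```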
Dataset structure
The dataset flows are delivered in compressed CSV files, which contain one flow per row. For each flow data file, there is a JSON file with the number of saved flows per service. There is also the stats-week.json file aggregating flow counts of a whole week and the stats-dataset.json file aggregating flow counts for the entire dataset. The mapping between services and service providers is provided in the servicemap.csv file, which also includes SNI domains used for ground truth labeling. The following table describes flow data fields in CSV files:
ID: Unique identifier
BYTES: Number of transmitted bytes from client to server
BYTES_REV: Number of transmitted bytes from server to client
PACKETS: Number of packets transmitted from client to server
PACKETS_REV: Number of packets transmitted from server to client
DURATION: Duration of the flow in seconds
PPI: Packet metadata sequence in the format: [[inter-packet times], [packet directions], [packet sizes]]
PPI_LEN: Number of packets in the PPI sequence
PPI_DURATION: Duration of the PPI sequence in seconds
PPI_ROUNDTRIPS: Number of roundtrips in the PPI sequence
APP: Web service label
CATEGORY: Service category
TCP_FLAGS: TCP flags sent from client to server
TCP_FLAGS_REV: TCP flags sent from server to client
FLAG_CWR: Presence of the CWR flag
FLAG_CWR_REV: Presence of the CWR flag in the reverse direction
FLAG_ECE: Presence of the ECE flag
FLAG_ECE_REV: Presence of the ECE flag in the reverse direction
FLAG_URG: Presence of the URG flag
FLAG_URG_REV: Presence of the URG flag in the reverse direction
FLAG_ACK: Presence of the ACK flag
FLAG_ACK_REV: Presence of the ACK flag in the reverse direction
FLAG_PSH: Presence of the PSH flag
FLAG_PSH_REV: Presence of the PSH flag in the reverse direction
FLAG_RST: Presence of the RST flag
FLAG_RST_REV: Presence of the RST flag in the reverse direction
FLAG_SYN: Presence of the SYN flag
FLAG_SYN_REV: Presence of the SYN flag in the reverse direction
FLAG_FIN: Presence of the FIN flag
FLAG_FIN_REV: Presence of the FIN flag in the reverse direction
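The field list above can be made concrete with a small parsing sketch. It reads rows with Python's standard csv module and unpacks the PPI field into per-packet tuples. Several things here are assumptions rather than documented facts: that PPI is stored as a JSON-compatible string, that directions are encoded as 1 and -1, and every value in the sample, which is made up and uses only a subset of the columns.

```python
import csv
import io
import json

# Made-up two-row sample with a subset of the documented columns.
SAMPLE = '''ID,BYTES,BYTES_REV,PACKETS,PACKETS_REV,PPI,PPI_LEN,APP
1,700,5200,6,7,"[[0, 12, 3], [1, -1, 1], [517, 1400, 80]]",3,service-a
2,300,900,3,3,"[[0, 40], [1, -1], [250, 600]]",2,service-b
'''

def parse_flow(row):
    """Convert one CSV row (a dict of strings) into typed values."""
    # PPI holds three parallel lists: inter-packet times, directions, sizes.
    times, directions, sizes = json.loads(row["PPI"])
    if not (len(times) == len(directions) == len(sizes) == int(row["PPI_LEN"])):
        raise ValueError(f"inconsistent PPI in flow {row['ID']}")
    return {
        "id": row["ID"],
        "app": row["APP"],
        "bytes_total": int(row["BYTES"]) + int(row["BYTES_REV"]),
        # One (inter-packet time, direction, size) tuple per packet.
        "packets": list(zip(times, directions, sizes)),
    }

flows = [parse_flow(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
```

Checking the three PPI lists against PPI_LEN is a cheap way to catch truncated or mis-parsed rows before they reach a model.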
Link to other CESNET datasets
https://www.liberouter.org/technology-v2/tools-services-datasets/datasets/
https://github.com/CESNET/cesnet-datazoo
Please cite the original article:
@article{luxemburk_fine-grained-tls_2023,
  author = {Jan Luxemburk and Tomáš Čejka},
  title = {Fine-grained TLS services classification with reject option},
  journal = {Computer Networks},
  volume = {220},
  pages = {109467},
  year = {2023},
  issn = {1389-1286},
  doi = {https://doi.org/10.1016/j.comnet.2022.109467},
  url = {https://www.sciencedirect.com/science/article/pii/S1389128622005011}
}
2 scholarly articles cite this dataset (View in Google Scholar)
