Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AIT Log Data Sets
This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from four independent testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by Landauer et al. (2020) [1]. Please refer to the paper for more detailed information on automatic testbed generation and cite it if the data is used for academic publications. In brief, each testbed simulates user accesses to a webserver that runs Horde Webmail and OkayCMS. The duration of the simulation is six days. On the fifth day (2020-03-04) two attacks are launched against each web server.
The archive AIT-LDS-v1_0.zip contains the directories "data" and "labels".
The data directory is structured as follows. Each directory mail..com contains the logs of one web server. Each directory user- contains the logs of one user host machine, where one or more users are simulated. Each file log.log in the user- directories contains the activity logs of one particular user.
Setup details of the web servers:
OS: Debian Stretch 9.11.6
Services:
Apache2
PHP7
Exim 4.89
Horde 5.2.22
OkayCMS 2.3.4
Suricata
ClamAV
MariaDB
Setup details of user machines:
OS: Ubuntu Bionic
Services:
Chromium
Firefox
User host machines are assigned to web servers in the following way:
mail.cup.com is accessed by users from host machines user-{0, 1, 2, 6}
mail.spiral.com is accessed by users from host machines user-{3, 5, 8}
mail.insect.com is accessed by users from host machines user-{4, 9}
mail.onion.com is accessed by users from host machines user-{7, 10}
The following attacks are launched against the web servers (different starting times for each web server, please check the labels for exact attack times):
Attack 1: multi-step attack with sequential execution of the following attacks:
nmap scan
nikto scan
smtp-user-enum tool for account enumeration
hydra brute force login
webshell upload through Horde exploit (CVE-2019-9858)
privilege escalation through Exim exploit (CVE-2019-10149)
Attack 2: webshell injection through malicious cookie (CVE-2019-16885)
Attacks are launched from the following user host machines. In each of the corresponding directories user-, logs of the attack execution are found in the file attackLog.txt:
user-6 attacks mail.cup.com
user-5 attacks mail.spiral.com
user-4 attacks mail.insect.com
user-7 attacks mail.onion.com
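The host assignments above can be captured in a small lookup structure, which is handy when separating traffic from the attacking host from traffic of the remaining hosts. The following is a minimal sketch; the names SERVER_USERS, ATTACKER_HOST, hosts_for and non_attacking_hosts are illustrative and not part of the dataset.

```python
# Host assignments and attacker hosts as listed in the description above.
SERVER_USERS = {
    "mail.cup.com": {0, 1, 2, 6},
    "mail.spiral.com": {3, 5, 8},
    "mail.insect.com": {4, 9},
    "mail.onion.com": {7, 10},
}
ATTACKER_HOST = {
    "mail.cup.com": 6,
    "mail.spiral.com": 5,
    "mail.insect.com": 4,
    "mail.onion.com": 7,
}


def hosts_for(server):
    """Return the user host directory names that access the given web server."""
    return [f"user-{i}" for i in sorted(SERVER_USERS[server])]


def non_attacking_hosts(server):
    """Return the user hosts of a server that do not launch the attacks."""
    attacker = ATTACKER_HOST[server]
    return [f"user-{i}" for i in sorted(SERVER_USERS[server]) if i != attacker]


print(hosts_for("mail.onion.com"))
print(non_attacking_hosts("mail.cup.com"))
```

Note that an attacking host may also simulate normal users, so its logs are not necessarily attack-only.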
The log data collected from the web servers includes
Apache access and error logs
syscall logs collected with the Linux audit daemon
suricata logs
exim logs
auth logs
daemon logs
mail logs
syslogs
user logs
Note that due to their large size, the audit/audit.log files of each server were compressed into a .zip archive. If these logs are needed for analysis, they must be unzipped first.
Labels are organized in the same directory structure as logs. Each file contains two labels for each log line separated by a comma, the first one based on the occurrence time, the second one based on similarity and ordering. Note that this does not guarantee correct labeling for all lines and that no manual corrections were conducted.
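As a rough illustration of how the two labels can be attached to log lines, the sketch below reads a log file and its label file in parallel. It assumes that the label file has exactly one row per log line, in the same order; the function name and the label values in the demo ("0", "attack") are hypothetical and should be checked against the actual label files.

```python
from itertools import zip_longest

_MISSING = object()


def pair_logs_with_labels(log_lines, label_lines):
    """Yield (line_number, log_line, time_label, similarity_label) tuples.

    Assumes line i of the label file belongs to line i of the log file.
    """
    pairs = zip_longest(log_lines, label_lines, fillvalue=_MISSING)
    for number, (log, label) in enumerate(pairs, start=1):
        if log is _MISSING or label is _MISSING:
            raise ValueError(f"log and label files differ in length at line {number}")
        # First label: based on occurrence time; second: similarity and ordering.
        time_label, similarity_label = (part.strip() for part in label.split(",", 1))
        yield number, log.rstrip("\n"), time_label, similarity_label


# In-memory demo with made-up log lines and label values.
logs = ["GET /index.php HTTP/1.1\n", "GET /shell.php HTTP/1.1\n"]
labels = ["0,0\n", "attack,attack\n"]
rows = list(pair_logs_with_labels(logs, labels))
for row in rows:
    print(row)
```

The function accepts any iterable of lines, so open file handles for a log file and its label file can be passed directly.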
Version history and related data sets:
AIT-LDS-v1.0: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.
AIT-LDS-v1.1: Removed carriage return of line endings in audit.log files.
AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).
If you use the dataset, please cite the following publication:
[1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317. [PDF]
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains server logs from the search engine of the library and information center of the Aristotle University of Thessaloniki in Greece (http://search.lib.auth.gr/). The search engine enables users to check the availability of books and other written works, and to search for digitized material and scientific publications. The server logs span an entire month, from March 1 to March 31, 2018, and consist of 4,091,155 requests, with an average of 131,973 requests per day and a standard deviation of 36,996.7 requests. In total, there are requests from 27,061 unique IP addresses and 3,441 unique user-agent strings. The server logs are in JSON format and are anonymized by masking the last 6 digits of the IP address and by hashing the last part of each requested URL (after the last /). The dataset also contains the processed form of the server logs as a labelled dataset of log entries grouped into sessions, along with their extracted features (simple semantic features). We make this dataset, the first one in this domain, publicly available in order to provide a common ground for testing web robot detection methods, as well as other methods that analyze server logs.
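The anonymization described above can be sketched as follows. This is only one plausible reading: the description does not state the mask character or the hash function, so the "x" mask and SHA-256 used here are assumptions, as are the function names and the example IP address and URL path.

```python
import hashlib


def mask_ip(ip, digits=6, mask_char="x"):
    """Replace the last `digits` digit characters of an IP address with a mask."""
    masked = []
    remaining = digits
    for ch in reversed(ip):
        if ch.isdigit() and remaining > 0:
            masked.append(mask_char)
            remaining -= 1
        else:
            masked.append(ch)
    return "".join(reversed(masked))


def hash_last_segment(url):
    """Hash everything after the last '/' of a URL (SHA-256 is an assumption)."""
    head, sep, tail = url.rpartition("/")
    if not sep or not tail:
        return url  # nothing after the last slash, leave unchanged
    digest = hashlib.sha256(tail.encode("utf-8")).hexdigest()
    return head + sep + digest


print(mask_ip("198.51.100.200"))
print(hash_last_segment("http://search.lib.auth.gr/Record/12345"))
```

Masking by digit count rather than by octet means that addresses with short trailing octets lose more than two octets, which is one reason the actual procedure may differ from this sketch.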
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from a testbed that was built at the Austrian Institute of Technology (AIT) following the approaches by [1], [2], and [3]. Please refer to these papers for more detailed information on the dataset and cite them if the data is used for academic publications. Unlike the related AIT-LDSv1.1, this dataset involves a more complex network structure, makes use of a different attack scenario, and collects log data from multiple hosts in the network. In brief, the testbed simulates a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise. After some days, two attack scenarios are launched against the network. Note that the AIT-LDSv2.0 extends this dataset with additional attack cases and variations of attack parameters.
The archives have the following structure. The gather directory contains the raw log data from each host in the network, as well as their system configurations. The labels directory contains the ground truth for those log files that are labeled. The processing directory contains configurations for the labeling procedure and the rules directory contains the labeling rules. Labeling of events that are related to the attacks is carried out with the Kyoushi Labeling Framework.
Each dataset contains traces of a specific attack scenario:
The log data collected from the servers includes
Note that only log files from affected servers are labeled. Label files and the directories in which they are located have the same name as their corresponding log file in the gather directory. Labels are in JSON format and comprise the following attributes: line (number of the line in the corresponding log file), labels (list of labels assigned to that log line), rules (names of the labeling rules matching that log line). Note that not all attack traces are labeled in all log files; please refer to the labeling rules if some labels are unclear.
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).
If you use the dataset, please cite the following publications:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AIT Log Data Sets
This repository contains synthetic log data suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. A detailed description of the dataset is available in [1]. The logs were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.
In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps is launched against the network. Log data is collected from all hosts and includes Apache access and error logs, authentication logs, DNS logs, VPN logs, audit logs, Suricata logs, network traffic packet captures, Horde logs, Exim logs, syslog, and system monitoring logs. Separate ground truth files are used to label events that are related to the attacks. Compared to the AIT-LDSv1.1, a more complex network and more diverse user behavior are simulated, and logs are collected from all hosts in the network. If you are only interested in network traffic analysis, we also provide the AIT-NDS containing the labeled netflows of the testbed networks. We also provide the AIT-ADS, an alert data set derived by forensically applying open-source intrusion detection systems on the log data.
The datasets in this repository have the following structure:
The gather directory contains all logs collected from the testbed. Logs collected from each host are located in gather//logs/.
The labels directory contains the ground truth of the dataset that indicates which events are related to attacks. The directory mirrors the structure of the gather directory so that each label file is located at the same path and has the same name as the corresponding log file. Each line in the label files references the log event corresponding to an attack by the line number counted from the beginning of the file ("line"), the labels assigned to the line that state the respective attack step ("labels"), and the labeling rules that assigned the labels ("rules"). An example is provided below.
The processing directory contains the source code that was used to generate the labels.
The rules directory contains the labeling rules.
The environment directory contains the source code that was used to deploy the testbed and run the simulation using the Kyoushi Testbed Environment.
The dataset.yml file specifies the start and end time of the simulation.
The following table summarizes relevant properties of the datasets:
fox
Simulation time: 2022-01-15 00:00 - 2022-01-20 00:00
Attack time: 2022-01-18 11:59 - 2022-01-18 13:15
Scan volume: High
Unpacked size: 26 GB
harrison
Simulation time: 2022-02-04 00:00 - 2022-02-09 00:00
Attack time: 2022-02-08 07:07 - 2022-02-08 08:38
Scan volume: High
Unpacked size: 27 GB
russellmitchell
Simulation time: 2022-01-21 00:00 - 2022-01-25 00:00
Attack time: 2022-01-24 03:01 - 2022-01-24 04:39
Scan volume: Low
Unpacked size: 14 GB
santos
Simulation time: 2022-01-14 00:00 - 2022-01-18 00:00
Attack time: 2022-01-17 11:15 - 2022-01-17 11:59
Scan volume: Low
Unpacked size: 17 GB
shaw
Simulation time: 2022-01-25 00:00 - 2022-01-31 00:00
Attack time: 2022-01-29 14:37 - 2022-01-29 15:21
Scan volume: Low
Data exfiltration is not visible in DNS logs
Unpacked size: 27 GB
wardbeck
Simulation time: 2022-01-19 00:00 - 2022-01-24 00:00
Attack time: 2022-01-23 12:10 - 2022-01-23 12:56
Scan volume: Low
Unpacked size: 26 GB
wheeler
Simulation time: 2022-01-26 00:00 - 2022-01-31 00:00
Attack time: 2022-01-30 07:35 - 2022-01-30 17:53
Scan volume: High
No password cracking in attack chain
Unpacked size: 30 GB
wilson
Simulation time: 2022-02-03 00:00 - 2022-02-09 00:00
Attack time: 2022-02-07 10:57 - 2022-02-07 11:49
Scan volume: High
Unpacked size: 39 GB
The following attacks are launched in the network:
Scans (nmap, WPScan, dirb)
Webshell upload (CVE-2020-24186)
Password cracking (John the Ripper)
Privilege escalation
Remote command execution
Data exfiltration (DNSteal)
Note that attack parameters and their execution orders vary in each dataset. Labeled log files are trimmed to the simulation time to ensure that their labels (which reference the related event by the line number in the file) are not misleading. Other log files, however, also contain log events generated before or after the simulation time and may therefore be affected by testbed setup or data collection. It is therefore recommended to only consider logs with timestamps within the simulation time for analysis.
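The recommendation to restrict analysis to the simulation time can be sketched for audit logs as follows. Linux audit records carry an epoch timestamp in the form msg=audit(SECONDS.MILLIS:SERIAL). The sketch assumes that the simulation times listed above are in UTC, which is not stated explicitly; the function names are illustrative. Original line numbers are kept in the output because the labels reference events by line number.

```python
import re
from datetime import datetime, timezone

# Matches the timestamp part of an audit record, e.g. msg=audit(1642999060.603:2226)
AUDIT_TS = re.compile(r"msg=audit\((\d+\.\d+):\d+\)")


def audit_time(line):
    """Return the record time as an aware UTC datetime, or None if not found."""
    match = AUDIT_TS.search(line)
    if match is None:
        return None
    return datetime.fromtimestamp(float(match.group(1)), tz=timezone.utc)


def within_simulation(lines, start, end):
    """Yield (original_line_number, line) for records with start <= time < end."""
    for number, line in enumerate(lines, start=1):
        ts = audit_time(line)
        if ts is not None and start <= ts < end:
            yield number, line.rstrip("\n")


# Simulation window of the russellmitchell dataset, assumed to be UTC.
START = datetime(2022, 1, 21, tzinfo=timezone.utc)
END = datetime(2022, 1, 25, tzinfo=timezone.utc)

sample = [
    "type=USER_AUTH msg=audit(1642999060.603:2226): pid=27950 uid=33 res=success\n",
    "type=USER_AUTH msg=audit(1643155200.000:9999): pid=1 uid=0 res=success\n",
]
kept = list(within_simulation(sample, START, END))
print(kept)
```

Other log types use different timestamp formats, so a separate parser would be needed for each of them.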
In the following, the structure of the labels is explained using the audit logs from the intranet server in the russellmitchell data set as an example. The first four labels in the labels/intranet_server/logs/audit/audit.log file are as follows:
{"line": 1860, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1861, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1862, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
{"line": 1863, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}
Each JSON object in this file assigns a label to one specific log line in the corresponding log file located at gather/intranet_server/logs/audit/audit.log. The field "line" in each JSON object specifies the line number of the respective event in the original log file, while the field "labels" comprises the corresponding labels. For example, the sample above states that lines 1860-1863 in the gather/intranet_server/logs/audit/audit.log file are labeled with "attacker_change_user" and "escalate", corresponding to the attack step where the attacker receives escalated privileges. Inspecting these lines shows that they indeed correspond to the user authenticating as root:
type=USER_AUTH msg=audit(1642999060.603:2226): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=USER_ACCT msg=audit(1642999060.603:2227): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:accounting acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=CRED_ACQ msg=audit(1642999060.615:2228): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:setcred acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
type=USER_START msg=audit(1642999060.627:2229): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:session_open acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'
The same applies to all other labels for this log file and all other log files. There are no labels for logs generated by "normal" (i.e., non-attack) behavior; instead, all log events that have no corresponding JSON object in one of the files from the labels directory, such as the lines 1-1859 in the example above, can be considered to be labeled as "normal". This means that in order to figure out the labels for the log data it is necessary to store the line numbers when processing the original logs from the gather directory and see if these line numbers also appear in the corresponding file in the labels directory.
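The lookup described above can be sketched in a few lines: the label file is loaded into a dictionary keyed by line number, and every log line without an entry is treated as "normal". The function names are illustrative; the field names "line" and "labels" are those of the label files.

```python
import json


def load_labels(label_lines):
    """Map line number -> list of labels, from a file of one JSON object per line."""
    labels = {}
    for raw in label_lines:
        raw = raw.strip()
        if not raw:
            continue
        entry = json.loads(raw)
        labels[entry["line"]] = entry["labels"]
    return labels


def label_log(log_lines, labels, normal="normal"):
    """Yield (line_number, labels, log_line); unlabeled lines get [normal]."""
    for number, line in enumerate(log_lines, start=1):
        yield number, labels.get(number, [normal]), line.rstrip("\n")


# Demo: a label for line 2 of a three-line stand-in log.
label_file = [
    '{"line": 2, "labels": ["attacker_change_user", "escalate"], '
    '"rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], '
    '"escalate": ["attacker.escalate.audit.su.login"]}}\n'
]
log_file = ["first event\n", "second event\n", "third event\n"]
labeled = list(label_log(log_file, load_labels(label_file)))
for row in labeled:
    print(row)
```

With real data, the two iterables would be open file handles for a file under labels/ and the file at the same relative path under gather/. Line numbers are counted from 1, matching the "line" field.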
Besides the attack labels, a general overview of the exact times when specific attack steps are launched is available in gather/attacker_0/logs/attacks.log. An enumeration of all hosts and their IP addresses is provided in processing/config/servers.yml. Moreover, configurations of each host are provided in gather//configs/ and gather//facts.json.
Version history:
AIT-LDS-v1.x: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.
AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).
If you use the dataset, please cite the following publications:
[1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner, M. Wurzenberger, and A. Rauber. "Maintainable Log Datasets for Evaluation of Intrusion Detection Systems". IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 3466-3482, doi: 10.1109/TDSC.2022.3201582. [PDF]
[2] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317. [PDF]
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises diverse logs from various sources, including cloud services, routers, switches, virtualization, network security appliances, authentication systems, DNS, operating systems, packet captures, proxy servers, servers, syslog data, and network data. The logs encompass a wide range of information such as traffic details, user activities, authentication events, DNS queries, network flows, security actions, and system events. By analyzing these logs collectively, users can gain insights into network patterns, anomalies, user authentication, cloud service usage, DNS traffic, network flows, security incidents, and system activities. The dataset is invaluable for network monitoring, performance analysis, anomaly detection, security investigations, and correlating events across the entire network infrastructure.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is the pre-processed web server log file of a commercial bank. The data comes from the bank's web server and covers accesses by web users from 2009 to 2012, that is, accesses to the bank website during and after the financial crisis. Unnecessary data saved by the web server was removed to keep the focus on the textual content of the website. Many variables were added to the original log file to make the analysis workable. To protect the privacy of website users, sensitive information in the log file was anonymized. The dataset offers a way to understand the behaviour of stakeholders during and after the crisis and how they comply with the Basel regulations.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by IlifyDev
Released under MIT
This dataset was created by AYousry
This dataset, containing a topological analysis of server logs, was created in a project that aimed to document the behavior of scientists on online platforms by making sense of the digital traces they generate while navigating. The repository contains three items: the Jupyter notebook that was run on the cluster to construct sessions from the large volume of Gallica user navigation data, the Jupyter notebook that contains the topological data analysis and cluster visualizations, and the final report of the project.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For the evaluation of OS fingerprinting methods, we need a dataset with the following requirements:
First, the dataset needs to be big enough to capture the variability of the data. In this case, we need many connections from different operating systems.
Second, the dataset needs to be annotated, which means that the corresponding operating system needs to be known for each network connection captured in the dataset. Therefore, we cannot just capture any network traffic for our dataset; we need to be able to determine the OS reliably.
To overcome these issues, we have decided to create the dataset from the traffic of several web servers at our university. This allows us to address the first issue by collecting traces from thousands of devices ranging from user computers and mobile phones to web crawlers and other servers. The ground truth values are obtained from the HTTP User-Agent, which resolves the second of the presented issues. Even though most traffic is encrypted, the User-Agent can be recovered from the web server logs that record every connection’s details. By correlating the IP address and timestamp of each log record to the captured traffic, we can add the ground truth to the dataset.
For this dataset, we have selected a cluster of five web servers that host 475 unique university domains for public websites. The monitoring point recording the traffic was placed at the backbone network connecting the university to the Internet.
The dataset used in this paper was collected from approximately 8 hours of university web traffic throughout a single workday. The logs were collected from Microsoft IIS web servers and converted from W3C extended logging format to JSON. The logs are referred to as web logs and are used to annotate the records generated from packet capture obtained by using a network probe tapped into the link to the Internet.
The entire dataset creation process consists of seven steps:
The packet capture was processed by the Flowmon flow exporter (https://www.flowmon.com) to obtain primary flow data containing information from TLS and HTTP protocols.
Additional statistical features were extracted using GoFlows flow exporter (https://github.com/CN-TU/go-flows).
The primary flows were filtered to remove incomplete records and network scans.
The flows from both exporters were merged together into records containing fields from both sources.
Web logs were filtered to cover the same time frame as the flow records.
Web logs were paired with the flow records based on shared properties (IP address, port, time).
The last step was to convert the User-Agent values into the operating system using a Python version of the open-source tool ua-parser (https://github.com/ua-parser/uap-python). We replaced the unstructured User-Agent string in the records with the resulting OS.
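The pairing of web logs with flow records (step six) can be sketched as a keyed lookup followed by a time check. This is a simplified illustration, not the procedure used to build the dataset: the record keys ("ip", "port", "time", "start", "end"), the tolerance parameter and the sample values are all assumptions.

```python
from collections import defaultdict


def pair_logs_with_flows(web_logs, flows, tolerance=1.0):
    """Pair each web log entry with the first flow that shares its IP and port
    and whose time span (widened by `tolerance` seconds) contains the log time."""
    index = defaultdict(list)
    for flow in flows:
        index[(flow["ip"], flow["port"])].append(flow)

    pairs = []
    for log in web_logs:
        for flow in index.get((log["ip"], log["port"]), []):
            if flow["start"] - tolerance <= log["time"] <= flow["end"] + tolerance:
                pairs.append((log, flow))
                break  # keep at most one flow per log entry
    return pairs


# Made-up records using documentation IP addresses.
flows = [
    {"ip": "203.0.113.5", "port": 50000, "start": 100.0, "end": 105.0},
    {"ip": "203.0.113.9", "port": 50001, "start": 200.0, "end": 201.0},
]
web_logs = [
    {"ip": "203.0.113.5", "port": 50000, "time": 104.2, "user_agent": "ExampleAgent/1.0"},
    {"ip": "203.0.113.9", "port": 50001, "time": 500.0, "user_agent": "ExampleAgent/1.0"},
]
paired = pair_logs_with_flows(web_logs, flows)
print(len(paired))
```

Indexing the flows by (IP, port) avoids comparing every log entry with every flow, which matters at the scale of a full day of traffic. In the last step, the User-Agent of each paired log entry would then be passed to ua-parser to obtain the OS.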
The collected and enriched flows contain 111 data fields that can be used as features for OS fingerprinting or any other data analyses. The fields grouped by their area are listed below:
basic flow properties - flow_ID;start;end;L3 PROTO;L4 PROTO;BYTES A;PACKETS A;SRC IP;DST IP;TCP flags A;SRC port;DST port;packetTotalCountforward;packetTotalCountbackward;flowDirection;flowEndReason;
IP parameters - IP ToS;maximumTTLforward;maximumTTLbackward;IPv4DontFragmentforward;IPv4DontFragmentbackward;
TCP parameters - TCP SYN Size;TCP Win Size;TCP SYN TTL;tcpTimestampFirstPacketbackward;tcpOptionWindowScaleforward;tcpOptionWindowScalebackward;tcpOptionSelectiveAckPermittedforward;tcpOptionSelectiveAckPermittedbackward;tcpOptionMaximumSegmentSizeforward;tcpOptionMaximumSegmentSizebackward;tcpOptionNoOperationforward;tcpOptionNoOperationbackward;synAckFlag;tcpTimestampFirstPacketforward;
HTTP - HTTP Request Host;URL;
User-agent - UA OS family;UA OS major;UA OS minor;UA OS patch;UA OS patch minor;
TLS - TLS_CONTENT_TYPE;TLS_HANDSHAKE_TYPE;TLS_SETUP_TIME;TLS_SERVER_VERSION;TLS_SERVER_RANDOM;TLS_SERVER_SESSION_ID;TLS_CIPHER_SUITE;TLS_ALPN;TLS_SNI;TLS_SNI_LENGTH;TLS_CLIENT_VERSION;TLS_CIPHER_SUITES;TLS_CLIENT_RANDOM;TLS_CLIENT_SESSION_ID;TLS_EXTENSION_TYPES;TLS_EXTENSION_LENGTHS;TLS_ELLIPTIC_CURVES;TLS_EC_POINT_FORMATS;TLS_CLIENT_KEY_LENGTH;TLS_ISSUER_CN;TLS_SUBJECT_CN;TLS_SUBJECT_ON;TLS_VALIDITY_NOT_BEFORE;TLS_VALIDITY_NOT_AFTER;TLS_SIGNATURE_ALG;TLS_PUBLIC_KEY_ALG;TLS_PUBLIC_KEY_LENGTH;TLS_JA3_FINGERPRINT;
Packet timings - NPM_CLIENT_NETWORK_TIME;NPM_SERVER_NETWORK_TIME;NPM_SERVER_RESPONSE_TIME;NPM_ROUND_TRIP_TIME;NPM_RESPONSE_TIMEOUTS_A;NPM_RESPONSE_TIMEOUTS_B;NPM_TCP_RETRANSMISSION_A;NPM_TCP_RETRANSMISSION_B;NPM_TCP_OUT_OF_ORDER_A;NPM_TCP_OUT_OF_ORDER_B;NPM_JITTER_DEV_A;NPM_JITTER_AVG_A;NPM_JITTER_MIN_A;NPM_JITTER_MAX_A;NPM_DELAY_DEV_A;NPM_DELAY_AVG_A;NPM_DELAY_MIN_A;NPM_DELAY_MAX_A;NPM_DELAY_HISTOGRAM_1_A;NPM_DELAY_HISTOGRAM_2_A;NPM_DELAY_HISTOGRAM_3_A;NPM_DELAY_HISTOGRAM_4_A;NPM_DELAY_HISTOGRAM_5_A;NPM_DELAY_HISTOGRAM_6_A;NPM_DELAY_HISTOGRAM_7_A;NPM_JITTER_DEV_B;NPM_JITTER_AVG_B;NPM_JITTER_MIN_B;NPM_JITTER_MAX_B;NPM_DELAY_DEV_B;NPM_DELAY_AVG_B;NPM_DELAY_MIN_B;NPM_DELAY_MAX_B;NPM_DELAY_HISTOGRAM_1_B;NPM_DELAY_HISTOGRAM_2_B;NPM_DELAY_HISTOGRAM_3_B;NPM_DELAY_HISTOGRAM_4_B;NPM_DELAY_HISTOGRAM_5_B;NPM_DELAY_HISTOGRAM_6_B;NPM_DELAY_HISTOGRAM_7_B;
ICMP - ICMP TYPE;
The details of OS distribution grouped by the OS family are summarized in the table below. The Other OS family contains records generated by web crawling bots that do not include OS information in the User-Agent.
OS family: Number of flows
Other: 42474
Windows: 40349
Android: 10290
iOS: 8840
Mac OS X: 5324
Linux: 1589
Ubuntu: 653
Fedora: 88
Chrome OS: 53
Symbian OS: 1
Slackware: 1
Linux Mint: 1
The UAS User Log is a server-based, digital logbook that is accessible through any web browser on internet-connected devices. It is an outcome of multi-state teams working together to develop a common protocol for unmanned aircraft systems (UAS, or drones) operation and to bring standardization to flight data collection for purposes such as research/production, spray application, and any other activity of interest. It relies on simple user interactions to develop a record of a UAS mission and can also serve to enhance flight and maintenance experience. The logbook provides options to interactively record the date, time and location of a flight, the make, model and registration information of the device, status of battery charge, type of flight (autonomous or manual), types of sensors used and data collected, safety precautions taken, weather during the flight and other related information. The dataset contains one resource, a pointer to the UAS User Log website (https://www.uasuserlog.org/), which generates a web form to log details of a specific UAS mission.
Library of Wroclaw University of Science and Technology scientific output (DONA database)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the AIT Alert Data Set (AIT-ADS), a collection of synthetic alerts suitable for evaluation of alert aggregation, alert correlation, alert filtering, and attack graph generation approaches. The alerts were forensically generated from the AIT Log Data Set V2 (AIT-LDSv2) and originate from three intrusion detection systems, namely Suricata, Wazuh, and AMiner. The data sets comprise eight scenarios, each of which has been targeted by a multi-step attack with attack steps such as scans, web application exploits, password cracking, remote command execution, privilege escalation, etc. Each scenario and attack chain has certain variations so that attack manifestations and resulting alert sequences vary in each scenario; this means that the data set allows developing and evaluating approaches that compute similarities of attack chains or merge them into meta-alerts. Since only few benchmark alert data sets are publicly available, the AIT-ADS was developed to address common issues in the research domain of multi-step attack analysis; specifically, the alert data set contains many false positives caused by normal user behavior (e.g., user login attempts or software updates), heterogeneous alert formats (although all alerts are in JSON format, their fields are different for each IDS), repeated executions of attacks according to an attack plan, collection of alerts from diverse log sources (application logs and network traffic) and all components in the network (mail server, web server, DNS, firewall, file share, etc.), and labels for attack phases. For more information on how this alert data set was generated, check out our paper accompanying this data set [1] or our GitHub repository. More information on the original log data set, including a detailed description of scenarios and attacks, can be found in [2].
The alert data set contains two files for each of the eight scenarios, and a file for their labels:
Beside false positive alerts, the alerts in the AIT-ADS correspond to the following attacks:
The total number of alerts involved in the data set is 2,655,821, of which 2,293,628 originate from Wazuh, 306,635 originate from Suricata, and 55,558 originate from AMiner. The numbers of alerts in each scenario are as follows. fox: 473,104; harrison: 593,948; russellmitchell: 45,544; santos: 130,779; shaw: 70,782; wardbeck: 91,257; wheeler: 616,161; wilson: 634,246.
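The alert counts above are consistent with each other: both the per-IDS and the per-scenario breakdown add up to the stated total. The short sketch below checks this and derives the share of each IDS; the variable names are illustrative.

```python
# Alert counts as stated in the description above.
TOTAL_ALERTS = 2_655_821
ALERTS_BY_IDS = {"Wazuh": 2_293_628, "Suricata": 306_635, "AMiner": 55_558}
ALERTS_BY_SCENARIO = {
    "fox": 473_104,
    "harrison": 593_948,
    "russellmitchell": 45_544,
    "santos": 130_779,
    "shaw": 70_782,
    "wardbeck": 91_257,
    "wheeler": 616_161,
    "wilson": 634_246,
}

# Both breakdowns must sum to the stated total.
assert sum(ALERTS_BY_IDS.values()) == TOTAL_ALERTS
assert sum(ALERTS_BY_SCENARIO.values()) == TOTAL_ALERTS

# Fraction of all alerts contributed by each IDS.
ids_share = {name: count / TOTAL_ALERTS for name, count in ALERTS_BY_IDS.items()}
for name, share in ids_share.items():
    print(f"{name}: {share:.1%}")
```

The scenarios with the most alerts (wilson, wheeler, harrison, fox) are the four listed with a high scan volume in the AIT-LDSv2 description.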
Acknowledgements: Partially funded by the European Defence Fund (EDF) projects AInception (101103385) and NEWSROOM (101121403), and the FFG project PRESENT (FO999899544). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. The European Union cannot be held responsible for them.
If you use the AIT-ADS, please cite the following publications:
[1] Landauer, M., Skopik, F., Wurzenberger, M. (2024): Introducing a New Alert Data Set for Multi-Step Attack Analysis. Proceedings of the 17th Cyber Security Experimentation and Test Workshop. [PDF]
[2] Landauer M., Skopik F., Frank M., Hotwagner W., Wurzenberger M., Rauber A. (2023): Maintainable Log Datasets for Evaluation of Intrusion Detection Systems. IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 3466-3482. [PDF]
https://doi.org/10.17026/fp39-0x58
We have collected the access logs for our university's web domain over a time span of 4.5 years. We now release the pre-processed web server log of a 3-month period for research into user navigation behavior. We preprocessed the data so that only successful GET requests of web pages by non-bot users are kept. The information that is included per entry is: unique user id, timestamp, GET request (URL), status code, the size of the object returned to the client, and the referrer URL. The resulting size of the 3-month collection is 9.6M page visits (190K unique URLs) by 744K unique visitors. The data collection allows for research on, among other things, user navigation, browsing and stopping behavior and web user clustering. Date Submitted: 2016-04-28
This dataset collects the daily requests sent to the Navitia route planner during 2022 and the first quarter of 2023 starting or ending in the Île-de-France region. The responses included in the dataset are not real journeys but routes proposed by the planner. Every route includes the start and end points as well as some transit points.
Since this information could jeopardize the privacy of the users, an anonymization procedure was applied. The original dataset was anonymized using the anonymizer module developed within the Mobidatalab project. The module includes several anonymization methods, privacy-preserving analysis methods, and methods to compute different utility and privacy metrics. It also provides a command line interface (CLI) that allows users to use all the module’s functionalities in a straightforward way. The module is also ready to be deployed on a server and to process requests through an API.
The dataset was anonymized using the “Time partition Microaggregation” method, a version of the well-known microaggregation method for very large mobility datasets where the application of microaggregation would not be feasible. A detailed description of the method can be found here.
Specific parameters:
K: 10
Interval: 3600 seconds
Clustering_method: MDAV
Aggregation method: Mean trajectory
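As a rough illustration of how the Interval and K parameters might interact, the sketch below groups trajectories into 3600-second windows by start time and computes a point-wise mean trajectory for one group. It is not the Mobidatalab anonymizer and does not implement MDAV clustering; the `(start_timestamp, points)` trajectory layout and both function names are assumptions made for this example.

```python
from collections import defaultdict

INTERVAL_SECONDS = 3600  # "Interval" parameter from the description
K = 10                   # "K" parameter: minimum cluster size (not enforced in this sketch)


def partition_by_time(trajectories, interval=INTERVAL_SECONDS):
    """Group trajectories by the time window containing their start time.

    Each trajectory is a hypothetical (start_timestamp, points) tuple,
    where points is a list of (lat, lon) pairs.
    """
    partitions = defaultdict(list)
    for start, points in trajectories:
        partitions[int(start // interval)].append((start, points))
    return dict(partitions)


def mean_trajectory(group):
    """Point-wise mean of trajectories that all have the same number of points."""
    n = len(group)
    length = len(group[0])
    return [
        (sum(t[i][0] for t in group) / n, sum(t[i][1] for t in group) / n)
        for i in range(length)
    ]
```

In a full pipeline, a clustering step such as MDAV would form groups of at least K trajectories inside each time window before any mean is taken; that step is deliberately left out here.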
Along with the daily datasets, origin-destination matrices are also included. These were computed using a tessellation corresponding to the French zip codes. A geojson file with the layout of the postal codes of the region is also included.
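An origin-destination matrix of this kind can be thought of as a count of trips per (origin zone, destination zone) pair. The sketch below shows that counting step only. The `zone_of` callback is hypothetical: in practice it would be a point-in-polygon lookup against the postal-code geojson, which is not shown here.

```python
from collections import Counter


def od_matrix(trips, zone_of):
    """Count trips per (origin zone, destination zone) pair.

    trips   -- iterable of (origin, destination) pairs
    zone_of -- hypothetical function mapping a location to its zone id
               (for example a postal code)
    """
    counts = Counter()
    for origin, destination in trips:
        counts[(zone_of(origin), zone_of(destination))] += 1
    return counts
```

A sparse `Counter` keyed by zone pairs avoids allocating a dense matrix when most zone pairs have no trips.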
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CTU Hornet 65 Niner is a dataset of 65 days of network traffic attacks captured in cloud servers used as honeypots to help understand how geography may impact the inflow of network attacks. The honeypots were placed in nine different geographical locations: Amsterdam, London, Frankfurt, San Francisco, New York, Singapore, Toronto, Bangalore, and Sydney. The data was captured from April 28th to July 1st, 2024.
The nine cloud servers were created and configured following identical instructions using Ansible [1] in DigitalOcean [2] cloud provider. The network capture was performed using the Zeek [3] network monitoring tool, which was installed on each cloud server. The cloud servers had only one service running (SSH on a non-standard port) and were fully dedicated to being used as a honeypot. No honeypot software was used in this dataset.
The dataset is composed of nine scenarios:
References:
[1] Ansible IT Automation Engine, https://www.ansible.com/. Accessed on 08/28/2024.
[2] DigitalOcean, https://www.digitalocean.com/. Accessed on 08/28/2024.
[3] Zeek Documentation, https://docs.zeek.org/en/master/index.html. Accessed on 08/28/2024.
Funding:
The authors acknowledge support by the Strategic Support for the Development of Security Research in the Czech Republic 2019-2025 (IMPAKT 1) program, by the Ministry of the Interior of the Czech Republic under No. VJ02010020, AI-Dojo: Multi-agent testbed for the research and testing of AI-driven cyber security technologies.
This resource is a metadata compilation for geothermal related resource records in data exchange content models submitted by Nevada as their deliverables under the AASG NGDS project (2010-2014) for inclusion in the NGDS Catalog. The content model defines the information that will be associated with a feature or observation type; the content model may be implemented in a variety of ways, but USGIN is currently implementing these interchange formats as GML Simple Features to be served by an OGC WFS. Data is available in an Excel workbook, ESRI Map Server, Web Map Service, and Web Feature Service with appropriate ResourceURLs listed for each record.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of subsystem initialization timings: the time taken to fully initialize each subsystem at the start of a round of /tg/station's Space Station 13.
This data is composed of the subsystem initialization timings from rounds that began from October 6 to December 5, 2020. The unit of time is seconds unless otherwise stated.
Thanks to /tg/station for providing the public log data that made this data set possible.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This web server, run by the West Virginia Geological and Economic Survey, contains data on:
WVGES Scanned Well Logs
WVGES Digitized Logs
WVGES Slabbed Core Photographs
WVGES Rock Information (includes Well Sample Descriptions)
WVGES E-Files (Scanned Plats, Completions, etc.)
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AIT Log Data Sets
This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from four independent testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by Landauer et al. (2020) [1]. Please refer to the paper for more detailed information on automatic testbed generation and cite it if the data is used for academic publications. In brief, each testbed simulates user accesses to a webserver that runs Horde Webmail and OkayCMS. The duration of the simulation is six days. On the fifth day (2020-03-04) two attacks are launched against each web server.
The archive AIT-LDS-v1_0.zip contains the directories "data" and "labels".
The data directory is structured as follows. Each directory mail..com contains the logs of one web server. Each directory user- contains the logs of one user host machine, where one or more users are simulated. Each file log.log in the user- directories contains the activity logs of one particular user.
Setup details of the web servers:
OS: Debian Stretch 9.11.6
Services:
Apache2
PHP7
Exim 4.89
Horde 5.2.22
OkayCMS 2.3.4
Suricata
ClamAV
MariaDB
Setup details of user machines:
OS: Ubuntu Bionic
Services:
Chromium
Firefox
User host machines are assigned to web servers in the following way:
mail.cup.com is accessed by users from host machines user-{0, 1, 2, 6}
mail.spiral.com is accessed by users from host machines user-{3, 5, 8}
mail.insect.com is accessed by users from host machines user-{4, 9}
mail.onion.com is accessed by users from host machines user-{7, 10}
The following attacks are launched against the web servers (different starting times for each web server, please check the labels for exact attack times):
Attack 1: multi-step attack with sequential execution of the following attacks:
nmap scan
nikto scan
smtp-user-enum tool for account enumeration
hydra brute force login
webshell upload through Horde exploit (CVE-2019-9858)
privilege escalation through Exim exploit (CVE-2019-10149)
Attack 2: webshell injection through malicious cookie (CVE-2019-16885)
Attacks are launched from the following user host machines. In each of the corresponding directories user-, logs of the attack execution are found in the file attackLog.txt:
user-6 attacks mail.cup.com
user-5 attacks mail.spiral.com
user-4 attacks mail.insect.com
user-7 attacks mail.onion.com
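For scripted analysis, the host assignments and attacker hosts listed above can be captured in two small lookup tables. The sketch below does that and derives the location of each attackLog.txt. The dictionary and function names are made up for this example, and placing the user-N directories directly under "data" is an assumption about the archive layout.

```python
from pathlib import Path

# User host machines per web server, as listed in the description.
SERVER_USERS = {
    "mail.cup.com": [0, 1, 2, 6],
    "mail.spiral.com": [3, 5, 8],
    "mail.insect.com": [4, 9],
    "mail.onion.com": [7, 10],
}

# User host machine that launches the attacks against each web server.
ATTACKER = {
    "mail.cup.com": 6,
    "mail.spiral.com": 5,
    "mail.insect.com": 4,
    "mail.onion.com": 7,
}


def server_for_user(user_id):
    """Return the web server accessed from host machine user-<user_id>."""
    for server, users in SERVER_USERS.items():
        if user_id in users:
            return server
    raise KeyError(f"unknown user host: user-{user_id}")


def attack_log_path(server, root="data"):
    """Path of the attacker's attackLog.txt (directory layout is assumed)."""
    return Path(root) / f"user-{ATTACKER[server]}" / "attackLog.txt"
```

Note that every attacker host is also one of the regular user hosts of the server it attacks, so its logs mix normal and attack activity.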
The log data collected from the web servers includes:
Apache access and error logs
syscall logs collected with the Linux audit daemon
suricata logs
exim logs
auth logs
daemon logs
mail logs
syslogs
user logs
Note that due to their large size, the audit/audit.log files of each server were compressed into a .zip archive. If these logs are needed for analysis, they must be unzipped first.
Labels are organized in the same directory structure as logs. Each file contains two labels for each log line separated by a comma, the first one based on the occurrence time, the second one based on similarity and ordering. Note that correct labeling is not guaranteed for all lines and that no manual corrections were made.
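A minimal sketch of how log lines might be paired with their labels, assuming that each label file has exactly one "label1,label2" line per log line and that both files are in the same order. The function name is hypothetical, and the label values are whatever the label files contain.

```python
def pair_labels(log_lines, label_lines):
    """Yield (log_line, time_label, order_label) for each log line.

    Assumes label line i belongs to log line i and has the form
    "<time-based label>,<similarity/ordering-based label>".
    """
    for log_line, label_line in zip(log_lines, label_lines):
        # Split on the first comma only: first label, then second label.
        time_label, order_label = label_line.rstrip("\r\n").split(",", 1)
        yield log_line.rstrip("\r\n"), time_label.strip(), order_label.strip()
```

The function accepts any iterables of strings, so two open file objects (one from "data", the matching one from "labels") can be passed directly. Because `zip` stops at the shorter input, comparing line counts beforehand is a reasonable sanity check.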
Version history and related data sets:
AIT-LDS-v1.0: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.
AIT-LDS-v1.1: Removed carriage return of line endings in audit.log files.
AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).
If you use the dataset, please cite the following publication:
[1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317.